hermanschaaf / chinese-ime

A JavaScript jQuery plugin for building Chinese keyboard input capabilities natively into a website
GNU Lesser General Public License v3.0

Copy all Chinese Characters from pinyin #5

Closed shellwe closed 7 years ago

shellwe commented 7 years ago

First of all, this isn't really an issue; it's a question, and I just didn't know where else to ask it. I am an American (so I don't know Chinese at all) and I am collaborating with a PhD student. Part of our project is to type some pinyin, grab all of the Chinese character outputs, and copy them to an array instead of displaying them, so that I can then perform additional functions on each of the potential characters. Is that possible to do with this API?

My alternative is Dragon Mapper, as that seems to be the most popular; I am just not sure if that gets the output I need.

tsroten commented 7 years ago

Hi @shellwe! Chinese IME takes Pinyin input and hits Google to retrieve all possible Chinese character outputs. Out of the box, I don't believe it supports what you want, but I'd imagine that if you're willing to code a tiny bit of JavaScript, you could easily modify it to get Google's response, and then do something with the array of characters.

I'm the creator of Dragon Mapper -- out of the box, Dragon Mapper doesn't do this either. If you want to handle this on the server side, rather than on the client side in JavaScript, you could use Dragon Mapper's data. Take a look at the _load_data function. It maps characters to possible Pinyin readings, but you could easily create a function that does the reverse. You'd probably want to strip accent marks from the Pinyin readings when doing this.

shellwe commented 7 years ago

Wow, a man of many talents! Basically my goal is to take the pinyin someone typed, display the potential Chinese characters, and run each of those through a search for images and a search for the audio. It is an educational tool to teach Chinese characters to people who know the pinyin. So with that, I don't know if handling it server side is necessary. I haven't explored TOO much into it, but I will just be taking the Chinese characters and running them through other external APIs, so unless there is something I am not foreseeing, this little web API may be a better solution. We are still in the exploratory phase, so there are many unknowns.

tsroten commented 7 years ago

That sounds like fun! Good luck!

By the way, while I'm the creator of Dragon Mapper, I didn't create Chinese IME -- that was all @hermanschaaf.

nahanil commented 7 years ago

This sounds like a fun project indeed. I'm veering off the topic of the original question, but for audio you may wanna check out shtooka. There are a few Chinese-language audio packs you can download, and I think between them most valid pinyin syllable/tone combinations are covered.

http://shtooka.net/download.php

shellwe commented 7 years ago

@texh, please feel free to offer input! At this time we are just exploring, and I am having trouble finding a free-to-use Chinese-character-to-audio converter. It's impressive that there is a free collection of 8500 audio clips in Chinese. I am not supposed to download stuff to my work computer, but I'll check it out when I get home. My question is whether it is indexed by actual Chinese characters or by pinyin; I would want the Chinese characters. Also, sadly, the documentation is in French, and while I could use Google Translate, I am not sure how well that would translate.

If I use this tool and can hit external APIs, I will have no need for a back end at all... which would be nice. That, and I have never used Python (I am a .NET developer normally), so Dragon Mapper, while powerful, sounds a little overwhelming at this point.

Any suggestions for a good image repository with an API I could use to pull images into my site? Basically, when the user clicks on a Chinese character, I want to search some repository of images and come back with results. Imgur had a promising API but their search results were kinda weak, whereas Google had some really solid search results, but as far as I could see they want you to view the results on their custom Google page, where I just want to pull images over to my site.

nahanil commented 7 years ago

Looks like all the linked shtooka files are flac, but if you replace it in the url with mp3 you get much smaller, and probably more useful, download files. ie http://download.shtooka.net/cmn-caen-tan_mp3.tar

There's a file in the archive called 'index.xml' that maps data to the ugly filenames. Many moons ago I wrote a crude perl script to rename the files using the pinyin listed in here - ie the below example now exists on my filesystem somewhere as āi.mp3. Converting to ai1.mp3 would be possible but tricky - I guess it really only gets complicated with compound words.

<file path="cmn-0004bca7.mp3">
  <tag 
    swac_alphaidx="āi"
    swac_coll_section="HSK niveau II"
    swac_pron_phon="āi"
    swac_tech_date="2009-07-08"
    swac_text="哎" />
</file>
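For single syllables, the āi to ai1 conversion can be sketched with Unicode normalization. This is only an illustration, not the Perl script mentioned above: the helper name `toNumberedPinyin` is made up here, and labelling unmarked syllables as tone 5 is an assumption about the naming scheme. Compound words would need to be split into syllables first, which is the tricky part.

```javascript
// Combining marks left behind by NFD decomposition, mapped to tone numbers.
const TONE_BY_MARK = {
  '\u0304': 1, // macron, as in ā
  '\u0301': 2, // acute, as in á
  '\u030C': 3, // caron, as in ǎ
  '\u0300': 4, // grave, as in à
};

// Hypothetical helper: converts ONE tone-marked syllable to numbered pinyin.
function toNumberedPinyin(syllable) {
  let tone = 5; // assumption: unmarked syllables are labelled as neutral tone 5
  const stripped = syllable
    .normalize('NFD')
    .replace(/[\u0304\u0301\u030C\u0300]/g, (mark) => {
      tone = TONE_BY_MARK[mark];
      return '';
    });
  // Recompose so that ü (u + diaeresis) stays a single character.
  return stripped.normalize('NFC') + tone;
}

console.log(toNumberedPinyin('āi')); // ai1
```

The result could then be used as the base of a file name such as ai1.mp3.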

The big archive appears to include most/all words from the HSK exam word lists. So you'll find longer stuff like 公共汽車 etc as well as single character words. Combined with the other downloads you should end up with most single valid syllables.

Having an audio file per Chinese character seems kinda counterintuitive for several reasons. Many Chinese characters share the same pronunciation, their meaning being derived from the context in which they're used. You could end up having dozens of audio files with the same actual content, just a different file name (ie, 愛, 璦, 礙 all appear at a glance to be pronounced the same).

Also, the pronunciation of some characters/words varies depending on usage. Take 行 for example. It can be pronounced xing2, hang2, and xing4 depending on where/how it's used. IMHO finding the audio based on the pinyin makes more sense.
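To make that concrete, here is a rough sketch of looking audio up by pinyin instead of by character. The `READINGS` table is hand-written sample data built from the examples above (with ai4 assumed as the shared reading), and the one-file-per-syllable naming like "ai4.mp3" is an assumption, not something any of the audio packs guarantees.

```javascript
// Hypothetical sample data: each character mapped to its possible numbered-pinyin readings.
const READINGS = {
  '愛': ['ai4'],
  '璦': ['ai4'],
  '礙': ['ai4'],
  '行': ['xing2', 'hang2', 'xing4'],
};

// Assumed naming scheme: one audio file per syllable, e.g. "ai4.mp3".
function audioFilesFor(character) {
  return (READINGS[character] || []).map((pinyin) => pinyin + '.mp3');
}

// Distinct files needed to cover a list of characters.
function distinctAudioFiles(characters) {
  return [...new Set(characters.flatMap(audioFilesFor))];
}

console.log(distinctAudioFiles(['愛', '璦', '礙'])); // [ 'ai4.mp3' ]
console.log(audioFilesFor('行')); // [ 'xing2.mp3', 'hang2.mp3', 'xing4.mp3' ]
```

Three characters end up sharing one file, and a character with several readings simply lists several files.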

All this being said, you could probably figure out a way to use Google's/another TTS API to handle the audio, but I doubt the quality would be as good as that of native speakers. There are probably also other audio packs out there with varying licenses.

As for images you could create an API of your own, or maybe even (mis)utilise something like http://placehold.it to generate them for you. Can't guarantee their system would support some less common/obscure characters (Though it does appear to support the Five Great Profanities). An example of this would be https://placehold.it/150/ffffff?text=你 Not sure what the terms/licensing is like for this API.

Just some more food for thought :)

tsroten commented 7 years ago

@texh is spot on about the audio -- it's far easier to handle it based on the Pinyin than the characters. It's less error prone, faster, more consistent, and handles storage/memory issues better.

Pinyin Toolkit is a popular plugin for Anki. It uses Pinyin-based audio files. They use the Creative Commons-licensed Mandarin Sounds from Chinese-Lessons. MDBG also uses the same audio files.

nahanil commented 7 years ago

@tsroten's suggestion looks much easier to use actually. All the files are already named more sanely (ie. zun1.mp3) so there's no data munging to do. I think the only reason that I left these behind originally was to handle multisyllabic words and phrases, namely the words from the HSK exam.

On the topic of imagery, Make Me A Hanzi has some animated SVG character/stroke images. I'm not sure how accurate they all are, but it might be something to keep in mind.

I'm using Hanzi Writer in a few places as well to show similar stroke order animations in web pages but these don't have pre-exported image files, and I'm not sure how easy it would be to pre-generate them.

shellwe commented 7 years ago

Hello all! Sorry for several months of radio silence. I am FINALLY getting back to this project. I have explained things poorly and since we last chatted I had discovered there is an swf that goes over what this project will look like... if any of you have shockwave I would be shocked (sorry for the pun) so I did a screencast of the video so you can have an accurate idea of what I am going for.

https://www.youtube.com/watch?v=C5NBOvF869k

For those of you who don't speak Chinese, the different characters are possibilities for the pinyin, and the items in each column are image results for the Chinese character at the top of the column. Here is a temp site until I get FTP access to the University server:

http://shawnwow.com/chineseCharacterHelpr/

So I brought the Chinese-ime project into my assignment and I have removed the option for traditional Chinese characters, as an intro to Chinese class won't need that. The audio part was also fortunately easy. I used the API from responsivevoice.org, so it was one line of code for the functionality and one line of code for the implementation. I felt very fortunate about that. As far as fonts go, I found this one, which just came out earlier this month:

https://typekit.com/fonts/source-han-serif-simplified-chinese

It looks nicer than the generic Chinese font, probably use a bolder weight. I am assuming the possible Chinese character output that is an unordered list in HTML came from JavaScript; does anyone know where that array is in the code? I could just grab the unordered list and turn it back into JS on keyup of the textbox... but it seems to be faster to just utilize the same array that produces the unordered list.

Lastly, and my biggest struggle so far, is finding an API to bring in the pictures that result from a search for each of the Chinese characters. I need about 3-5 images of each Chinese character below the image. I have been looking around for a few hours over the past week and have asked on various forums how I could do this. I don't know if it's because I write crazy long messages (like this) or because I mention Chinese characters so people think they can't help... but I haven't gotten many responses.

tsroten commented 7 years ago

What about one of these APIs?

shellwe commented 7 years ago

@tsroten Thank you for your response! I never thought to explore the Bing premium service. It was worth looking into, but the pricing for "Bing Image Search API" will get expensive fast: https://www.microsoft.com/cognitive-services/en-us/pricing Each search will pull 20 photos (4 images per character, assuming 5 characters). So even if I get the trial, it would be used up in 50 searches on our site. Even at $30 we would only get 500 searches, which isn't much; we would be charged another $30 to get 500 more, and if those are used up (which can happen in one classroom session) then the app can't be used for the rest of the month. The next step up is $300, which is a little pricey.

As far as I understand Google Custom Search, and I could be very mistaken, it is just a tool to search your own website, or it offers a Google search bar where the results go in an embedded Google custom page. This is my understanding of how it works (assuming you can have it default to images), and that won't allow for a tight column of photos: http://jsfiddle.net/gh/gist/library/pure/6130833/

I checked out the Flickr API as well but was not sure how to use it. I see there is a REST Request and REST Response and I went into each of those and it just has the code on how to get the request. My goal is to send them a Chinese character and get back 5 or so images (preferably a small version). I don't see enough documentation on the REST Request and Response on how to do that.

I dismissed the Python APIs for Flickr as they were unofficial and I was hoping to not have to learn Python to finish this project, but since I am really happy with the search results with Flickr and I suspect these unofficial APIs are free to use that may be the best avenue to explore further.

Thanks for your help so far!

Edit: found this in the list of APIs: https://www.flickr.com/services/api/flickr.photos.search.html If you go to the bottom it has an API explorer where you can put in values and it outputs a rest request at the bottom. Guess I will be exploring REST with JavaScript.

tsroten commented 7 years ago

I believe you can search the entire web with Google custom search. You can do an image-only search as well: http://stackoverflow.com/a/11206266

I think the Google pricing is 100 searches per day for free.

If price is a concern, you might want to look into scraping the results of a Google image search.

Whatever solution you choose, be sure to cache the results so you don't have to search again for the same character.
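A minimal sketch of that kind of cache, assuming the search is wrapped in a function that takes a character and returns a promise of results. The names `withCache` and `fakeSearch` are made up for illustration; the real search call would go where `fakeSearch` is.

```javascript
// Wraps any search function (assumed shape: character -> Promise of results)
// so that each character is only looked up once per page load.
function withCache(searchFn) {
  const cache = new Map();
  return function cachedSearch(character) {
    if (!cache.has(character)) {
      // Store the promise itself so concurrent calls share one request.
      cache.set(character, searchFn(character));
    }
    return cache.get(character);
  };
}

// Usage with a stand-in search function that just counts its calls.
let calls = 0;
const fakeSearch = (character) => {
  calls += 1;
  return Promise.resolve([character + '-image-1', character + '-image-2']);
};
const search = withCache(fakeSearch);
search('猫');
search('猫');
console.log(calls); // 1
```

This only lives in memory and also remembers failed requests, so a longer-lived version would probably store results in a database and drop entries whose promise rejected.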

shellwe commented 7 years ago

@tsroten

EDIT 3: Now that I have done a little research on how to use REST, I see that there are a great many tutorials on how to pull images from Flickr, and as far as I can tell there is no max for the number of images I can get for free. I feel I am abusing your kindness by asking stuff not even related to the Chinese-IME app. I am sure I will have questions when I am manipulating the functionality of the app over the next few weeks, and I'll save this for that. Thanks!

Thank you so much for your reply. I am very curious about your idea of an image-only search in Google. You mentioned scraping, so are you saying that I would let the search results appear on the search page and then use Python to scrape the top 10 or so off? Or are you saying that this service will deliver the images for my search in JSON, XML, or something?

I ask because right now the site is completely front end, and the professor I am handing off to wants any backend features to be in Python. I don't know Python, so I am hoping I can get at least what was in the video mockup, and he can hand it on to a professional Python developer when it gets funding, preferably a developer who knows Chinese so they would know the nuances of the language. We do intend to have a database that stores the image URLs, more for speed than for reducing the cost of searches, but if our only options are paid ones then that may be what we have to do. We also want to have features like blacklisting certain images that do not match the Chinese character.

I am also new to REST services in general. As mentioned, I saw this option: https://www.flickr.com/services/api/flickr.photos.search.html and I went to the API explorer link at the bottom: https://www.flickr.com/services/api/explore/flickr.photos.search and put in 猫 for the text field, 5 for the number of pictures per page, and 1 for the page, so I would only get back 5 images. This gives me this XML link: https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=4aa062c1cd2da6075e27b599f583665a&text=%E7%8C%AB&per_page=5&page=1&format=rest&api_sig=9f1beab3f0da8f620e197104eb81aa2a This brings up two questions.

How did it convert 猫 to "%E7%8C%AB"? My hope is that I can just reuse this URL over and over and just change out the text field. But to do so I need to know how to convert that text.

My next question would be what would I do with this request? It looks like it is giving ample information to be able to locate each image, but how would I use this XML to pull the images directly into my site?

Edit: scratch the text conversion part, it appears to be a simple URL encode. I was able to reproduce that result here: http://www.url-encode-decode.com/ I popped the Chinese character in and got my output.
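In JavaScript that URL encoding is what the built-in `encodeURIComponent` does, so the request URL can be rebuilt for any character. A rough sketch, with some assumptions: `buildFlickrSearchUrl` is a made-up helper name, the API key is a placeholder, and the `api_sig` parameter from the explorer URL is left out on the assumption that an unsigned call with your own key is accepted for public searches.

```javascript
// Builds a flickr.photos.search request URL like the one the API explorer produced.
// apiKey is a placeholder value; substitute your own key.
function buildFlickrSearchUrl(apiKey, text, perPage) {
  return 'https://api.flickr.com/services/rest/' +
    '?method=flickr.photos.search' +
    '&api_key=' + encodeURIComponent(apiKey) +
    '&text=' + encodeURIComponent(text) + // percent-encodes the UTF-8 bytes
    '&per_page=' + perPage +
    '&page=1&format=rest';
}

console.log(encodeURIComponent('猫')); // %E7%8C%AB
console.log(buildFlickrSearchUrl('YOUR_API_KEY', '猫', 5));
```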

Edit 2: I am looking at how photos are organized and they are structured by some components in the REST call so I am guessing I can just construct a URL based on the values from the meta information of each image.
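A sketch of that idea, assuming each photo entry in the search response carries `id`, `server`, and `secret` values and that image URLs follow the pattern `https://live.staticflickr.com/{server}/{id}_{secret}_{size}.jpg`. The pattern, the size suffix, and the sample values below are assumptions to check against Flickr's own URL documentation.

```javascript
// Assumed URL pattern: https://live.staticflickr.com/{server}/{id}_{secret}_{size}.jpg
// photo is one entry from the search response (only id, server, secret are used here).
function flickrPhotoUrl(photo, size) {
  return 'https://live.staticflickr.com/' + photo.server + '/' +
    photo.id + '_' + photo.secret + '_' + size + '.jpg';
}

// Made-up sample values, not a real photo.
const samplePhoto = { id: '12345', server: '678', secret: 'abcdef' };
console.log(flickrPhotoUrl(samplePhoto, 'q'));
// https://live.staticflickr.com/678/12345_abcdef_q.jpg
```

The resulting string can go straight into the `src` of an `img` element.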

shellwe commented 7 years ago

Okay, I got images pulling from the flickr API. Great idea @tsroten!

The last question I have actually DOES belong here. It was a question I asked before with a bunch of other questions. I assume the possible character outputs for the pinyin, before they get pushed to that unordered list, are stored in JS somewhere as an array. Does anyone know where that is stored? Knowing that may answer my other questions regarding manipulating this widget.

tsroten commented 7 years ago

They appear to be stored in WordDatabase(): https://github.com/hermanschaaf/chinese-ime/blob/master/jQuery.chineseIME.js#L24 https://github.com/hermanschaaf/chinese-ime/blob/master/jQuery.chineseIME.js#L427 https://github.com/hermanschaaf/chinese-ime/blob/master/jQuery.chineseIME.js#L55

shellwe commented 7 years ago

Thank you for that find!

I did have a lot more here but I thought that the application was sorting the output alphabetically but that was just Chrome sorting it. I was just trying to grab the current value in typed so I can get the Chinese Characters for that but grabbing the last entry only works if they have not typed it before. So if they type mao, then delete it and type li, then go back to mao the last value written would be li because it only records unique values so the last unique is li.

I think I am over complicating this and just need to grab the text inside the textarea and then where it matches in $.wordDatabase.words pull those Chinese characters.

Most likely, as long as I can grab the latest set of Chinese characters I can have something like this.

<div class="毛">
    <div class="chinese-character">毛</div>
    <div class="毛-audio">Button for Audio here</div>
    <div class="毛-images">Images code here</div>
</div>
<div class="猫">
    <div class="character">猫</div>
    <div class="audio-clip">Button for Audio here</div>
    <div class="flickr-results">Images code here</div>
</div>

...and so on for each character. I just have no idea if it is dangerous to make a class name a Chinese character. I could HTML Encode it if it is.
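As far as I know, non-ASCII class names are allowed in HTML, but one way to sidestep the question is to keep the class names fixed and carry the character in a data attribute. A rough sketch, where the function name `characterBlock` and the class names are made up, and the input is assumed to be a single Chinese character that needs no HTML escaping:

```javascript
// Builds the markup for one character column.
// Assumption: `character` is a single Chinese character, so no HTML escaping is done.
function characterBlock(character) {
  return [
    '<div class="character-block" data-character="' + character + '">',
    '    <div class="chinese-character">' + character + '</div>',
    '    <div class="audio-clip">Button for Audio here</div>',
    '    <div class="flickr-results">Images code here</div>',
    '</div>',
  ].join('\n');
}

console.log(['毛', '猫'].map(characterBlock).join('\n'));
```

With jQuery, a block could then be found with an attribute selector such as `$('[data-character="毛"]')`.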

Right now I have a static example of pulling the audio clips (the buttons) and the image pull from the flickr API. For every Character option it needs to have a button underneath it and the set of 5 images below that. So I am not sure if having the characters in .options ul would work. I would be happy just hiding .options and outputting the characters how I want.

The pieces are coming along well so far. http://shawnwow.com/chineseCharacterHelpr/

Thanks for all of your help!

tsroten commented 7 years ago

Great! You're making good progress! Good luck!

shellwe commented 7 years ago

EDIT: Sorry for the edit that basically deleted everything that was there. I was having an issue using a variable to show the Chinese characters based on the current input. I am using a keyup on the input box as my trigger and then pulling the contents of the dialog box from "#chinese-ime .typing".

var currentPinyin, currentChineseCharacterChoices;
$(".chinese-input").keyup(function () {
    // Read the pinyin the IME is currently showing, then look up its choices.
    currentPinyin = $("#chinese-ime .typing").text();
    var entry = $.wordDatabase.words[currentPinyin];
    // Guard against pinyin that has no entry yet.
    currentChineseCharacterChoices = entry ? entry.choices : [];
});

I am curious to find whatever is outputting the Chinese characters to the options class, because that would make it shorter still and more accurate. The keyup thing is a little buggy when populating currentPinyin because the text box writes to the .typing class and leaves the input box blank. You pointed out line 427 and I tried to put $.wordDatabase.getChoices(text); but it outputs an error. It's something I need to dig through with the Chrome debugger, but every time I try to drill in it goes right to the jQuery file.

shellwe commented 7 years ago

So the options box that pops up for mao has these 5 options: 毛猫冒帽茂

But in the wordDatabase for mao it has 10 options: 毛猫冒帽茂卯冇貌茅矛

Many look very similar. Assuming you are familiar with Chinese, would you say someone typing "Mao" would just need the first 5, or would all 10 be useful? If someone would just need the first 5, then I will need to figure out what is narrowing it down to those 5 and generating that. I have been trying to step through breakpoints but not having much luck yet.
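Either way, taking the first few choices from the stored list is a small step. A sketch under some assumptions: the database is shaped like `words[pinyin].choices` (as in the keyup code earlier in this thread), `choices` is either a string or an array of characters, and `getFirstChoices` is a made-up helper that is not part of the plugin.

```javascript
// Hypothetical helper: returns up to `limit` choices for a pinyin string,
// or all of them when no limit is given.
function getFirstChoices(words, pinyin, limit) {
  const entry = words[pinyin];
  if (!entry) return []; // nothing stored for this pinyin
  const all = Array.from(entry.choices); // works for a string or an array
  return limit === undefined ? all : all.slice(0, limit);
}

// Sample data copied from the mao example above.
const sampleWords = { mao: { choices: '毛猫冒帽茂卯冇貌茅矛' } };
console.log(getFirstChoices(sampleWords, 'mao', 5).join('')); // 毛猫冒帽茂
console.log(getFirstChoices(sampleWords, 'mao').length); // 10
```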

shellwe commented 7 years ago

To answer my previous question, I think I will just leave both, we will present the first 5 characters and if the user wants more then they can click a button and get all 10. I have a TON of styling still to do but I am rounding the corner on functionality. Just type something and then click the button.

http://shawnwow.com/chineseCharacterHelpr/

Now for my final piece, after the user clicks on the box I need to insert that character into the textbox and reset the app so they can type the next word so they can construct a sentence. Moving the character the user clicked to the output div and clearing the advanced results should be easy but clearing all of the Chinese IME stuff may be harder. I can clean out the contents in the box but I am sure the context of the box has already been pushed into variables everywhere.

I also need to look at it from a UX perspective as after the person scrolls down to click on a Chinese character to then scroll back up, they may not know to click on the box with the character they just added to add another. Hopefully I can have some concise instructions.

Really, if I could find and duplicate the call where after someone types a number and it chooses that word to put it in the input box as well as JS that would clear the IME so it is available for reuse I would be done with functionality.

Edit: interesting, I just found that if I toggle phonetic typing off and back on, it clears the search you currently have but leaves the Chinese characters that are already there. Now if I can just find where it stores the functionality to put in the text when you type a number, I am in good shape!

shellwe commented 7 years ago

Maybe I am just making this all too complicated by using this plugin. Has anyone noticed how it is pulling the Chinese characters when given the pinyin? Since I already have a piece that takes the object and displays the characters, if I can extract that piece I can simplify this a great deal.

I get it is pulling from here: https://www.google.com/inputtools/ based on line 186, I am just a little perplexed on how it is using it like an API.

shellwe commented 7 years ago

@tsroten, just wanna say thanks. We ended up not even using this tool and used some sort of hidden API through Google Input Tools. Here is the app we ended up with: http://shawnwow.com/chineseCharacterHelpr/

The delays are on purpose; the images appear after 9 seconds. The only API that did a global search of an image repository was Flickr (which didn't have accurate search results). We tried to find a good Chinese repository so the images would be more accurate, but we were not successful. We'll continue to use Flickr until something pops up.

Thanks for your help!