retextjs / retext

Natural language processor powered by plugins; part of the @unifiedjs collective
https://unifiedjs.com
MIT License

Is this English only? #14

Closed lfilho closed 10 years ago

lfilho commented 10 years ago

First of all, awesome and really well done project. Congrats!

Is it only for English? I mean, I saw that it supports latin chars, but would it work for say, separating syllables, or finding phonetics / rhymes in portuguese text?

Thank you!!

wooorm commented 10 years ago

Good question, and thanks :)! I’m not sure I have a good answer tho.

Portuguese: It should certainly work. So far I’ve built general Latin-script, English, and Dutch parsers. Portuguese could easily be added. Just take a look at the parse-english source code; it’s not that hard (parse-latin does all the work).

Syllables: Thing is, these parsers don’t support splitting syllables out of the box. You would need to write your own parser, currently, and add similar definitions to TextOM. Also: Syllables are part of spoken language, whereas retext and its underlying structures are more focused on written language.

Phonetics: Still, it’s certainly possible to find phonetics with retext. The metaphone (double metaphone especially) algorithm is actually pretty good with languages other than English, so you could give the retext-double-metaphone plugin a try?

Portuguese phonetics: A quick Google search also turned up a Portuguese metaphone implementation in C (albeit its focus seems to be on Brazilian Portuguese), but that could also be ported to JS, and in turn into a retext plugin, of course.

Rhyme: And last, rhyme is pretty hard to detect (I haven’t delved into it myself, so I’m actually not sure). A quick Google search on “nlp rhyme detection” was to no avail…

Did that help?

lfilho commented 10 years ago

Wow, thanks for the great, detailed answer. It sure did help!

Portuguese: I will take a look at the parse-english code. Thanks.

Syllables: I'm not from the NLP world so my knowledge in this area is very limited. I don't know how complex a syllable parser would be. Idea: what if we have a comprehensive dictionary, with words already split into syllables, and then have retext just look them up? What do you think of the idea, and would it be in sync with your goals for retext?

Phonetics: I had actually given the retext-double-metaphone plugin a try, but it didn't seem to match the phonetics well (in Portuguese, that is). I pasted the following text into the demo:

Tua casa. Sua gaza. 
Meu coração, seu contenção.
A cadeira é feita de madeira.
O corte é de grande porte.
A cama tem fama e ama.
O cão come pão pois tem fome.

Take the last line as an example: cão and pão are very similar, as are come and fome, but their colors there didn't match... Maybe I'm misreading something?

Portuguese phonetics: Brazilian Portuguese would be even better for me: I'm Brazilian ;) I will take a look at that link, thanks! I'm not a C programmer but I'll give it a shot.

Rhyme: Yeah... :( Again, I'm no expert, but my possibly naïve idea was to 1) separate the words into syllables first, 2) get the phonetics of the last syllable of each word, and 3) compare those. What do you think of that?

Thanks again for your time and this great project.

lfilho commented 10 years ago

Oh, by the way: I did have a look at NodeNatural before, but it also does not support the things I mentioned earlier, and I found the project a little disorganized, lacking modularity and documentation... And since I'm weak when it comes to NLP theory, that just made things harder for me ;)

That's one reason I liked your project. Well structured and documented. Congrats!

wooorm commented 10 years ago

Cool! Thanks a lot :+1: !

Syllables: That would work. Thing is, large comprehensive dictionaries tend to be, well, large. That would certainly work in Node (just like retext-pos), but not on the client (maybe in browser extensions).
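
The dictionary idea can be sketched in a few lines. This is a minimal sketch in plain JavaScript, not retext's API: `syllableDictionary` and `lookupSyllables` are hypothetical names, and the six entries are hand-written samples taken from the text above, not data from a real dictionary.

```javascript
// Hypothetical, hand-written sample entries. A real dictionary would be
// loaded from a (large) data file, which is why this suits Node better
// than the browser.
const syllableDictionary = new Map([
  ['cadeira', ['ca', 'dei', 'ra']],
  ['madeira', ['ma', 'dei', 'ra']],
  ['corte', ['cor', 'te']],
  ['porte', ['por', 'te']],
  ['come', ['co', 'me']],
  ['fome', ['fo', 'me']]
])

// Look a word up, ignoring case. Unknown words return null instead of
// a guessed split.
function lookupSyllables(word) {
  const key = word.normalize('NFC').toLowerCase()
  return syllableDictionary.get(key) || null
}

console.log(lookupSyllables('Cadeira')) // [ 'ca', 'dei', 'ra' ]
console.log(lookupSyllables('xyz')) // null
```

Returning `null` for unknown words keeps the lookup honest: anything missing from the dictionary is reported as missing, and a fallback parser could be tried separately.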

Phonetics: You’re right. Double metaphone does not take vowels into account (thus "so" and "sa" get the same phonetics). I don’t know how cão and pão are pronounced in Portuguese, but do you think people would misspell a word as cão if they heard it as pão? Because that’s what double-metaphone tries to accomplish.
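
To illustrate why a vowel-less code cannot capture rhyme, here is a toy key. It is not Double Metaphone and not part of any retext plugin; `consonantKey` is a made-up helper that only strips accents and drops vowels after the first letter.

```javascript
// Toy key, NOT Double Metaphone: strip diacritics, uppercase, keep the
// first letter, and drop every later vowel.
function consonantKey(word) {
  const plain = word
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '') // remove combining accents
    .toUpperCase()
  return plain.charAt(0) + plain.slice(1).replace(/[AEIOU]/g, '')
}

console.log(consonantKey('so'), consonantKey('sa')) // S S
console.log(consonantKey('come'), consonantKey('fome')) // CM FM
console.log(consonantKey('cão'), consonantKey('pão')) // C P
```

Here "so" and "sa" share a key, while cão and pão do not: the key is built from the start of the word, so two words that differ only in their first consonant never match, however similar their endings are.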

I suggest two things:

Rhyme: Yeah, that’s the most viable route I think, but as mentioned, you need a phonetic algorithm which takes vowels into account (Metaphone 3, which is paid software, seems to do that). If you find such an algorithm, it might even be possible to just compare the endings of two values (e.g., cão, pão) and see if they match? Not sure tho!
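
Comparing endings could look roughly like this. It is a sketch under assumptions: `sharedEnding` and `mayRhyme` are hypothetical helpers, the two-character threshold is arbitrary, and the comparison runs on plain spelling as a stand-in for the vowel-aware phonetic values mentioned above.

```javascript
// Longest shared ending of two words, compared character by character
// from the end (case-insensitive, NFC-normalized so "ã" is one character).
function sharedEnding(a, b) {
  const x = Array.from(a.normalize('NFC').toLowerCase())
  const y = Array.from(b.normalize('NFC').toLowerCase())
  let n = 0
  while (
    n < x.length &&
    n < y.length &&
    x[x.length - 1 - n] === y[y.length - 1 - n]
  ) {
    n++
  }
  return x.slice(x.length - n).join('')
}

// Assumption: treat a pair as a candidate rhyme when at least
// `minLength` trailing characters match.
function mayRhyme(a, b, minLength = 2) {
  return Array.from(sharedEnding(a, b)).length >= minLength
}

console.log(sharedEnding('come', 'fome')) // ome
console.log(sharedEnding('cão', 'pão')) // ão
console.log(mayRhyme('corte', 'porte')) // true
console.log(mayRhyme('casa', 'gaza')) // false
```

Note that casa and gaza, paired in the sample text, share only a final "a" in spelling and so fail this check, which is the kind of case where phonetic values would be the better input.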

Did that help? Can I close this? Good luck! Let me know if you need any more help :smile:

lfilho commented 10 years ago

Yeap, that helped a lot. Thanks one more time.

Indeed, my goal was not to use such things in the browser, but rather to consume the results via REST...

Cheers. See ya!

wooorm commented 10 years ago

Cool! Creating a web API is something I really want to do in the future, would be awesome!

lfilho commented 10 years ago

Nice. When I have more time to go back to this playground project, I will definitely come back here (already starred) and try to contribute back with whatever I can, of course! All the best!