spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License

language independence ... #39

Closed redaktor closed 9 years ago

redaktor commented 9 years ago

Hey there, again: this is not an issue. The recent changes are totally fine, but let me explain which changes I have made (or am planning to make) in the fork https://github.com/redaktor/nlp_compromise and why.

As a European I would love this project to be as multilingual as possible ;) The changes have these goals:
• for contributing: be fully self-explanatory and readable
• for transport: be browser-friendly and thus very small
• completely separate data / language logic / project logic

Three new files in src/data:
• dictionary.js: the file where we can contribute multilingual words, in categories like those in the readme.
• dictionary_rules.js (tba): the file where we can contribute multilingual rules.
• _build.js: builds the data modules for one/some/all languages. This could also be the first grunt step.

It will generate or overwrite a folder like 'en'. Check it out: node _build -l
Basically I am planning to let the build script generate a customized client-side file and additional AMD browser modules. See for instance the module.exports lines: there are more than 30, but they are useless in the browser. Apart from that, I'd optimize the compression for the browser a bit further.
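A minimal sketch of what such a build step could look like, kept in memory so it runs without touching the file system. The dictionary layout (categories holding entries keyed by language code) and the names `buildLanguageData` and `toModuleSource` are assumptions for illustration, not the fork's actual code.

```javascript
// Hypothetical multilingual dictionary: category -> entries keyed by language code.
const dictionary = {
  nouns: [
    { en: 'person', de: 'Person' },
    { en: 'museum', de: 'Museum' },
    { en: 'hall' } // no German translation contributed yet
  ],
  adjectives: [
    { en: 'small', de: 'klein' }
  ]
};

// For each requested language, collect the words of every category.
// Entries without a translation for that language are skipped.
function buildLanguageData(dict, languages) {
  const out = {};
  for (const lang of languages) {
    out[lang] = {};
    for (const category of Object.keys(dict)) {
      out[lang][category] = dict[category]
        .filter((entry) => typeof entry[lang] === 'string')
        .map((entry) => entry[lang]);
    }
  }
  return out;
}

// Render one word list as the source text of a CommonJS data module.
// A real build script would write this string to e.g. <lang>/<category>.js.
function toModuleSource(words) {
  return 'module.exports = ' + JSON.stringify(words) + ';\n';
}

const built = buildLanguageData(dictionary, ['en', 'de']);
console.log(built.de.nouns);
console.log(toModuleSource(built.en.adjectives));
```

Keeping the data in one contributor-friendly file and generating small per-language modules from it is what separates "data" from "project logic" here; the same function could just as well emit AMD wrappers instead of CommonJS.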

I am also trying to reduce duplicates further. For example, in phrasal verbs: some verbs are already in the verb data module and some adjectives are already in the adj. module ...

When it is complete:
• each module, e.g. in /parent, should only be a little bit of 'project logic'.
• our database can autotranslate
• I could attach our web interface to encourage translators even more ;)

spencermountain commented 9 years ago

hey man. Love the ideas. I'm not ready to take the plunge into other languages yet, though kudos to the work you're doing. amd support, etc is great, if you keep the PRs small. The /adverbs_decline and stuff is also cool, have you added any more english data? my guess is that any new languages will require a brand new pos.js file, conjugation logic, etc.. then it begins to feel like they should have separate repos. I really don't know. I know some nlp libraries generalise into all languages, but they are probably more clever than me ;) Lots of good data in the wordnet european language forks. That's where I would go first. cheers man,

redaktor commented 9 years ago

But why should it be in other repos? The rewritten modules in the fork load languages from the /en, /de, etc. folders at runtime and fall back to english. But let's say nlp.noun('person', 'de').pluralize() should return 'personen' and not 'persons' ... This can either be solved elegant in

In general that would be the ultimate: nlp, dynamically multilingual in the browser ... For AMD haters the _build.js could generate a single file with the -l en,de,es option ...
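A minimal sketch of the runtime fallback idea, assuming language data is registered per language code. The `languages` table, its toy rules, and the `noun` function are hypothetical stand-ins for illustration; they are not the library's real API or data.

```javascript
// Hypothetical per-language data; in the fork this would be loaded from
// folders such as /en or /de. The rules here are deliberately naive toys.
const languages = {
  en: {
    irregularPlurals: {},
    pluralize: (word) => word + 's'
  },
  de: {
    irregularPlurals: { person: 'personen' },
    pluralize: (word) => word + 'en'
  }
};

// Own-property check, so keys like "constructor" never hit Object.prototype.
const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Hypothetical stand-in for nlp.noun(word, lang).
function noun(word, lang) {
  // Unknown or missing language codes fall back to English.
  const data = has(languages, lang) ? languages[lang] : languages.en;
  const key = word.toLowerCase();
  return {
    pluralize() {
      return has(data.irregularPlurals, key)
        ? data.irregularPlurals[key]
        : data.pluralize(key);
    }
  };
}

console.log(noun('person', 'de').pluralize()); // personen
console.log(noun('person', 'ko').pluralize()); // persons (English fallback)
console.log(noun('person').pluralize());       // persons
```

The call site stays the same for every language; only the data object that gets looked up changes, which is the point of keeping language logic apart from project logic.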

spencermountain commented 9 years ago

yeah, i understand what you're thinking. my concern here is that in most lines of code there are assumptions about english - noun.inflect() for example, doesn't make sense for korean nouns.. I'm afraid it's not been built to be very generic, so it will involve more than just swapping the lexicon.

So my recommendation would be to get data for one language, and see how it behaves before designing an api for it across all european languages. It'd be sweet to see a javascript nlp in the browser for another language.. I bet there's not much in nlp_compromise that will be helpful for it, in the long run..

... is that pessimistic of me? I think in N America we have a less .. cool and more ... terrified view of other languages ;) Make a spanish demo! cheers

spencermountain commented 9 years ago

or nlp_kompromiss ?

does german have similar parts of speech even? I know there's a third gender.. or gendered verbs or something. I really know nothing. let's make a repo and have fun.

spencermountain commented 9 years ago

...wanna? I couldn't do it without a fluent language-speaker. I speak some french, but badly. How's yours??

redaktor commented 9 years ago

> so it will involve more than just swapping the lexicon.

Yes - I am coding as fast as I can ;) I'll do german and we have translators for spanish and maybe one for danish/swedish. However, I thought we first need to make translating easy. Regarding the rules - I find multiple similar rules in german which just need to be slightly modified, e.g. the substitution value changes... My french is badder than yours ;)
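A minimal sketch of rules that share one shape across languages while only the pattern and substitution value differ. The rule format (`reg` / `repl`), the `applyRules` helper, and the tiny rule lists are assumptions for illustration, not the contents of dictionary_rules.js.

```javascript
// Hypothetical rule tables: same structure per language, different values.
// Each rule is { reg, repl }; the first matching rule wins.
const pluralRules = {
  en: [
    { reg: /([^aeiou])y$/, repl: '$1ies' },  // city -> cities
    { reg: /(s|x|ch|sh)$/, repl: '$1es' },   // box -> boxes
    { reg: /$/, repl: 's' }                  // default: dog -> dogs
  ],
  de: [
    { reg: /ung$/, repl: 'ungen' },          // Zeitung -> Zeitungen
    { reg: /in$/, repl: 'innen' },           // Lehrerin -> Lehrerinnen
    { reg: /e$/, repl: 'en' }                // Blume -> Blumen
  ]
};

// Language-independent "project logic": it knows nothing about any language.
function applyRules(word, rules) {
  for (const rule of rules) {
    if (rule.reg.test(word)) {
      return word.replace(rule.reg, rule.repl);
    }
  }
  return word; // no rule matched: leave the word unchanged
}

console.log(applyRules('city', pluralRules.en));    // cities
console.log(applyRules('Zeitung', pluralRules.de)); // Zeitungen
```

These few rules cover only a sliver of either language; the sketch is meant to show the shared structure, not a usable pluralizer.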

redaktor commented 9 years ago

> have you added any more english data?

just a few, like

// things that are often named after people
{ en: 'club', meta: {personBlacklist: ['en']} },
{ en: 'museum', meta: {personBlacklist: ['en']} },
{ en: 'hall', meta: {personBlacklist: ['en']} },
{ en: 'arena', meta: {personBlacklist: ['en']} },
{ en: 'stadium', meta: {personBlacklist: ['en']} },
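How the fork consumes `meta.personBlacklist` is not shown in this thread. One plausible reading, sketched below under that assumption, is that a capitalized phrase containing one of these words ("Lincoln Museum") should not be tagged as a person. The helper names `personBlacklistFor` and `couldBePerson` are hypothetical.

```javascript
// Entries copied from the comment above.
const entries = [
  { en: 'club', meta: { personBlacklist: ['en'] } },
  { en: 'museum', meta: { personBlacklist: ['en'] } },
  { en: 'hall', meta: { personBlacklist: ['en'] } },
  { en: 'arena', meta: { personBlacklist: ['en'] } },
  { en: 'stadium', meta: { personBlacklist: ['en'] } }
];

// Collect the blacklisted words for one language into a Set.
function personBlacklistFor(dict, lang) {
  return new Set(
    dict
      .filter((e) =>
        typeof e[lang] === 'string' &&
        e.meta &&
        Array.isArray(e.meta.personBlacklist) &&
        e.meta.personBlacklist.includes(lang))
      .map((e) => e[lang].toLowerCase())
  );
}

// Assumed usage: a phrase containing a blacklisted word is not a person name.
function couldBePerson(phrase, blacklist) {
  const words = phrase.toLowerCase().split(/\s+/).filter(Boolean);
  return words.length > 0 && !words.some((w) => blacklist.has(w));
}

const blacklist = personBlacklistFor(entries, 'en');
console.log(couldBePerson('Lincoln Museum', blacklist));  // false
console.log(couldBePerson('Abraham Lincoln', blacklist)); // true
```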