plasticityai / magnitude

A fast, efficient universal vector embedding utility package.
MIT License

Other languages #6

Closed argestes closed 6 years ago

argestes commented 6 years ago

Hello. First of all thanks for your effort. This is a pretty impressive library.

I'm not very experienced with NLP, but I'm currently working on an NLP task that involves classifying text messages without labeled data. The project needs to process Turkish sentences. Can I use this library to train on Turkish documents? If so, could you provide an example or guide me through the process? Thanks.

AjayP13 commented 6 years ago

Hi,

This is absolutely possible. You can download pre-trained Turkish vectors from Facebook (who trained their fastText vectors on Common Crawl and Wikipedia) here: https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.tr.300.vec.gz

Next, decompress the .gz file to get a .vec file. You can then convert it to the Magnitude format using the instructions found here: File Format and Converter.
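The two steps above can be sketched in Python as follows. This is a minimal sketch, not an official script: the helper names (`gunzip`, `converter_command`, `convert`) and the output file name `cc.tr.300.magnitude` are made up for illustration. The `python -m pymagnitude.converter -i ... -o ...` invocation is the one described in the README and assumes `pymagnitude` is installed.

```python
import gzip
import shutil
import subprocess
import sys
from pathlib import Path
from typing import List


def gunzip(src: Path, dst: Path) -> Path:
    """Stream-decompress a .gz file to dst without loading it all into memory."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def converter_command(vec_path: Path, magnitude_path: Path) -> List[str]:
    """Build the converter invocation (assumes `pip install pymagnitude`)."""
    return [
        sys.executable, "-m", "pymagnitude.converter",
        "-i", str(vec_path),
        "-o", str(magnitude_path),
    ]


def convert(gz_path: Path) -> Path:
    """Hypothetical end-to-end helper: foo.vec.gz -> foo.vec -> foo.magnitude."""
    vec_path = gunzip(gz_path, gz_path.with_suffix(""))  # drops the ".gz" suffix
    magnitude_path = vec_path.with_suffix(".magnitude")
    subprocess.run(converter_command(vec_path, magnitude_path), check=True)
    return magnitude_path
```

Calling `convert(Path("cc.tr.300.vec.gz"))` after downloading the file should leave a `cc.tr.300.magnitude` file next to it, which can then be opened with `Magnitude("cc.tr.300.magnitude")` and queried with `.query("kitap")`. The conversion of a full 300-dimensional model can take a while.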

If you want to train your own Turkish models, you can follow the Gensim tutorial found here and then convert the resulting file to Magnitude as well.
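For the train-your-own route, a rough sketch with Gensim might look like the following. It assumes a plain-text corpus with one sentence per line; the function names, file names, and hyperparameters are illustrative assumptions, not recommendations. The `vector_size` argument is the Gensim 4 name (older releases call it `size`).

```python
from pathlib import Path
from typing import Iterator, List


def read_sentences(corpus_path: Path) -> Iterator[List[str]]:
    """Yield one whitespace-tokenized sentence per non-empty line.

    Lowercasing is deliberately skipped: Python's str.lower() is not
    Turkish-aware (it maps "I" to "i" rather than to the dotless "ı").
    """
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if tokens:
                yield tokens


def train_and_export(corpus_path: Path, out_path: Path) -> None:
    """Train word2vec on the corpus and save it in the text .vec-style format."""
    # Imported lazily so the tokenizer above works without Gensim installed.
    from gensim.models import Word2Vec

    sentences = list(read_sentences(corpus_path))
    # Hyperparameters below are placeholder assumptions; tune them for your data.
    model = Word2Vec(sentences, vector_size=300, window=5, min_count=5, workers=4)
    # Text word2vec format: a "count dim" header line, then one word per line.
    model.wv.save_word2vec_format(str(out_path), binary=False)
```

After `train_and_export(Path("turkish_corpus.txt"), Path("turkish.vec"))`, the resulting file can be passed to the Magnitude converter in the same way as the pre-trained vectors.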

Since Turkish is written with an alphabet, Magnitude's out-of-vocabulary lookups should still work for Turkish.
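To illustrate why an alphabetic script matters here: Magnitude's README describes its out-of-vocabulary handling as based on character n-grams. The snippet below is not Magnitude's implementation, only a small standalone sketch (with made-up helper names and n-gram sizes) showing that inflected Turkish forms share many character n-grams with their stems, which is what subword-based approaches exploit.

```python
from typing import List


def char_ngrams(word: str, n_min: int = 3, n_max: int = 5) -> List[str]:
    """Return all character n-grams of the word, padded with < and > markers."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def shared_ngram_ratio(a: str, b: str) -> float:
    """Jaccard overlap between the character n-gram sets of two words."""
    grams_a, grams_b = set(char_ngrams(a)), set(char_ngrams(b))
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# "kitaplar" (books) overlaps with "kitap" (book); "deniz" (sea) does not.
related = shared_ngram_ratio("kitap", "kitaplar")
unrelated = shared_ngram_ratio("kitap", "deniz")
```

Here `related` comes out larger than `unrelated`, so an unseen inflected form can still be placed near words it resembles on the surface.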

AjayP13 commented 6 years ago

The instructions for using Magnitude with other languages are now documented in the README.