Can't train model for Polish language

hgrif / wiki-word2vec

Train a gensim word2vec model on Wikipedia.

MIT License

Can't train model for Polish language #3

Open Manfed opened 6 years ago

Manfed commented 6 years ago

Hi, I tried to run the code for the Polish language, but after downloading the data from Wikipedia I got an error:

python process_wiki.py ./data/pl/plwiki-latest-pages-articles.xml.bz2 ./data/pl/wiki.pl.text
2018-01-16 09:51:36,820: INFO: Running process_wiki.py ./data/pl/plwiki-latest-pages-articles.xml.bz2 ./data/pl/wiki.pl.text
Traceback (most recent call last):
  File "process_wiki.py", line 43, in <module>
    output.write(" ".join(text) + "\n")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0119' in position 20: ordinal not in range(128)
make: *** [data/pl/wiki.pl.text] Error 1

Is there any way to train the model on Unicode characters, not just ASCII?

hgrif commented 6 years ago

It works for me on Python 3, could you try that?

Manfed commented 6 years ago

I've modified the process_wiki.py file: I changed line 43 to output.write(" ".join(unicode(text)) + "\n"). After that, the processing started without errors.
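
For reference, one way to avoid the implicit ASCII encoding on either Python version is to open the output file with an explicit UTF-8 encoding. The sketch below is a minimal illustration, not the repository's actual code: `write_articles`, the temporary output path, and the sample tokens are hypothetical, and it assumes each article arrives as a list of unicode tokens.

```python
import io
import os
import tempfile


def write_articles(articles, path):
    """Write one article per line; each article is a list of unicode tokens."""
    # io.open behaves the same on Python 2.7 and Python 3, and the explicit
    # encoding means the result does not depend on the interpreter's default.
    with io.open(path, "w", encoding="utf-8") as output:
        for tokens in articles:
            output.write(u" ".join(tokens) + u"\n")


# Hypothetical sample data; \u0119 is the character from the traceback.
articles = [[u"j\u0119zyk", u"polski"], [u"s\u0142owo"]]
path = os.path.join(tempfile.mkdtemp(), "wiki.pl.text")
write_articles(articles, path)

with io.open(path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
```

Joining the token list itself, rather than a string built from it, keeps each word intact in the output file.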

hgrif commented 6 years ago

Good to hear! Just to check: are you running Python 2 or 3?

Manfed commented 6 years ago

I didn't change anything in my config or in the project files. My default Python version is 2.7; I didn't notice that earlier :) If this is run with Python 3, there will probably be no problems.

hgrif commented 6 years ago

Cool, I've added a note to the code for future reference.

Manfed commented 6 years ago

Processing with my change has finished, but the results are strange: the model_pl.word2vec.model.txt file has only 42 vectors, and most of them are for tokens that contain only 1 character. I'll try to run make with python3. BTW, I'm doing this on a Mac, if that makes any difference :)

EDIT: I changed the Python version with the command alias python='python3', but now I'm getting the first error message again.
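
A likely explanation for the single-character vectors, offered as a guess rather than a confirmed diagnosis: `text` is a list of tokens, so `unicode(text)` converts the whole list into one string, and `" ".join(...)` over a string puts a space between every character. The sketch below reproduces the effect on Python 3, using `str()` as a stand-in for Python 2's `unicode()`; the sample tokens are made up.

```python
text = ["wiki", "word2vec"]  # hypothetical tokens for one article

# Converting the list itself to a string yields its repr,
# i.e. "['wiki', 'word2vec']".
as_string = str(text)

# Joining a string iterates over its characters, so every character
# (brackets, quotes and commas included) becomes its own "word".
broken_line = " ".join(as_string)

# Joining the list keeps the tokens intact.
intended_line = " ".join(text)

print(broken_line.split()[:5])
print(intended_line.split())
```

On Python 2 the repr of a unicode list also escapes non-ASCII characters (as in `u'\u0119'`), which would explain why the write stopped failing while the vocabulary collapsed to a few dozen symbols.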

hgrif commented 6 years ago

Hm, that's weird: it does seem to work in Polish for me.

What's the result of:

$ python --version

Manfed commented 6 years ago

The result is Python 3.6.3. Maybe it's something with my Mac config?

EDIT: The same issue occurs on Ubuntu.
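
One thing worth checking, as a hypothesis only: the `u'\u0119'` notation in the traceback is how Python 2 prints unicode strings, and a shell alias such as alias python='python3' applies to the interactive shell, not to commands started by make. So make may still be launching Python 2. A small sketch that could be added temporarily to the script make runs, to see which interpreter and default encoding are really in use (`describe_interpreter` is a hypothetical helper, not part of the repository):

```python
import locale
import sys


def describe_interpreter():
    """Collect the interpreter details relevant to a UnicodeEncodeError."""
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % tuple(sys.version_info[:3]),
        # On Python 3, open() without an explicit encoding uses this value.
        "preferred_encoding": locale.getpreferredencoding(False),
    }


info = describe_interpreter()
for key in sorted(info):
    print("%s: %s" % (key, info[key]))
```

If the version printed from inside the make run is 2.x, the alias is not being picked up. If it is 3.x but the preferred encoding is ASCII (for example under a C/POSIX locale), the default file encoding would be the next suspect.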