morfologik / polimorfologik

Scripts for preprocessing morfologik data.

problem with building dictionary #1

Closed redguy666 closed 11 years ago

redguy666 commented 11 years ago

Hi, I tried to build the dictionary following the provided instructions (downloaded all external dictionary files, etc.), but the result always differs from the provided Polish dictionary and does not work with Solr. Also, when I tried to use the fsa scripts with the default pl.dict from the standard jar, I got errors:

>fsa_guess -d pl.dict

Invalid dictionary version in file: pl.dict Version number is -58 which indicates dictionary was build: with yet unknown compile options (upgrade your software)

milekpl commented 11 years ago

Hi,

On 2013-03-15 11:45, Maciej Lizewski wrote:

Hi, I tried to build the dictionary following the provided instructions (downloaded all external dictionary files, etc.), but the result always differs from the provided Polish dictionary and does not work with Solr.

What are the differences? Can you send a diff?

Also, when I tried to use the fsa scripts with the default pl.dict from the standard jar, I got errors:

>fsa_guess -d pl.dict

Invalid dictionary version in file: pl.dict Version number is -58 which indicates dictionary was build: with yet unknown compile options (upgrade your software)

The problem is that you need to set flags = fsa5 in the Makefile. I will change the Makefile targets to separate the cfsa2 and fsa5 formats more clearly.

Also, you need to use -I with fsa_guess.

Best, Marcin
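The "-58" in the error message is consistent with a tool reading a format version byte it does not recognize. A minimal Python sketch of how one might inspect that byte follows. It assumes the dictionary starts with the four magic bytes `\fsa` followed by a single version byte, and that 0x05, 0xC5 and 0xC6 denote FSA5, CFSA and CFSA2 respectively (as in the Java Morfologik sources). Both the layout and the mapping are assumptions, not something stated in this thread, and the function names are hypothetical.

```python
import struct

# Assumed header layout: 4 magic bytes b"\\fsa" followed by one version byte.
FSA_MAGIC = b"\\fsa"

# Assumed version-to-format mapping (not confirmed anywhere in this thread).
KNOWN_VERSIONS = {0x05: "FSA5", 0xC5: "CFSA", 0xC6: "CFSA2"}


def describe_header(header: bytes) -> str:
    """Name the automaton format encoded in the first five bytes of a file."""
    if len(header) < 5 or header[:4] != FSA_MAGIC:
        raise ValueError("not an FSA dictionary (magic bytes missing)")
    unsigned = header[4]
    # A tool that reads the byte as a signed char reports 0xC6 as -58.
    (signed,) = struct.unpack("b", header[4:5])
    name = KNOWN_VERSIONS.get(unsigned, "unknown")
    return f"{name} (version byte 0x{unsigned:02X}, signed {signed})"


def describe_file(path: str) -> str:
    """Read only the header of a dictionary file and describe it."""
    with open(path, "rb") as handle:
        return describe_header(handle.read(5))
```

Under these assumptions, `describe_header(b"\\fsa\xc6")` returns `"CFSA2 (version byte 0xC6, signed -58)"`, which matches the number in the error above; `describe_file("pl.dict")` would apply the same check to an actual file.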

dweiss commented 11 years ago

CFSA2 should be fine if you plan to use it in Solr, since Solr uses the Java version of Morfologik (which supports CFSA2). Your problem is somewhere else. Please provide exact reproduction steps: how you compile the dictionary, what the input files are, etc.

milekpl commented 11 years ago

I changed the Makefile as well because it was slightly wrong. Please use the new one. The target pl.dict is fine for Java, polish.dict is for fsa_morph.

By the way, fsa_guess is NOT suitable for morphological dictionaries. Only fsa_morph is.

redguy666 commented 11 years ago

OK, one thing is now clear: it seems I had old script sources (I downloaded them from a location other than this git repository). This one uses a Java application to create the dictionary :)

Anyway, I downloaded the current sources from git, plus odm.txt, polish.all and pl_PL.aff, and converted them to UTF-8 (as described in readme_pl.txt), but when trying to build with "make pl.dict" I get an error about a missing "eksport.tab" file needed to build polimorfologik.txt. Looking further at the Makefile, more files are missing: join_tags.awk and version_script.awk (the first one is also needed to build polimorfologik.txt). Where can I find those 3 files?
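For reference, the UTF-8 conversion step mentioned above could look roughly like the Python sketch below. The source encoding is an assumption (ISO-8859-2 is common for older Polish text files; the thread does not say which encoding the files used), and `convert_to_utf8` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def convert_to_utf8(src: str, dst: str, source_encoding: str = "iso-8859-2") -> None:
    """Re-encode a text file to UTF-8.

    The default source encoding is an assumption; pass the real one if known.
    Bytes are read and written directly so line endings stay untouched.
    """
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

A call such as `convert_to_utf8("odm.txt", "odm.utf8.txt")` writes a converted copy and leaves the original in place (the output file name here is only an illustration).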

milekpl commented 11 years ago

odm.txt, polish.all and pl_PL.aff are not used at all right now; the only source file is eksport.tab, but you only need polimorfologik.txt. Basically, polimorfologik.txt is just a sorted version of eksport.tab plus a small addition from the brev*.txt file, and it is huge. So it's easier to host it at SourceForge: simply download morfologik.zip and use polimorfologik.txt for further work. I have just added the missing scripts.
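The "sorted version plus a small addition" description can be illustrated with a short Python sketch. This is not the Makefile's actual pipeline (the real build also runs awk scripts such as join_tags.awk, whose behavior is not shown in this thread); it only concatenates the main file with any files matching a glob pattern and sorts the lines byte-wise. The function name and the choice of byte-wise ordering are assumptions.

```python
import glob


def build_sorted_lexicon(main_path: str, extra_pattern: str, out_path: str) -> int:
    """Merge main_path with files matching extra_pattern into one sorted file.

    Returns the number of lines written. Sorting is byte-wise, comparable to
    `LC_ALL=C sort`; the ordering the real Makefile uses is an assumption.
    """
    lines = []
    for path in [main_path, *sorted(glob.glob(extra_pattern))]:
        with open(path, "rb") as handle:
            # Skip blank lines and normalize line endings before sorting.
            lines.extend(line.rstrip(b"\r\n") for line in handle if line.strip())
    lines.sort()
    with open(out_path, "wb") as out:
        out.writelines(line + b"\n" for line in lines)
    return len(lines)
```

For example, `build_sorted_lexicon("eksport.tab", "brev*.txt", "polimorfologik.txt")` would produce one sorted file from the inputs named in the comment above. Given the size of the data, downloading the prebuilt file as suggested remains the simpler route.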

redguy666 commented 11 years ago

Thanks for your help! Everything seems to work now :)

dweiss commented 11 years ago

Just curious -- what was the reason you needed a custom build of the dictionary?

m4rt commented 9 years ago

The make script does not work with the newest morfologik tools jar.

janchorowski commented 9 years ago

Hi,

SourceForge stopped hosting the morfologik files. Where can polimorfologik.txt be downloaded from now?

dweiss commented 9 years ago

Please do not attach comments unrelated to the issue. I've created a new one for you, here: https://github.com/morfologik/morfologik-scripts/issues/3