takuyaa / kuromoji.js

JavaScript implementation of Japanese morphological analyzer
832 stars 117 forks

Using only kanji->kana data #31

Open mikob opened 5 years ago

mikob commented 5 years ago

Thanks for the great project! I'm only interested in breaking down kanji into its kana form. I don't need data about the parts of speech, pronunciation, etc. There are currently 12 dictionary files (~17.8MB gzipped), and I want to bring that down for my simple purposes.

I'm having trouble working out whether it's possible to exclude the extra info and reduce the amount of dictionary data I need.

Are all the dictionaries critical for getting the kana, or will I be able to modify the code and still get just kana with less dictionary data?

mikob commented 5 years ago

Hmm, so based on how I've understood the code so far, the critical elements for getting just the kana breakdown seem to be the surface form, left id, right id, and cost:

        var surface_form = entry[0];
        var left_id = entry[1];
        var right_id = entry[2];
        var word_cost = entry[3];
        var feature = entry.slice(4).join(",");  // TODO Optimize

The other features (everything after the 4th element in each CSV row) seem to be used only for informational output and not for the analysis itself. Is that right?
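
A minimal sketch of what trimming a row could look like, under some assumptions: `trimEntry` and `READING_COLUMN` are hypothetical names, not part of kuromoji.js; the CSV rows are assumed to follow the mecab-ipadic column order, where the katakana reading sits at index 11 among the feature columns; and the sample row uses made-up ids and cost. If that layout holds, dropping every feature column would also drop the reading, so the sketch has a flag to keep that one column.

```javascript
// Hypothetical helper, not part of kuromoji.js.
// Assumed column order (mecab-ipadic style):
//   0 surface_form, 1 left_id, 2 right_id, 3 word_cost,
//   4..10 part-of-speech and conjugation features, 11 reading, 12 pronunciation
// The split is naive: it does not handle quoted fields that contain commas.
var READING_COLUMN = 11;

function trimEntry(line, keepReading) {
    var entry = line.split(",");
    // Keep only what the lattice search needs: surface, left id, right id, cost.
    var kept = entry.slice(0, 4);
    // Optionally keep the reading so the kana is still available afterwards.
    if (keepReading && entry.length > READING_COLUMN) {
        kept.push(entry[READING_COLUMN]);
    }
    return kept.join(",");
}

// Sample row with made-up ids and cost, for illustration only.
var sample = "東京,1293,1293,3003,名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー";
console.log(trimEntry(sample, false));
console.log(trimEntry(sample, true));
```

Running this over every row before building the dictionary would be one way to measure how much each variant actually saves; the dictionary builder would still need to accept rows with fewer feature columns, which this sketch does not check.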

mikob commented 5 years ago

Without the extra feature elements (part of speech, pronunciation, etc.) we can shave ~20% (20MB) off the uncompressed dictionaries. After compression the saving is about 13% (2MB), which is still quite significant.

@takuyaa do you think there are additional ways to bring down the dictionary file sizes?

Thanks again for your work on this very useful project.