Open mikob opened 5 years ago
Hmm, so based on how I've understood the code thus far, the most critical elements for just getting the kana breakdown seem to be the surface form, left id, right id, and cost:
var surface_form = entry[0];
var left_id = entry[1];
var right_id = entry[2];
var word_cost = entry[3];
var feature = entry.slice(4).join(","); // TODO Optimize
The other features (everything after the 4th element in the CSVs) seem to be only for informational output, and not needed for analysis?
Without the extra feature elements (part of speech, pronunciation, etc.) we can shave off ~20% (20MB) from the uncompressed version. After compression this saves ~13% (2MB), which is quite significant.
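To make the idea concrete, here is a minimal sketch of how the CSV entries could be trimmed before building the dictionary. `stripFeatures` and its `keepColumns` parameter are hypothetical names and not part of the project. The sketch assumes mecab-ipadic style lines with no quoted commas, and the sample line uses made-up values. One caveat: if the kana output is taken from the reading feature (column index 11 in the ipadic layout), that column would probably have to be kept, so the helper allows extra columns to be preserved.

```javascript
// Hypothetical helper, not part of the library: keeps only the columns
// needed to build the lattice, plus any extra columns listed by index.
// Assumes a mecab-ipadic style CSV line without quoted commas.
function stripFeatures(line, keepColumns) {
    var entry = line.split(",");
    // surface form, left id, right id, word cost
    var kept = entry.slice(0, 4);
    (keepColumns || []).forEach(function (index) {
        // Fall back to "*" (the usual "no value" placeholder) if the column is missing
        kept.push(index < entry.length ? entry[index] : "*");
    });
    return kept.join(",");
}

// Illustrative line only; the ids and cost are made-up values.
var sample = "東京,1293,1293,3003,名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー";

// Drop every feature column.
var analysisOnly = stripFeatures(sample);

// Keep the reading (index 11 in the assumed ipadic layout) for kana output.
var withReading = stripFeatures(sample, [11]);

console.log(analysisOnly);
console.log(withReading);
```

Running this over each dictionary CSV before the build step would be one way to measure the real savings. Whether the rest of the code tolerates a shortened feature list is untested here.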
@takuyaa do you think there are additional ways to bring down the dictionary file sizes?
Thanks again for your work on this very useful project.
Thanks for the great project! I'm only interested in breaking down kanji into its kana form. I don't need data about the parts of speech, pronunciation, etc. There are currently 12 dictionary files (~17.8MB gzipped) and I want to bring the number down for my simple purposes.
I'm having trouble grasping whether it's possible to exclude the extra info and reduce the amount of dictionary data I need.
Are all the dictionaries critical for getting the kana, or will I be able to modify the code and still get just kana with less dictionary data?