Open bryanjenningz opened 1 year ago
Using a better separator to make parsing logic easier seems good to me. I can't think of any downside to keeping the current logic the same in general. A couple thoughts:
I only bring this up since if there were any expansive or hard changes to the data format, I would suggest waiting. In this case, the change seems easy so there's little lost.
Generally, the update dictionary script runs Every Monday night (for me) so let me know if you want to coordinate timing the release in case it's hard to align the dictionaries otherwise.
I'll assign this to you for now assuming you want to work on it!
@all-contributors Add @bryanjenningz for ideas.
@melink14
I've put up a pull request to add @bryanjenningz! :tada:
Is your feature request related to a problem? Please describe. Currently, the dict.dat file and names.dat file use spaces to separate fields, which makes parsing logic complicated because there are spaces in the definitions field. To add to the complexity, some of the lines in dict.dat and names.dat leave out the pronunciation field, so this makes the parsing logic even more complicated.
Examples of the current format in dict.dat and names.dat which include the pronunciation field:
Examples of the current format in dict.dat and names.dat which don't include the pronunciation field:
If we wanted to add another field to these dictionary files (e.g. pitch accent) it would require us to change the parsing logic to be even more complicated and even harder to understand. The current format is also using extra characters to help differentiate fields which makes the dictionary files larger than they need to be (i.e. we use "[" and "]" characters to indicate the pronunciation field and we use the "/" character around the definitions field, which causes 4 extra characters to be used per line - since there are about 1,030,000 lines total in dict.dat and names.dat together, this means we add about 4MB extra space by using this dictionary format).
Describe the solution you'd like A simple solution would be to use a different character to separate each field in dict.dat and names.dat - for example we could use the tab character (\t) or the pipe character (|) or any other character which is never used in the fields of dict.dat and names.dat.
If we used the tab or pipe character, parsing logic would be trivial, we could just split the line by tabs or by pipes (e.g.
const [word, pronunciation, definitions] = line.split('\t');
). This would also make it easier for us to add new fields in the future without making the parsing logic more complex (e.g. if we added pitch accent, we could just parse it like thisconst [word, pronunciation, definitions, pitchAccents] = line.split('\t');
). This would also allow us to save memory by removing the "[", "]", and "/" characters around the pronunciation and definitions field and we wouldn't need to add any extra characters around any new fields we add in the future.So the example lines from above would be formatted like this (separated by \t):
Notice that the last 2 entries don't have a pronunciation so there are 2 tabs in a row, this makes parsing easier.
Describe alternatives you've considered We could possibly just keep the format the same and have parsing logic in a single function and unit test that function very well so that it would prevent any bugs. If we keep the format the same then we can reuse this function in the update-db.ts file which is where the files get indexed and we could also use this function in the data.ts file where the entries are parsed. But if we do this, then this would still use up 4MB extra space (because of the "[", "]", and "/" characters around pronunciation and definitions) and it would still require us to write a separate parsing function in the future if we add an extra field (e.g. a pitch accents field), so this function would only be used in the update-db.ts file if we add any extra fields in the future. Another possible solution would be to use a separate file for pitch accents and then index that file separately rather than adding pitch accents as a new field in dict.dat and names.dat. This way, we could reuse that parsing function but we would still need to parse the pitch accents file. But it might be more simple to just add the pitch accents as a new field so they're packaged together, so it is probably a better idea to just simplify the dict.dat and names.dat file format to use tabs or pipes as separators.
Additional context Right now, dict.dat and names.dat are created in the update-db.ts file. They are fetched from a different source, then they're indexed and saved but the file format isn't changes, so this is where we would change the file format to use tab or pipes instead of the current format. If we want to add any fields in the future (e.g. pitch accents field), this is also where we would add them.