ipython-contrib / jupyter_contrib_nbextensions

A collection of various notebook extensions for Jupyter
http://jupyter-contrib-nbextensions.readthedocs.io/en/latest

[spellchecker] unable to load eu dictionary #1158

Open jcb91 opened 6 years ago

jcb91 commented 6 years ago

(this issue comes from @jmigartua, originally via email, reproduced here with consent)

The point is that I am trying (hard, as I am not very proficient in these things) to install another hunspell dictionary to spell check the Jupyter Notebooks I work with, to prepare the class material for my students. I would like to spell check Basque, and for that I have the corresponding files: eu_ES.dic and eu_ES.aff. I have taken them from this address: https://raw.githubusercontent.com/wooorm/dictionaries/master/dictionaries/eu/index.dic and the similar address for the .aff

I have tried various options:

  • write the URL in the Nbextensions configurator (it does not work)
  • use the .py script from the extension's readme to install the dictionaries locally (it does not work)
  • copy the two files manually into the proper folder, as indicated in the extension's readme: ./typo/dictionaries/eu_ES.dic (it does not work)
  • write the config.yaml file myself (it does not work). This is the file:

    Type: Jupyter Notebook Extension    # I have changed IPython -> Jupyter in the desperate hope it could work...
    Compatibility: 4.x, 5.x
    Name: spellchecker
    Main: main.js
    Description: 'Adds a CodeMirror overlay mode for Typo.js spellchecking'
    Link: README.md
    Parameters:
    - name: spellchecker.enable_on_load
      input_type: checkbox
      description: enable spellchecker for all Markdown/Raw cells on notebook load
      default: true
    - name: spellchecker.add_toolbar_button
      input_type: checkbox
      description: add a toolbar button to toggle spellchecker on and off for all Markdown/Raw cells
      default: true
    - name: spellchecker.lang_code
      input_type: text
      description: language code to use with typo.js
      default: 'eu_ES'
    - name: spellchecker.dic_url
      input_type: url
      description: url for the dictionary .dic file to use
      default: 'https://raw.githubusercontent.com/wooorm/dictionaries/master/dictionaries/eu/index.dic'
    - name: spellchecker.aff_url
      input_type: url
      description: url for the dictionary .aff file to use
      default: 'https://raw.githubusercontent.com/wooorm/dictionaries/master/dictionaries/eu/index.aff'

I have a fresh installation of the extensions, from this morning.

I would very much appreciate it if you could help me.

jcb91 commented 6 years ago

Hi Josu!

Ok, so the first thing to note is that I think you're doing everything more or less correctly with your settings. Any of the first three options you've tried should work correctly. Editing the yaml file won't help, as it only affects what the configurator displays (not what the nbextension actually uses), but it also won't hurt.

From my brief investigation, I think the problem arises because the typo.js code is failing to parse the .aff file correctly, but for some reason this doesn't show up as an error in the browser's javascript debugger, so I'm having some difficulty checking which of the 90,000+ lines is causing the problem. I'll keep looking, and get back to you if/when I find out more...

jcb91 commented 6 years ago

So, I've found a typo in the copy of typo.js in this repo (irony!):

https://github.com/ipython-contrib/jupyter_contrib_nbextensions/blob/e4c3c76c3dacc9f4e1631e2866feeaad3094cd43/src/jupyter_contrib_nbextensions/nbextensions/spellchecker/typo/typo.js#L449

should read textCodes, which causes the call at https://github.com/ipython-contrib/jupyter_contrib_nbextensions/blob/e4c3c76c3dacc9f4e1631e2866feeaad3094cd43/src/jupyter_contrib_nbextensions/nbextensions/spellchecker/main.js#L53-L55 to fail, without further explanation...

jcb91 commented 6 years ago

However, with that fixed, it seems to just break the browser tab. The Typo code seems to attempt to form all possible words in the dictionary, which, given the huge number of combinations in the affix file, along with the large number of stems in the dictionary (the eu .aff and .dic files are 3.15MB and 2.05MB, compared to a mere 3.01KB and 532KB for the English versions), causes the page to crash (at about 5GB of memory, on my machine). Using the new version of Typo, it throws a RangeError: Maximum call stack size exceeded, for what I assume is basically the same reason.

I suspect that this represents a flaw in how Typo.js is implemented, since I can get the binary hunspell to perform without obvious error on small samples of text using the relevant dictionary files. The eu files contain suffix-sets composed of large numbers of distinct options (750+), many of which can be followed by other equally-complex suffixes, or even themselves(!). e.g. eu/index.aff#L75633 defines an entry for suffix-group 243 with text ago, which can be followed by the same suffix-group (243):

SFX 243    0       ago/243      .

As a result, the entry in eu/index.dic#L37,

abai/243

can be expanded to form

abai
abaiago
abaiagoago
abaiagoagoago
abaiagoagoagoago
[...et cetera ad infinitum]
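The blow-up above can be sketched in a few lines of Python. The two-rule grammar mirrors the eu entries quoted above (stem abai carrying code 243, suffix ago re-licensing 243); the max_depth cap is my own addition for illustration, since Typo applies no such limit:

```python
# Sketch: naive recursive suffix expansion, roughly what Typo.js does.
# Rule 243 appends "ago" and may be followed by rule 243 again, so
# without a depth cap the expansion never terminates.
RULES = {"243": [("ago", "243")]}  # code -> (suffix text, follow-on code)

def expand(word, rule, depth, max_depth=2):
    words = [word]
    if rule is None or depth >= max_depth:
        return words
    for text, follow in RULES.get(rule, []):
        words += expand(word + text, follow, depth + 1, max_depth)
    return words

print(expand("abai", "243", 0))  # ['abai', 'abaiago', 'abaiagoago']
```

With max_depth=2 (matching hunspell's effective two-affix limit, discussed below) the expansion stops at abaiagoago; remove the cap and the recursion never returns, which matches the crashes and stack overflows seen above.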

I've no idea whether this actually makes sense as a representation of the Basque language (I suspect, based on comments on the hunspell page of http://xuxen.eus/eu/bertsioak, that it may just be the best that can be achieved given hunspell's limitations), but I think it's playing havoc with Typo's slightly naive implementation of the hunspell format. In regular hunspell, I think these are stored as something like linked lists of building blocks, so it functions OK, just checking whether it can build a given word with the given ruleset (see this comment). However, I think Typo tries to build every possible word given the rules (see typo/typo.js#L590-L604), which in recursive cases like the above clearly just blows up.
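The hunspell-style check described above can be sketched as decomposition rather than enumeration. This is a grossly simplified Python sketch: the tiny tables just mirror the abai/243 example, and real hunspell also verifies that the stem actually licenses the peeled affix chain:

```python
# Sketch of the hunspell-style check: rather than pre-building every
# possible word, try to *decompose* the candidate by peeling known
# suffixes off the end until a dictionary stem remains.
DIC = {"abai"}                 # stems from the .dic file (simplified)
SFX = {"243": ["ago"]}         # affix code -> suffix strings it can add

def check(word, depth=0, max_depth=2):
    if word in DIC:
        return True
    if depth >= max_depth:     # hunspell's effective two-affix limit
        return False
    for texts in SFX.values():
        for text in texts:
            # peel the suffix and keep checking the shorter remainder
            if word.endswith(text) and check(word[: -len(text)], depth + 1):
                return True
    return False
```

The work done per lookup is bounded by the word's length and the affix limit, regardless of how many words the grammar could generate, which is why the binary hunspell copes with this dictionary while Typo's enumeration does not.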

jcb91 commented 6 years ago

So in essence, I think this is probably the fault of Typo.js. Whether it's simple enough to be considered a bug that can be solved, I'm less sure, since it would presumably require a fairly major rewrite in order to get Typo to work more like hunspell proper. Assuming you have a browser spellchecker that works correctly for the relevant dictionaries, it might be easier to attempt to make an alternative nbextension which would allow markdown editing using native textarea elements & the browser spellchecker, rather than CodeMirror editors (this would however lose all the formatting, keyboard shortcuts and the like for markdown cell editing)

jcb91 commented 6 years ago

Actually, it seems hunspell limits words to having two affixes, so my earlier example should stop at

abai
abaiago
abaiagoago

but equally, it seems that Typo.js makes no check on the number of affixes already applied...

jcb91 commented 6 years ago

However, even limiting them doesn't really help, since Typo still tries to form every possible word from the available rules, which, by my calculations, comes to something in excess of 5.7 billion words, amounting to a dictionary of about 53.7 GB, even without each word's associated metadata.

The aff & dic files also seem not to be correctly formed: they contain quite a lot of duplicate entries (I find around 23% of the aff file's suffix definitions and about 8% of the dic file's entries to be duplicates). However, even with duplicates removed, the diverse suffixes lead to a huge number of potential words, which makes Typo's approach of explicitly enumerating them all impractical.

jcb91 commented 6 years ago

To clarify what's going on, this is my understanding of how typo constructs words:

  1. pick an entry (line) in the .dic file, e.g.

    abai/243

    on line 37

  2. if the line has no /, then that's the complete word. If there is a /, then extra rules apply to this word. The most common is a numeric code for an affix (a prefix/suffix), e.g. the affix-code 243, as mentioned in 1.
  3. To find possible variants on this word, we apply the affix code(s) to the original entry from step 1:

    1. find affix-code 243 in the .aff file. It will be in a line of the form

      SFX 243 Y 520

      (line 75631 in my copy of the aff file) which indicates a suffix-type affix (SFX), with code 243, which can be combined with other affixes (Y), and comprises 520 different variations.

    2. each of the 520 variations is recorded by a line like

      SFX 243 0 agoaren/238 .

      which indicates that the new word is formed by removing no characters (the 0), then adding agoaren to the end, and then optionally also applying a further suffix with code 238. The . indicates that it can be applied to any preceding word (other suffixes are restricted to words with particular endings). So, we've now got abai, abaiagoaren, plus all the 406 variants provided by rule 238, such as adding etakoaz, to give abaiagoarenetakoaz. I think it's fairly clear that this is rapidly getting to be a lot of words.

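The steps above can be turned into a short parsing sketch. This is simplified Python, not Typo's actual parser: it assumes every rule line carries a condition field, and it ignores the encodings and other flag types a real .aff file can contain:

```python
# Parse SFX groups from .aff-style lines.
# Header:  "SFX <code> <Y|N> <count>"  declares a suffix group.
# Rule:    "SFX <code> <strip> <add>[/continuation] <condition>"
def parse_sfx(lines):
    groups = {}
    for line in lines:
        parts = line.split()
        if not parts or parts[0] != "SFX":
            continue
        if len(parts) == 4 and parts[2] in ("Y", "N"):
            # header: cross-product flag and (unused here) rule count
            groups[parts[1]] = {"cross": parts[2] == "Y", "rules": []}
        else:
            code, strip, add, cond = parts[1], parts[2], parts[3], parts[4]
            add, _, follow = add.partition("/")   # split off continuation code
            groups[code]["rules"].append(
                {"strip": strip, "add": add, "follow": follow or None, "cond": cond}
            )
    return groups

groups = parse_sfx([
    "SFX 243 Y 2",
    "SFX 243 0 ago/243 .",
    "SFX 243 0 agoaren/238 .",
])
```

The `follow` field is where the explosion comes from: each rule's continuation code points at another whole group of variants, and in the eu file those chains loop back on themselves.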
In the English dictionary, there are a relatively small number of pre/suffixes, used in fairly limited circumstances, such as

SFX Y Y 1
SFX Y   0     ly         .

which gives the suffix ly, which can be added to verbs to form adverbs (quiet -> quietly for example), or

SFX R Y 4
SFX R   0     r          e
SFX R   y     ier        [^aeiou]y
SFX R   0     er         [aeiou]y
SFX R   0     er         [^ey]

which is used to form comparative adjectives. The variants apply to different spellings: the first can be applied to any word ending in e (e.g. able -> abler), while the second applies to any word ending in a y not preceded by a vowel (e.g. happy -> happ -> happier). Each of these suffixes, though, creates only a single extra variant of the word it applies to, whereas most of the eu suffixes seem to have several hundred variant forms(!). As a result, I get the impression that the suffixes in the eu dictionary are being used for much more complex purposes than just spelling differences, given that each suffix seems to have lots of variants, none of which are restricted, and most of which can also be chained with others. Either that, or the way hunspell works has been misunderstood or misapplied somehow; but as I don't know any Basque, I can't tell which (if either) interpretation is closer to the truth...
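The English "R" rules quoted above can be applied mechanically: strip the given characters, append the suffix, but only when the condition pattern matches the end of the word. A small sketch (treating the hunspell conditions as regular expressions anchored at the end of the word, which is how they behave):

```python
# Apply the English comparative-adjective ("R") suffix rules quoted above.
import re

R_RULES = [
    # (strip, add, condition) — "0" means strip nothing
    ("0", "r",   "e"),
    ("y", "ier", "[^aeiou]y"),
    ("0", "er",  "[aeiou]y"),
    ("0", "er",  "[^ey]"),
]

def apply_suffix(word, rules):
    out = []
    for strip, add, cond in rules:
        # the condition is tested against the word *before* stripping
        if re.search(cond + "$", word):
            stem = word if strip == "0" else word[: -len(strip)]
            out.append(stem + add)
    return out

print(apply_suffix("able", R_RULES))   # ['abler']
print(apply_suffix("happy", R_RULES))  # ['happier']
```

Each English word yields at most one comparative form here; the eu groups, by contrast, emit hundreds of forms per stem and then chain into further groups.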

jcb91 commented 6 years ago

So, English is obviously a rather simple case for hunspell (most verbs don't change based on object, there aren't many tenses, and those we have are often compound, so don't alter the verb itself much, and there are no declensions). But from a quick read of the Wikipedia article on Basque grammar, it's clear that the situation for hunspell (attempting to decide whether a given word is 'valid' or not) is rather complicated, even compared to other European languages like German or the Romance languages :worried:

Something that stands out to me is the infixes (a bit added to the middle of a word), of which I don't think hunspell has any concept (it only seems to handle prefixes and suffixes). Perhaps they could be handled as suffixes, by removing and then re-adding the characters that follow the infix; whether that works depends on how complex the rules are, I'm not sure.
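The "infix as suffix" idea above would amount to a suffix rule that strips a known ending and re-adds it with the infix in front. A minimal sketch of the mechanics only; the word, ending, and infix here are invented purely for illustration (I make no claim about real Basque morphology):

```python
# Encode an infix (inserted before a known ending) as a suffix rule:
# strip `ending`, then add `infix + ending`.
def infix_as_suffix(word, ending, infix):
    if word.endswith(ending):
        return word[: -len(ending)] + infix + ending
    return word  # rule does not apply

print(infix_as_suffix("etorri", "rri", "ki"))  # 'etokirri' (invented example)
```

This only works when the material after the infix is regular enough to be expressed as a hunspell condition, which is exactly the open question above.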

Anyway, you can see the little bit of analysis I did of the existing dictionaries (looking for duplicates, deciding how many words were represented, etc) in this gist.