streetsidesoftware / cspell

A Spell Checker for Code!
https://cspell.org
MIT License
[Bug]: Issues with generating new dictionaries using cspell-tools #6379

Open gothrek22 opened 4 days ago

gothrek22 commented 4 days ago

Kind of Issue

Runtime - command-line tools, Building / Compiling

Tool or Library

cspell-tools

Version

8.14.4 and 8.15.2 for cspell-tools-cli

Supporting Library

No response

OS

Other

OS Version

Doesn't really matter

Description

Thanks for the great software.

I've been trying to help out by converting the Hunspell Korean dictionary into a cspell-compatible source. But no matter what I try when running the conversion, I get core dumps.

That's almost certainly caused by the size of the dictionary (11 MB for the .aff, 44 MB for the .dic). I've tried bumping max-old-space-size up to 60 GB (I have 64 GB available right now), and it still dies. Any idea how I could split this job into chunks, so it runs longer but doesn't die?

I'm reporting this as a bug because it seems to load everything into memory at once and process it there, which causes it to run out of memory (and it would probably keep running out until given some ludicrous amount).

Jason3S commented 4 days ago

@gothrek22,

Thank you for trying.

Some dictionaries are very complicated and include nested compound rules.

Can you share some more information:

gothrek22 commented 4 days ago

@Jason3S I've used the one that's packaged by Fedora, which is this one: https://github.com/spellcheck-ko/hunspell-dict-ko

There is also: https://github.com/wooorm/dictionaries/tree/main/dictionaries/ko

I've set up the cspell-tools config to look like so:

  - name: ko
    sources:
      - ko_KR.aff
    format: trie3
    generateNonStrict: true

Will try maxDepth in a sec.

I've tried installing cspell-tools globally and using that directly. I also tried hunspell-reader, with the same result.

gothrek22 commented 3 days ago

Tried just now with this config:

---
targets:
  - name: ko
    sources:
      - ko_KR.aff
    format: trie3
    generateNonStrict: true
    maxDepth: 0

NODE_OPTIONS="--max_old_space_size=30720 " cspell-tools-cli build

Still got an OOM Kill. :(
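
One thing worth ruling out with a setup like the one above is whether the `NODE_OPTIONS` value actually reaches the Node process that does the work. The following is a minimal sketch, not part of cspell-tools: it only reads the heap ceiling that V8 reports, using Node's built-in `v8` module. The helper name `heapLimitMiB` is made up for illustration.

```typescript
import { getHeapStatistics } from "node:v8";

// Hypothetical helper: convert V8's reported heap ceiling from bytes to MiB.
function heapLimitMiB(): number {
  return Math.round(getHeapStatistics().heap_size_limit / (1024 * 1024));
}

// Run this with the same NODE_OPTIONS as the build to see what V8 was given.
console.log(`V8 heap limit: ${heapLimitMiB()} MiB`);
```

If the reported limit is in the region of the requested 30720 MB, the flag is being honored, and a kill by the operating system would then suggest the process outgrew the machine's memory rather than V8's own limit.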

Jason3S commented 3 days ago

@gothrek22,

That means applying the rules is causing something to break.

It is not ideal because it is a limited dictionary, but it is possible to get a basic word list without applying rules by using hunspell-reader.

Like this:

hunspell-reader words --no-transform ko_KR.aff -o ko-words.txt

---
targets:
  - name: ko
    sources:
      - ko-words.txt
    format: trie3
    generateNonStrict: true
    maxDepth: 0

Do you have a link to ko_KR.aff/dic you are using?

Jason3S commented 3 days ago

> Do you have a link to ko_KR.aff/dic you are using?

I just noticed that you included it in a previous comment.

gothrek22 commented 3 days ago

Yep, the basic dictionary generated properly. I'm guessing that the issue with the compounding rules is that words in Korean can get weirdly complex.

As in, the root word can be pluralized (in different ways), conjugated on top of that, and potentially carry additional suffixes, which can turn a single four-radical root word into tens of permutations.
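
To make the growth concrete, here is a toy sketch of stacked suffix slots. It is not how hunspell-reader or cspell-tools apply affix rules, and the slots and suffix strings are invented placeholders rather than anything from ko_KR.aff. It assumes each slot can either be skipped or filled with exactly one option, so the number of forms is the product of (options + 1) across slots.

```typescript
// Toy model only: hypothetical suffix slots, not the real ko_KR.aff rules.
type SuffixSlot = readonly string[];

// Count the forms when each slot is either skipped or filled with one option.
function countForms(slots: readonly SuffixSlot[]): number {
  return slots.reduce((total, slot) => total * (slot.length + 1), 1);
}

// Lazily enumerate the forms, one at a time, instead of building them all up front.
function* expand(
  root: string,
  slots: readonly SuffixSlot[],
  depth = 0,
): Generator<string> {
  if (depth === slots.length) {
    yield root;
    return;
  }
  // Option 1: leave this slot empty.
  yield* expand(root, slots, depth + 1);
  // Option 2: attach each suffix available in this slot.
  for (const suffix of slots[depth]) {
    yield* expand(root + suffix, slots, depth + 1);
  }
}

const slots: SuffixSlot[] = [
  ["-p1", "-p2"], // placeholder plural markers
  ["-c1", "-c2", "-c3"], // placeholder conjugation endings
  ["-x1", "-x2"], // placeholder extra suffixes
];

const forms = [...expand("root", slots)];
console.log(countForms(slots), forms.length); // prints: 36 36
```

Three small slots already give 36 forms for one root. Multiplied across the entries of a 44 MB .dic file, that kind of growth would be consistent with the memory exhaustion described earlier in the thread, although the actual rule structure of the Korean dictionary may differ.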

@Jason3S I'll try to link that to cspell and test it on some of the content I have and get back to you ASAP. Thank you for your help too mate 👍