moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License

[FEAT] Split out system installs from spellchecker bash script #2101

Open zslade opened 6 months ago

zslade commented 6 months ago

Is your proposal related to a problem?

The spellchecker bash script depends on Homebrew to install the packages required by the pyspelling spellchecker. However, not everyone uses or can use Homebrew.

Related to PR #2025

Describe the solution you'd like

Following a discussion with @ThomasHepworth, we think the best way forward would be to put system-specific guidance in the Docs for the installs of packages (e.g. aspell) and remove these steps from the bash script. The bash script will then just be used to run the spellchecker.

See comments here: https://github.com/moj-analytical-services/splink/pull/2025#issuecomment-2022614084

Describe alternatives you've considered

Additional context

zmbc commented 6 months ago

we think the best way forward would be to put system-specific guidance in the Docs... and remove these steps from the bash script

I don't see any advantage to doing this. If the person is using Mac and Homebrew, the current script makes things more convenient than having to follow manual instructions.

In the other thread, you said:

(We don’t use conda, so a conda-only option wouldn’t work for us.)

Just because you don't use conda now doesn't mean you couldn't start, and it would remove all this complexity, since it is cross-platform. But if you don't want to use conda, it could at least be added in addition to Homebrew in the script so it is an option. No harm in that, and as I said I'd be happy to contribute it. In fact, doing so would be quite simple: I'd just add aspell and go-yq to the existing conda quickstart instructions, and then skip that part of the script if aspell and yq are already present.
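To illustrate the "skip that part of the script if aspell and yq are already present" idea, here is a minimal sketch of such a check. It is not the actual script from the repo: the `missing_tools` helper name and the messages are made up for illustration, and the install step is left as a comment.

```shell
#!/bin/sh
# Hypothetical helper (not from the Splink repo): print the name of each
# requested tool that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Only the two tools named in this thread are checked.
missing="$(missing_tools aspell yq)"

if [ -z "$missing" ]; then
    echo "aspell and yq found; skipping the install step"
else
    echo "Not found on PATH:" $missing
    # Assumption: the existing Homebrew install step would run here,
    # or the script would point to the install guidance in the docs.
fi
```

With this shape, anyone who installed the tools another way (conda, apt, etc.) never hits the Homebrew-specific branch.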

Lastly, I am still confused about the LibreOffice dictionaries (not the custom dictionary for Splink). From what I can find online, aspell does not support .aff files, so I don't understand how that is working. And the LibreOffice dictionaries don't seem to be necessary: I now get Spelling check passed :) in the master branch, despite having commented out those lines of the script entirely.

zslade commented 6 months ago

I don't see any advantage to doing this. If the person is using Mac and Homebrew, the current script makes things more convenient than having to follow manual instructions.

The original rationale was that the installs are a one-off, and pointing to external documentation owned by the package and package manager creators lessens the burden of maintaining our own script and documentation.

Just because you don't use conda now doesn't mean you couldn't start, and it would remove all this complexity, since it is cross-platform. But if you don't want to use conda, it could at least be added in addition to Homebrew in the script so it is an option. No harm in that, and as I said I'd be happy to contribute it. In fact, doing so would be quite simple: I'd just add aspell and go-yq to the existing conda quickstart instructions, and then skip that part of the script if aspell and yq are already present.

Appreciated. Unfortunately, conda isn't something we are planning to adopt (at least not any time soon), so many thanks for your contribution in PR #2131 to make the spellchecker accessible to more people!

Lastly, I am still confused about the LibreOffice dictionaries (not the custom dictionary for Splink). From what I can find online, aspell does not support .aff files, so I don't understand how that is working. And the LibreOffice dictionaries don't seem to be necessary: I now get Spelling check passed :) in the master branch, despite having commented out those lines of the script entirely.

Nice spot! I think these were legacy files from an earlier dev version, and/or we were possibly relying on hunspell instead. Thanks for removing them in your PR #2131.