Multiple languages? - Githubissues

yves-chevallier commented 4 years ago

I have a document written in French and English. Is it possible to have something like:

spelling_lang=['en_US', 'fr_CH']

dhellmann commented 4 years ago

That isn't supported today, but should be possible to implement.

The SpellingChecker would need to support loading several dictionaries and only reporting an error if the token cannot be found in any of them. It would also have to track suggestions across all dictionaries and include them all.
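A minimal sketch of that idea, not the actual SpellingChecker code: the class name MultiDictChecker, the plain sets standing in for real dictionaries, and the use of difflib for suggestions are all assumptions made for illustration. A word is accepted if any dictionary knows it, and suggestions are pooled across all of them.

```python
import difflib
from typing import Dict, List, Set


class MultiDictChecker:
    """Hypothetical checker: a word is valid if ANY dictionary contains it."""

    def __init__(self, dictionaries: Dict[str, Set[str]]) -> None:
        # Maps a language tag (e.g. "en_US") to a set of lowercase known words.
        # Real dictionaries would come from the spelling backend instead.
        self.dictionaries = dictionaries

    def check(self, word: str) -> bool:
        """Return True if at least one dictionary knows the word."""
        lowered = word.lower()
        return any(lowered in words for words in self.dictionaries.values())

    def suggest(self, word: str) -> List[str]:
        """Pool suggestions from every dictionary, without duplicates."""
        seen: Set[str] = set()
        pooled: List[str] = []
        for words in self.dictionaries.values():
            # sorted() keeps the result deterministic for a given input.
            for candidate in difflib.get_close_matches(word.lower(), sorted(words)):
                if candidate not in seen:
                    seen.add(candidate)
                    pooled.append(candidate)
        return pooled


checker = MultiDictChecker({
    "en_US": {"hello", "world"},
    "fr_CH": {"bonjour", "monde"},
})
print(checker.check("Bonjour"))  # known to the French dictionary
print(checker.check("helo"))     # unknown to both, so it would be reported
print(checker.suggest("helo"))
```

With this shape, an error is reported only when check() fails for every configured language, which matches the behaviour described above.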

The configuration options would need to support specifying multiple languages, as you suggest.

And it might be useful to have a directive to control the dictionaries in use for individual files, although that isn't strictly necessary.

Are you interested in contributing those changes?

yves-chevallier commented 4 years ago

I am not sure how interested I am. I am writing quite a long piece of documentation (in French with some English) using Sphinx, and it is very important for me to have a CI job that does at least a rough spell check. However, I didn't find any good package to do this, and I am not really convinced by enchant, which doesn't have a good tokenizer...

For example, words such as Backus-Naur should be written with a dash and supported in the dictionary as is. Currently I have two words in my dictionary, Backus and Naur, because the tokenizer doesn't understand compound words. Also, some words, such as C keywords (while, for, return), cannot be written with a capital letter. sphinxcontrib.spelling should therefore support the text in code-block directives, and it should support the language keywords by default. Another very annoying/important issue with the spelling is the way user dictionaries work. I would much prefer having support for regex patterns, such as for the verb eat: [Ee]at(s?|en)|ate or manger in French [Mm]ange(s|ons|z|nt|ai[st])...
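To make the regex idea concrete, here is a small sketch of a pattern-based user dictionary. PatternWordList is a hypothetical name, not something sphinxcontrib.spelling provides; it assumes each entry is a regular expression that must match the whole token.

```python
import re
from typing import Iterable


class PatternWordList:
    """Hypothetical user dictionary whose entries are regex patterns."""

    def __init__(self, patterns: Iterable[str]) -> None:
        self._patterns = [re.compile(p) for p in patterns]

    def accepts(self, word: str) -> bool:
        # fullmatch() requires the pattern to cover the entire token,
        # so "eat" does not accidentally accept "eating" or "beat".
        return any(p.fullmatch(word) for p in self._patterns)


wordlist = PatternWordList([
    r"[Ee]at(s?|en)|ate",  # eat, eats, eaten, Eat, ..., ate
    r"Backus-Naur",        # compound term, only valid with the dash
])

for token in ["eaten", "Eats", "ate", "eated", "Backus-Naur", "Backus"]:
    print(token, wordlist.accepts(token))
```

This only helps if the tokenizer hands over Backus-Naur as a single token; if it is split at the dash first, the second pattern never gets a chance to match.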

It seems sphinxcontrib.spelling is the best candidate for now, but not a good one for French :(

dhellmann commented 4 years ago

Yes, I suppose the quality of support for French terms depends on the underlying library for tokenizing and the dictionary for various conjugated forms of words.

It would probably be possible to support a tokenizer that recognizes technical terms like Backus-Naur, but I haven't looked into that because I haven't needed it myself, yet.
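As a rough illustration of what such a tokenizer could look like (an assumption for discussion, not enchant's tokenizer or anything in the current code base), a single regular expression can keep hyphen-joined words together:

```python
import re
from typing import List

# One or more letters, optionally followed by "-letters" groups.
# [^\W\d_] means "a word character that is not a digit or underscore",
# i.e. a letter, including accented ones such as "é".
_WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def tokenize(text: str) -> List[str]:
    """Split text into words, keeping compounds like Backus-Naur whole."""
    return _WORD.findall(text)


print(tokenize("The Backus-Naur form is well-known."))
# ['The', 'Backus-Naur', 'form', 'is', 'well-known']
```

A dash that is not followed by a letter (as in "pre- and post-conditions") is treated as punctuation, so "pre" comes out on its own while "post-conditions" stays whole.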

Language-specific terms within code-blocks are interesting. Perhaps the tokenizer for the syntax highlighter could be reused for that.

dhellmann commented 4 years ago

I should also say that most of the code base for sphinxcontrib-spelling doesn't care about which underlying spelling checker is used, so if there is a different library that works better for other languages we could make that pluggable (either based on the language or based on a new configuration option) and hide the differences in the SpellingChecker class.

bmrec commented 3 years ago

I vote for this feature. For now I use a workaround: a merged dictionary (en+ru).

sphinx-contrib / spelling

Multiple languages? #64