pandoc / lua-filters

A collection of lua filters for pandoc
MIT License

Spellcheck filter error in parsing aspell output #90

Closed: agitter closed this issue 4 years ago

agitter commented 4 years ago

When aspell is run on a raw Markdown file that contains a possessive such as "pandoc's", it returns the entire string "pandoc's" as a misspelled word. When the pandoc spellcheck filter is run on the same Markdown file, it returns only the suffix "s". Other words with apostrophes behave similarly.

I noticed the spellcheck filter uses the following pattern to capture text from the aspell output: https://github.com/pandoc/lua-filters/blob/3c870cb5799fb1c4cb961b6648e5b3cddc50cfde/spellcheck/spellcheck.lua#L33

I'm not certain this pattern is the cause, but the behavior is unexpected: the spellcheck filter always returns the suffix that follows an apostrophe, even when that suffix is in the aspell custom dictionary.
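
A minimal Lua sketch of how a letters-only capture could produce this symptom. The pattern `(%a*)\n` is an assumption made for illustration, since the thread only links to the filter's line and does not quote it. The sample string is also hypothetical and stands in for what aspell might print for one word.

```lua
-- Hypothetical aspell output for a single flagged word.
local aspell_output = "pandoc's\n"

-- Assumed letters-only capture (not quoted in the thread).
-- %a matches ASCII letters only, so the apostrophe breaks the run.
local words = {}
for w in aspell_output:gmatch("(%a*)\n") do
  words[#words + 1] = w
end

print(table.concat(words, " | "))  -- prints: s
```

Only the run of letters that directly precedes the newline can satisfy the pattern, which is why the part before the apostrophe is lost.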

agitter commented 4 years ago

Changing this pattern to ([%a']+)\n solved my initial problem. However, that pattern still fails to match and return the full text of misspelled words that contain accents. "Naaïve" is reported as "ve", which makes it difficult to find the spelling error.

The pattern ([%S]+)\n meets my needs. I could make a pull request to change this, but I'm not sure what the intended behavior is in general for this filter.
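
The difference between the two candidate patterns can be sketched as follows. The helper name `capture` and the sample strings are hypothetical. The accented word is written with explicit UTF-8 bytes, and the result assumes Lua's default C locale, where `%a` does not match bytes outside ASCII.

```lua
-- Collect every capture of `pattern` in `output` (hypothetical helper).
local function capture(output, pattern)
  local words = {}
  for w in output:gmatch(pattern) do
    words[#words + 1] = w
  end
  return words
end

-- "Naaïve" followed by a newline; "\195\175" is the UTF-8 encoding of "ï".
local accented = "Naa\195\175ve\n"
local possessive = "pandoc's\n"

-- Letters plus apostrophe: keeps the possessive, splits at the accent.
local a1 = capture(possessive, "([%a']+)\n")[1]  -- "pandoc's"
local a2 = capture(accented, "([%a']+)\n")[1]    -- "ve"

-- Any run of non-space bytes: keeps both words whole.
local s1 = capture(possessive, "([%S]+)\n")[1]   -- "pandoc's"
local s2 = capture(accented, "([%S]+)\n")[1]     -- "Naa\195\175ve"

print(a1, a2, s1, s2)
```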

jgm commented 4 years ago

I think one issue with [%S]+ is that it will give bad results in cases where words abut punctuation, e.g.

dogs, cats, and ferrets.

We don't want the commas and periods included here. (Similarly parentheses, dashes, quotation marks.)

But I'm definitely open to other suggestions.

jgm commented 4 years ago

Regarding [%a']+, note that we don't want the single quotes to be included when they're functioning as quotation marks, rather than apostrophes.
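
Both concerns can be illustrated with a short Lua sketch. It applies the character classes directly to raw text, which is an assumption about the worst case: it only matters if aspell's output keeps the surrounding punctuation. The helper `tokens` and the second sample sentence are hypothetical.

```lua
-- Return every match of `pattern` in `text` (hypothetical helper).
local function tokens(text, pattern)
  local out = {}
  for w in text:gmatch(pattern) do
    out[#out + 1] = w
  end
  return out
end

-- %S+ keeps any punctuation that abuts a word.
local with_punct = tokens("dogs, cats, and ferrets.", "%S+")
print(table.concat(with_punct, " | "))
-- prints: dogs, | cats, | and | ferrets.

-- [%a']+ cannot tell a quotation mark from an apostrophe.
local with_quotes = tokens("'cats' and the dog's bowl", "[%a']+")
print(table.concat(with_quotes, " | "))
-- prints: 'cats' | and | the | dog's | bowl
```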

agitter commented 4 years ago

The version of aspell I'm testing with (version 0.60.7) appears to strip leading and trailing punctuation.

I created a file demo.txt with spelling errors:

dogsxyz, catsxyz, and ferretsxyz.
'bearsxyz' and "wolvesxyz"?

The aspell output does not include any punctuation:

> cat content/demo.txt | aspell list
dogsxyz
catsxyz
ferretsxyz
bearsxyz
wolvesxyz

I'm not familiar enough with aspell to know how universal this behavior is across versions, modes, and languages.
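
As a rough check, the output shown above can be fed through the ([%S]+)\n pattern in plain Lua. This is a sketch that hard-codes the five lines from the listing instead of calling aspell, and the variable names are hypothetical. It says nothing about how other aspell versions, modes, or languages behave.

```lua
-- The aspell output from the listing above, one word per line.
local aspell_output = table.concat({
  "dogsxyz", "catsxyz", "ferretsxyz", "bearsxyz", "wolvesxyz",
}, "\n") .. "\n"

-- Capture each run of non-space characters that ends at a newline.
local misspelled = {}
for w in aspell_output:gmatch("([%S]+)\n") do
  misspelled[#misspelled + 1] = w
end

print(#misspelled)                      -- prints: 5
print(table.concat(misspelled, ", "))
-- prints: dogsxyz, catsxyz, ferretsxyz, bearsxyz, wolvesxyz
```

Because this output has no punctuation left in it, the broader pattern returns the words unchanged.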

jgm commented 4 years ago

ah, okay! Why don't you go ahead and submit a PR?