hrs / docsim

A simple, fast command-line tool for searching and comparing text documents.
GNU General Public License v3.0

Treat curly apostrophes like single quotes #6

Closed hrs closed 1 year ago

hrs commented 1 year ago

We do some string manipulation when parsing to ensure that contractions pass through correctly. It's important that "don't" is parsed as "don't" instead of "dont" or "don" "t" to match the entry in the stoplist (and to handle e.g. proper names containing apostrophes).

Documents sometimes use a curly single quote (that's ’, which is &rsquo; in HTML) in contractions. This adds some logic to ensure that it's parsed the same way a straight single quote would be.
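For illustration, here's a minimal sketch of the idea in Python. This is not the project's actual code: the names `tokenize`, `remove_stopwords`, and `STOPLIST`, and the token regex, are all assumptions made up for this example. It normalizes the curly apostrophe (U+2019) to a straight single quote before tokenizing, so contractions stay intact and can match stoplist entries.

```python
import re

# U+2019 RIGHT SINGLE QUOTATION MARK, the "curly" apostrophe (&rsquo; in HTML).
CURLY_APOSTROPHE = "\u2019"

# Hypothetical stoplist for this sketch; a real one would be much longer.
STOPLIST = {"don't", "the", "a"}

# A token is a run of letters/digits, optionally joined by internal
# straight apostrophes, so "don't" and "o'brien" each stay one token.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")


def tokenize(text):
    """Lowercase the text, treat curly apostrophes as single quotes, and split into tokens."""
    normalized = text.replace(CURLY_APOSTROPHE, "'").lower()
    return TOKEN_PATTERN.findall(normalized)


def remove_stopwords(tokens):
    """Drop any token that appears in the (hypothetical) stoplist."""
    return [token for token in tokens if token not in STOPLIST]


if __name__ == "__main__":
    # Both spellings of the contraction produce the same tokens.
    print(tokenize("Don\u2019t panic"))
    print(tokenize("Don't panic"))
    print(remove_stopwords(tokenize("Don\u2019t call O\u2019Brien")))
```

Normalizing once, up front, means the rest of the parsing only has to know about one kind of apostrophe.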