jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0
5.37k stars, 355 forks

[BUG] Advanced search doesn't find terms like "W-2" (any single character and a dash) #1596

Open dblitt opened 2 years ago

dblitt commented 2 years ago

Describe the bug

I keep my W-2 tax forms in Paperless, but an advanced search for "W-2" returns no results. If I change "W-2" to "WW-2", advanced search behaves as expected. The "Title" and "Title & content" search options do not seem to be affected. I could not find anything in the Whoosh documentation that explains this behavior. The "-" itself does not seem to be the problem, because "WW-2" works fine. I have no idea what is causing this.

To Reproduce

Steps to reproduce the behavior:

  1. Put W-2 in a document title
  2. Advanced search "W-2"
  3. Observe "0 documents"

Expected behavior

The document should appear in the search results.

Screenshots

N/A

Webserver logs

N/A

Relevant information

mweimerskirch commented 2 years ago

Not sure if this should be classified as a bug. It's somewhat expected behaviour given the default settings of Whoosh, the search library the project uses. By default, it treats most non-alphanumeric characters, including the dash, as separators. This means "W-2" actually becomes two single-character words: "W" and "2". And single-character words are ignored in the search (or during indexing, not sure).
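To illustrate the explanation above, here is a rough standard-library sketch of that tokenization. It does not call Whoosh itself: the pattern and the minimum length of 2 are assumptions meant to mirror Whoosh's default analyzer, the stop-word list is left out, and `default_like_tokens` is a made-up name.

```python
import re

# Assumption: approximates Whoosh's default token pattern, i.e. runs of word
# characters, optionally joined by single dots. A dash is not part of a token.
TOKEN_PATTERN = re.compile(r"\w+(\.?\w+)*")

# Assumption: tokens shorter than this are dropped, as the default analyzer does.
MIN_TOKEN_LENGTH = 2


def default_like_tokens(text):
    """Split text into lowercase tokens and drop the ones that are too short."""
    tokens = [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]
    return [token for token in tokens if len(token) >= MIN_TOKEN_LENGTH]


print(default_like_tokens("W-2"))   # both pieces are one character long
print(default_like_tokens("WW-2"))  # "ww" survives, "2" is dropped
```

Under these assumptions "W-2" leaves nothing to search for, while "WW-2" still yields the term "ww", which would explain why the second query returns results.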

There is a way though to customize the tokenizer and leave the "dash" untouched.
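As a sketch of what such a customization could look like, the snippet below uses a token pattern that keeps a dash between word characters. It is plain `re` code rather than Whoosh code; the pattern and the name `dash_keeping_tokens` are illustrative assumptions. In Whoosh, a pattern like this would presumably be passed as the `expression` of a `RegexTokenizer`, and the resulting analyzer assigned to the indexed text field.

```python
import re

# Hypothetical pattern: word characters, optionally joined by a single dot
# or a single dash, so "W-2" stays one token instead of being split.
DASH_KEEPING_PATTERN = re.compile(r"\w+(?:[.-]\w+)*")


def dash_keeping_tokens(text, min_length=2):
    """Tokenize text while keeping dashes inside tokens, then drop short tokens."""
    tokens = [match.group(0).lower() for match in DASH_KEEPING_PATTERN.finditer(text)]
    return [token for token in tokens if len(token) >= min_length]


print(dash_keeping_tokens("2021 W-2 form"))
```

The trade-off is that a hyphenated word such as "tax-form" would then be indexed as a single term, so a search for "form" alone would no longer match it. Documents indexed with the old tokenizer would also need to be reindexed before the change takes effect.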

Here's the documentation for that library: https://whoosh.readthedocs.io/ Hope that helps.