jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0
5.38k stars 356 forks

[BUG] Case sensitive tag matching does not work #1638

Open aweiser opened 2 years ago

aweiser commented 2 years ago

Describe the bug

I wanted to assign tags automatically using a case-sensitive match, in this particular case for the word "Rechnung". I explicitly wanted case-sensitive matching to avoid conflicts with similar words such as "Abrechnung", which would also be matched if case-insensitive matching is selected.

However, I get no matches when I run document_retagger.

After a brief analysis, I found that IMHO a few lines in matching.py cause this bug (see below). I'm not sure whether this is done on purpose; could you please check?

=====
def matches(matching_model, document):
    search_kwargs = {}

    # document_content = document.content.lower()   <-- original source
    document_content = document.content  # <-- modified source to allow case-sensitive matching

ignoreigor commented 2 years ago

Hi @aweiser,

I think you're right. Only the lowercased content is used as the haystack in the regex searches that follow. The needles stay case-sensitive; case sensitivity is only switched off via the flags in search_kwargs when case should be ignored. So with case-sensitive matching, words containing uppercase letters will never be found.

In the fuzzy matching algorithm, additional lowercase conversions are done for both needle and haystack. So it looks like it should work when the lowercase conversion is removed as you suggest, but I'm not a Python pro yet.
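To make the effect described above easier to reproduce, here is a minimal sketch. It is not the actual code from matching.py: `find_word` and its `lowercase_haystack` parameter are hypothetical names, and the word-boundary pattern is only an assumption about how the needle is searched. It just shows what happens when a case-sensitive needle is searched in a lowercased haystack.

```python
import re


def find_word(word, content, is_insensitive, lowercase_haystack):
    """Hypothetical, simplified stand-in for the regex search in matching.py."""
    # Case is only ignored when the matching model asks for it.
    search_kwargs = {"flags": re.IGNORECASE} if is_insensitive else {}
    # lowercase_haystack=True mimics the original `document.content.lower()`.
    haystack = content.lower() if lowercase_haystack else content
    # Assumption: the needle is searched as a whole word.
    pattern = rf"\b{re.escape(word)}\b"
    return re.search(pattern, haystack, **search_kwargs) is not None


content = "Rechnung Nr. 42"

# Original behaviour: lowercased haystack, case-sensitive needle.
print(find_word("Rechnung", content, is_insensitive=False, lowercase_haystack=True))   # False

# Proposed change: haystack left untouched.
print(find_word("Rechnung", content, is_insensitive=False, lowercase_haystack=False))  # True
print(find_word("rechnung", content, is_insensitive=False, lowercase_haystack=False))  # False

# Case-insensitive matching works with either haystack.
print(find_word("rechnung", content, is_insensitive=True, lowercase_haystack=True))    # True
print(find_word("rechnung", content, is_insensitive=True, lowercase_haystack=False))   # True
```

Under these assumptions, a case-sensitive needle containing an uppercase letter can never match a lowercased haystack, while case-insensitive matching gives the same result whether or not the haystack is lowercased. Since the fuzzy path reportedly lowercases needle and haystack itself, it should not depend on the early `.lower()` either.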