jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0
5.38k stars, 354 forks

[OTHER] Dateparser config #651

Closed (mirisbowring closed 3 years ago)

mirisbowring commented 3 years ago

I started moving all my collected PDFs into paperless.

Now I have a problem with the naming scheme:

My filenames start with a date in the following format:

20210118_bla...
20191123_blub...
...

By default, dateparser expects something like yyyy-mm-dd.

The dateparser library can handle the compact form if a specific option, the no-spaces-time parser, is enabled: https://dateparser.readthedocs.io/en/latest/settings.html#other-settings

The config must be applied here:

https://github.com/jonaswinkler/paperless-ng/blob/4ef4af452a22ab037eadc1f38483cc714f6d2f1a/src/documents/parsers.py#L204

jonaswinkler commented 3 years ago

It’s not included in the default parsers and it can produce false positives frequently.

The date parser is used not only for parsing dates from filenames (if DATEORDER is configured) but also from the document content, and strings of 6 to 8 digits that might look like dates aren't exactly uncommon on scanned documents (invoice numbers, etc.). This will affect lots of users, and probably in a negative way.
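To illustrate the false-positive concern, here is a minimal sketch using only the Python standard library, not dateparser or the actual paperless-ng code. The pattern, the function name `compact_date_candidates`, and the sample text are all made up for illustration; the point is only that an invoice number can be a perfectly valid yyyymmdd date.

```python
import re
from datetime import datetime

# Hypothetical pattern: any standalone run of exactly 8 digits.
EIGHT_DIGITS = re.compile(r"(?<!\d)\d{8}(?!\d)")


def compact_date_candidates(text):
    """Return (token, date) pairs for every 8-digit run that is a valid yyyymmdd date."""
    hits = []
    for token in EIGHT_DIGITS.findall(text):
        try:
            hits.append((token, datetime.strptime(token, "%Y%m%d").date()))
        except ValueError:
            # Not a valid calendar date, so it would not be mistaken for one.
            pass
    return hits


# Made-up document text: the invoice number happens to look like a date.
sample = "Invoice no. 20191231, issued 2021-01-18."
print(compact_date_candidates(sample))
```

In this sample the invoice number is reported as 31 December 2019, while the real issue date (written with dashes) is not touched by this pattern at all.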

The entire date parsing integration wasn't done by me, and is pretty ugly: some unwieldy regular expressions that are impossible to understand and change. There are even multiple mechanisms for retrieving dates from filenames. It's also very slow on longer documents, especially on RPi devices. I need to revise that logic at some point.

mirisbowring commented 3 years ago

So it would probably be best to rename the file with a pre-consume script, turning 20200315 into 2020-03-15? If so, how can I pass the new filename to the consumer?
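The renaming step itself could look roughly like the sketch below. It assumes the naming scheme from the first post (yyyymmdd at the start of the filename, followed by an underscore); the helper name `dashed_name` is hypothetical, and the sketch only computes the new name. Whether the consumer picks up a file renamed by a pre-consume script is exactly the open question here, so renaming the files before dropping them into the consumption folder may be the safer route.

```python
import re

# Assumed scheme: a yyyymmdd prefix at the very start, followed by "_".
COMPACT_PREFIX = re.compile(r"^(\d{4})(\d{2})(\d{2})(?=_)")


def dashed_name(filename):
    """Rewrite a leading yyyymmdd prefix as yyyy-mm-dd; leave other names unchanged."""
    return COMPACT_PREFIX.sub(r"\1-\2-\3", filename, count=1)


print(dashed_name("20200315_bla.pdf"))  # prints 2020-03-15_bla.pdf
```

The function does not check that the digits form a real calendar date; it only reshapes the prefix.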

jonaswinkler commented 3 years ago

I haven't used that myself yet, and I'm not sure if it works.