Closed by mirisbowring 3 years ago
It’s not included in the default parsers and it can produce false positives frequently.
This is used not only for parsing dates from filenames (if DATEORDER is configured) but also for parsing dates from the content, and strings of 6 to 8 digits that look like dates aren't exactly uncommon on scanned documents (invoice numbers, etc.). This would affect lots of users, and probably in a negative way.
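To illustrate the false-positive concern, here is a minimal sketch using only the Python standard library. It is not how paperless or dateparser actually scan content; the regular expression, the `looks_like_date` helper, and the sample text are all made up for illustration.

```python
import re
from datetime import date, datetime

# Hypothetical pattern: any run of exactly 8 digits not embedded in a longer number.
DIGIT_RUN = re.compile(r"(?<!\d)\d{8}(?!\d)")

def looks_like_date(text):
    """Return every 8-digit run in `text` that happens to be a valid YYYYMMDD date."""
    hits = []
    for match in DIGIT_RUN.finditer(text):
        try:
            hits.append(datetime.strptime(match.group(), "%Y%m%d").date())
        except ValueError:
            # Not a valid calendar date (e.g. month 88), so skip it.
            pass
    return hits

# Made-up scanned text: an invoice number, a customer number, and a real date.
sample = "Invoice no. 20191231, customer 10458832, scanned 20200315"
print(looks_like_date(sample))
```

The invoice number 20191231 is reported as a date alongside the real one, which is the kind of misdetection described above.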
The entire date parsing integration wasn't done by me, and it is pretty ugly: some unwieldy regular expressions that are impossible to understand and change. There are even multiple mechanisms for retrieving dates from filenames. It's also very slow on longer documents, especially on RPi devices. I need to revise that logic at some point.
So it would probably be best to rename the file with a pre-consume script and turn 20200315 into 2020-03-15? If so, how can I pass the new filename to the consumer?
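The renaming step itself could look roughly like the sketch below. It only covers the string transformation; how a pre-consume script would hand the new name back to the consumer is not addressed here. The `hyphenate_date` helper and its pattern are hypothetical names, and it assumes the compact date is in YYYYMMDD order.

```python
import re
from datetime import datetime

# Hypothetical pattern: an 8-digit run (YYYYMMDD) not embedded in a longer number.
COMPACT_DATE = re.compile(r"(?<!\d)(\d{4})(\d{2})(\d{2})(?!\d)")

def hyphenate_date(filename):
    """Rewrite valid YYYYMMDD runs in `filename` as YYYY-MM-DD."""
    def replace(match):
        year, month, day = match.groups()
        try:
            # Only rewrite digit runs that form a real calendar date.
            datetime(int(year), int(month), int(day))
        except ValueError:
            return match.group(0)
        return f"{year}-{month}-{day}"
    return COMPACT_DATE.sub(replace, filename)

print(hyphenate_date("20200315_invoice.pdf"))
```

Digit runs that are not valid dates are left untouched, so a name like scan_99999999.pdf stays as it is.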
I haven't used that myself yet, and I'm not sure if it works.
PAPERLESS_FILENAME_DATE_ORDER=YMD
See the documentation about this one.
I started moving all my collected PDFs into paperless.
Now I have a problem with the naming scheme:
I used the following date:
By default, dateparser expects e.g.
yyyy-mm-dd
The dateparser library is able to handle this if a specific option is passed:
no-spaces-time
(see https://dateparser.readthedocs.io/en/latest/settings.html#other-settings)
The config must be applied here:
https://github.com/jonaswinkler/paperless-ng/blob/4ef4af452a22ab037eadc1f38483cc714f6d2f1a/src/documents/parsers.py#L204
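A rough sketch of what passing that option could look like, assuming the `PARSERS` and `DATE_ORDER` settings described on the dateparser settings page linked above. The helper names `build_settings` and `parse_compact` are hypothetical, the default parser list is taken from my reading of those docs, and this is not the actual code in parsers.py. No result is shown because how a given digit string is interpreted depends on the dateparser version and settings.

```python
# Parser names as listed in the dateparser settings documentation (assumption):
# the first four are the defaults, "no-spaces-time" is opt-in.
DEFAULT_PARSERS = ["timestamp", "relative-time", "custom-formats", "absolute-time"]

def build_settings(date_order="YMD"):
    """Build a dateparser settings dict that also enables the no-spaces parser."""
    return {
        "PARSERS": DEFAULT_PARSERS + ["no-spaces-time"],
        "DATE_ORDER": date_order,
    }

def parse_compact(text, date_order="YMD"):
    """Parse `text` with dateparser, with the opt-in no-spaces parser enabled."""
    import dateparser  # third-party; imported lazily so the helpers above work without it
    return dateparser.parse(text, settings=build_settings(date_order))

print(build_settings())
```

Calling `parse_compact("20200315")` would then try the extra parser after the default ones, with the false-positive risk mentioned earlier in the thread.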