jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

Guesswork for filename yields inconsistent results #155

Closed Mirodin closed 3 years ago

Mirodin commented 3 years ago

Following the recommendations in the documentation (https://paperless-ng.readthedocs.io/en/latest/advanced_usage.html#guesswork), I get mixed results with my documents:

  1. 20201027Z - Work - Entgeltabrechnung Oktober.pdf => parses correctly
  2. 20201023Z - Work - Zeitnachweis.pdf => makes "Work" the document title and "Zeitnachweis" a tag

I stumbled upon this while migrating my ~1800 documents; it affects approx. 5-10% of my files. To me this does not look like consistent behaviour. Maybe I am missing something, but that is how I understand the documentation on this.

I am running the latest docker-compose file pulled from GitHub.

Mirodin commented 3 years ago

To reproduce:

  1. Run paperless via docker-compose using the docker-compose.sqlite.yml file inside hub
  2. Create a new empty file (e.g. in LibreOffice Writer)
  3. Export as 20200101Z - Work - Entgeltabrechnung Januar.pdf
  4. Export as 20200101Z - Work - Zeitnachweis.pdf
  5. Import via the web interface

jonaswinkler commented 3 years ago

Hello!

That's a feature that has been in paperless for a long time, and I did not touch it at all (neither code nor documentation).

I've checked the logic of the code, and the documentation does in fact not describe the actual behavior. The consumer first checks for the format created - title - tags, which matches the second filename: "Work" becomes the title and "Zeitnachweis" a tag. However, this rule does not accept tags with spaces, so it cannot match the first filename, where "Entgeltabrechnung Oktober" contains a space. If that rule does not match, the consumer parses the filename as created - correspondent - title. This is what happens for the first file.
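
To illustrate the fallback order described above, here is a minimal sketch. The regular expressions and the `guess` function are hypothetical stand-ins written for this example, not the actual paperless code; the sketch only assumes that rules are tried in order and that the tags rule rejects spaces.

```python
import re

# Hypothetical patterns for illustration only; NOT the real paperless regexes.
DATE = r"(?P<created>\d{8}(?:\d{6})?Z)"

RULES = [
    # Rule 1: created - title - tags (tags may not contain spaces)
    re.compile(rf"^{DATE} - (?P<title>.+) - (?P<tags>[a-z0-9\-,]+)\.pdf$", re.IGNORECASE),
    # Rule 2: created - correspondent - title
    re.compile(rf"^{DATE} - (?P<correspondent>.+) - (?P<title>.+)\.pdf$", re.IGNORECASE),
]


def guess(filename):
    """Return the fields of the first rule that matches, or None."""
    for rule in RULES:
        match = rule.match(filename)
        if match:
            return match.groupdict()
    return None


# Single-word last segment: rule 1 wins, so "Work" ends up as the title.
print(guess("20201023Z - Work - Zeitnachweis.pdf"))
# {'created': '20201023Z', 'title': 'Work', 'tags': 'Zeitnachweis'}

# Last segment contains a space: rule 1 fails, rule 2 takes over.
print(guess("20201027Z - Work - Entgeltabrechnung Oktober.pdf"))
# {'created': '20201027Z', 'correspondent': 'Work', 'title': 'Entgeltabrechnung Oktober'}
```

With rules ordered like this, whether the last segment happens to contain a space decides which interpretation is used, which is consistent with the mixed results reported for the two files.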

I guess it's useful for initially importing lots of documents, but apart from that, I don't think many people use this feature. I'm considering removing most of the logic, see https://github.com/jonaswinkler/paperless-ng/discussions/83 as well. This is particularly annoying when someone decides to put " - " (a hyphen with spaces around it) in a filename: paperless will then split up the title and use part of it for the correspondent. Happened to me a couple of times.

I'm not entirely sure what to do with this feature.

Mirodin commented 3 years ago

Reading #83, I would second getting rid of correspondent/tag guessing and just sticking with date parsing. However, putting a "Z" at the end is kinda weird (even though gscan2pdf does that for me). So maybe have this support some templating? My files are usually named like YYYY-MM-DD title.pdf
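
As a rough sketch of what date-only parsing for names like that could look like (the pattern and the `split_date_and_title` helper are made up for this example and are not part of paperless):

```python
import re
from datetime import date, datetime

# Hypothetical pattern for names like "2020-10-23 Zeitnachweis.pdf".
PREFIX = re.compile(r"^(?P<created>\d{4}-\d{2}-\d{2}) (?P<title>.+)\.pdf$", re.IGNORECASE)


def split_date_and_title(filename):
    """Return (created, title); created is None if no valid date prefix is found."""
    match = PREFIX.match(filename)
    if not match:
        return None, filename
    try:
        created = datetime.strptime(match.group("created"), "%Y-%m-%d").date()
    except ValueError:
        # Looks like a date but is not a real one (e.g. month 13).
        return None, filename
    return created, match.group("title")
```

Here the rest of the name stays in the title untouched, so a " - " inside it would not be split into a correspondent or tags.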

jonaswinkler commented 3 years ago

Alright, I'll put that on the agenda.

Edit: The Z usually denotes Zulu time (UTC); however, I'm not entirely sure why that's required here when paperless just needs to parse dates, not times.

jonaswinkler commented 3 years ago

If you've got something to add, please do so in the related issue.