Open MrAlfabet opened 3 years ago
This has been reported a few times. Date parsing isn't amazing at the moment, but I don't think it's exactly a bug. It does need improvement; I think that's kind of "on the list". See https://github.com/jonaswinkler/paperless-ng/discussions/593, for example.
I could understand if DD/MM and MM/DD got mixed up, but these dates have nothing to do with the document creation date. How could two documents, created seconds from each other, get different creation dates? I see none of this reported in the other thread, so I thought this example would be nice to have when the whole thing gets revamped.
Paperless does not use the "created" date provided by filesystem metadata, it scans the content of the document for dates and uses that to assign the "created" field.
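To illustrate why two documents scanned seconds apart can end up with different 'created' dates, here is a minimal sketch of content-based date detection. It is not Paperless's actual parser: the `DATE_PATTERN` regex, the `guess_created_date` helper, the day-first assumption, and the sample texts are all hypothetical and only show the general idea of taking the first plausible date found in the OCR text.

```python
import re
from datetime import date

# Hypothetical pattern: numeric day-first dates such as 04-04-2021 or 1/3/2021.
# A real parser would need to handle many more formats and languages.
DATE_PATTERN = re.compile(r"\b(\d{1,2})[-/.](\d{1,2})[-/.](\d{4})\b")


def guess_created_date(text):
    """Return the first valid day-first date found in the text, or None."""
    for match in DATE_PATTERN.finditer(text):
        day, month, year = (int(group) for group in match.groups())
        try:
            return date(year, month, day)
        except ValueError:
            # Not a real calendar date (e.g. 31-02-2021); try the next match.
            continue
    return None


# Made-up OCR output for two documents scanned on the same day.
doc_a = "Invoice date: 04-04-2021. Please pay within 30 days."
doc_b = "Statement of 01/03/2021 for account 12345."

print(guess_created_date(doc_a))
print(guess_created_date(doc_b))
```

With an approach like this, the result depends only on what is printed on the page, so rescanning the same document gives the same date every time, regardless of when the scan was made.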
Ah, in that case I think I should add that as a feature request? That kind of explains a lot.
Most of the time (pretty much always for scanned documents), the date created on the filesystem will not match the date the document was actually created (such as the date of an invoice).
I'd have to agree. Perhaps the addition of a 'scanned' or 'file created' date would be the feature I'm looking for? The only time this feature would be useful is if paperless is consuming a folder of documents scanned (way) before the first run. Since the 'added' date for (most of) these documents seems to be incorrect, I now have no way of organizing these documents by 'scanned' date.
Right now I'm scanning and instantly processing everything that comes in the mail (that's relevant), and I can organize them by 'date added' to get the order I want, but not for my 3 year backlog of scans that was consumed on first run.
Edit: this might be too off-topic, but I've noticed paperless-ng also uses quite a bit of disk space when processing documents. The traceback you see in the logs in the OP was triggered when a 20MB PDF needed over 800MB of disk space to be processed, something my container was not configured for. Note that the data directory is mounted elsewhere, so this space is used by some temp/processing folder. It would be nice to see a mention of this under 'resource usage' in the readme.
One potential idea: a few of the scanners I've used put the "scanned" date in the PDF metadata/EXIF. I know that's not necessarily the document production date, but an optional setting/method to use the "scanned" date from the file metadata would be hugely appreciated (in my cases it would improve the accuracy).
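As a rough sketch of that idea: PDF files store dates such as the creation date as strings in the form `D:YYYYMMDDHHmmSS+HH'mm'`, where everything after the year is optional. Assuming the raw string has already been read from the file with some PDF library, it could be converted to a Python `datetime` roughly like this. The `parse_pdf_date` helper is hypothetical, not an existing Paperless setting or function.

```python
import re
from datetime import datetime, timedelta, timezone

# PDF date string: D:YYYYMMDDHHmmSS followed by Z or +HH'mm' / -HH'mm'.
# Only the year is mandatory; missing fields fall back to defaults below.
PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value):
    """Parse a PDF date string into a datetime, or return None if invalid."""
    match = PDF_DATE.match(value.strip())
    if match is None:
        return None
    defaults = (0, 1, 1, 0, 0, 0)  # year, month, day, hour, minute, second
    year, month, day, hour, minute, second = (
        int(group) if group is not None else default
        for group, default in zip(match.groups()[:6], defaults)
    )
    tz = None
    if match.group(7):  # literal "Z" means UTC
        tz = timezone.utc
    elif match.group(8):  # explicit +HH'mm' or -HH'mm' offset
        offset = timedelta(
            hours=int(match.group(9)), minutes=int(match.group(10) or 0)
        )
        tz = timezone(offset if match.group(8) == "+" else -offset)
    try:
        return datetime(year, month, day, hour, minute, second, tzinfo=tz)
    except ValueError:
        return None


# Example value a scanner might write (made up for illustration).
scanned = parse_pdf_date("D:20210509143000+02'00'")
print(scanned.date() if scanned else "no usable date")
```

An optional setting could then prefer this metadata date over a date guessed from the content, and fall back to content parsing when the field is missing or malformed.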
Describe the bug The 'created' date is incorrect for consumed documents.
To Reproduce I'm scanning documents from an HP LaserJet Pro MFP M225dw over the network to a Samba share hosted in a Debian container. The created files are 300dpi color PDFs. I've installed Paperless-ng in an LXC container, which has the shared folder mounted inside (Proxmox mountpoint, so no NFS/SMB). Paperless has consumed all the files that were previously in that folder, made them searchable, and gave everything a 'created' date, which I thought was lovely since I've had scans in that folder that are over 3 years old. I never bothered to check whether these 'created' dates could be correct though, as I assumed Paperless would just look at the file creation date.
Now that I'm scanning new documents that come in the mail, I've noticed strange behavior; consumed documents/scans will get a 'created' date that is not correct (so far, only in the past). First I thought it was a time-zone issue, or maybe the date/time on the printer was set incorrectly, but this turned out to be not the case. Different documents will get a different 'created' date, even if scanned just seconds apart. The same document, however, if scanned multiple times, will always get the same 'created' date.
I just now (9th of May) scanned 2 different documents a couple of times in random order. I now have 2 or 3 copies of both documents: the 2 copies of 'document A' have a 'created' date of April 4th, and the 3 copies of 'document B' have a date of March 1st.
Expected behavior Documents scanned and PDFs created on May 9th should get a 'created' date of May 9th in Paperless.
Webserver logs
Relevant information
paperless.conf:
PAPERLESS_OCR_LANGUAGE=eng+nld
PAPERLESS_TASK_WORKERS=2
PAPERLESS_THREADS_PER_WORKER=4
PAPERLESS_TIME_ZONE=Europe/Amsterdam

I have also modified the secret key, allowed_hosts, and cors_allowed_hosts for the reverse proxy.