jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

[BUG] 'Created' date incorrect for scanned PDFs #999

Open MrAlfabet opened 3 years ago

MrAlfabet commented 3 years ago

Describe the bug

The 'created' date is incorrect for consumed documents.

To Reproduce

I'm scanning documents from an HP LaserJet Pro MFP M225dw over the network to a Samba share hosted in a Debian container. The created files are 300 dpi color PDFs. I've installed Paperless-ng in an LXC container, which has the shared folder mounted inside (Proxmox mountpoint, so no NFS/SMB). Paperless has consumed all the files that were previously in that folder, made them searchable, and gave everything a 'created' date, which I thought was lovely since I've had scans in that folder that are over 3 years old. I never bothered to check whether these 'created' dates were correct though, as I assumed Paperless would just look at the file creation date.

Now that I'm scanning new documents that come in the mail, I've noticed strange behavior: consumed documents/scans get a 'created' date that is not correct (so far, always in the past). At first I thought it was a time-zone issue, or that the date/time on the printer was set incorrectly, but this turned out not to be the case. Different documents get different 'created' dates, even if scanned just seconds apart. The same document, however, if scanned multiple times, always gets the same 'created' date.

Just now (9th of May) I scanned 2 different documents a couple of times in random order. I now have 2 or 3 copies of each document: the 2 copies of 'document A' have a 'created' date of April 4th, and the 3 copies of 'document B' have a date of March 1st.

Expected behavior

Documents scanned and PDFs created on May 9th should get a 'created' date of May 9th in Paperless.

Webserver logs

[2021-05-09 04:28:35,826] [INFO] [paperless.management.consumer] Adding /opt/paperless/src/../consume/Color_Scan_000887.pdf to the task queue.
[2021-05-09 04:28:35,831] [INFO] [paperless.management.consumer] Adding /opt/paperless/src/../consume/ventilatie principeschets.pdf to the task queue.
[2021-05-09 04:28:35,833] [INFO] [paperless.management.consumer] Adding /opt/paperless/src/../consume/adkamermarkt.pdf to the task queue.
[2021-05-09 04:28:35,835] [INFO] [paperless.management.consumer] Adding /opt/paperless/src/../consume/pensioenoverzicht .pdf to the task queue.
[2021-05-09 04:28:35,836] [INFO] [paperless.management.consumer] Adding /opt/paperless/src/../consume/Color_Scan_000758.pdf to the task queue.
[2021-05-09 04:28:35,838] [INFO] [paperless.management.consumer] Adding /opt/paperless/src/../consume/Color_Scan_000895.pdf to the task queue.
[2021-05-09 04:28:35,840] [INFO] [paperless.management.consumer] Adding /opt/paperless/src/../consume/Koopovereenkomst .pdf to the task queue.
[2021-05-09 04:28:35,841] [INFO] [paperless.management.consumer] Adding /opt/paperless/src/../consume/Ondertekende VSO .pdf to the task queue.
[2021-05-09 04:28:35,843] [INFO] [paperless.management.consumer] Using inotify to watch directory for changes: /opt/paperless/src/../consume
[2021-05-09 04:28:36,044] [INFO] [paperless.consumer] Consuming Color_Scan_000887.pdf
[2021-05-09 04:28:36,047] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-05-09 04:28:36,121] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-05-09 04:28:36,126] [DEBUG] [paperless.consumer] Parsing Color_Scan_000887.pdf...
[2021-05-09 04:28:38,183] [INFO] [paperless.consumer] Consuming Color_Scan_000895.pdf
[2021-05-09 04:28:38,183] [INFO] [paperless.consumer] Consuming Color_Scan_000758.pdf
[2021-05-09 04:28:38,185] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-05-09 04:28:38,187] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-05-09 04:28:38,200] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-05-09 04:28:38,202] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-05-09 04:28:38,204] [DEBUG] [paperless.consumer] Parsing Color_Scan_000895.pdf...
[2021-05-09 04:28:38,206] [DEBUG] [paperless.consumer] Parsing Color_Scan_000758.pdf...
[2021-05-09 04:28:42,026] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /opt/paperless/src/../consume/Color_Scan_000758.pdf
[2021-05-09 04:28:42,270] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/opt/paperless/src/../consume/Color_Scan_000758.pdf', 'output_file': '/tmp/paperless/paperless-9f5ce7ex/archive.pdf', 'use_threads': True, 'jobs': '4', 'language': 'eng+nld', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-9f5ce7ex/sidecar.txt'}
[2021-05-09 04:36:33,833] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-9f5ce7ex
[2021-05-09 04:36:33,846] [ERROR] [paperless.consumer] Error while consuming document Color_Scan_000758.pdf: OSError: [Errno 122] Disk quota exceeded: '/tmp/ocrmypdf.io.y8r2t25a/optimize.opt.pdf'
Traceback (most recent call last):
  File "/opt/paperless/src/paperless_tesseract/parsers.py", line 232, in parse
    ocrmypdf.ocr(**args)
  File "/opt/paperless/.local/lib/python3.7/site-packages/ocrmypdf/api.py", line 326, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
  File "/opt/paperless/.local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 373, in run_pipeline
    exec_concurrent(context)
  File "/opt/paperless/.local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 299, in exec_concurrent
    pdf = post_process(pdf, context)
  File "/opt/paperless/.local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 234, in post_process
    return optimize_pdf(pdf_out, context)
  File "/opt/paperless/.local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 821, in optimize_pdf
    optimize(input_file, output_file, context, save_settings)
  File "/opt/paperless/.local/lib/python3.7/site-packages/ocrmypdf/optimize.py", line 607, in optimize
    pike.save(target_file, **save_settings)
  File "/opt/paperless/.local/lib/python3.7/site-packages/pikepdf/_methods.py", line 807, in save
    recompress_flate=recompress_flate,
OSError: [Errno 122] Disk quota exceeded: '/tmp/ocrmypdf.io.y8r2t25a/optimize.opt.pdf'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/paperless/src/documents/consumer.py", line 248, in try_consume_file
    document_parser.parse(self.path, mime_type, self.filename)
  File "/opt/paperless/src/paperless_tesseract/parsers.py", line 281, in parse
    raise ParseError(f"{e.__class__.__name__}: {str(e)}")
documents.parsers.ParseError: OSError: [Errno 122] Disk quota exceeded: '/tmp/ocrmypdf.io.y8r2t25a/optimize.opt.pdf'
[2021-05-09 12:49:38,419] [INFO] [paperless.management.consumer] Adding /opt/paperless/src/../consume/Scan_.pdf to the task queue.
[2021-05-09 12:49:38,594] [INFO] [paperless.consumer] Consuming Scan_.pdf
[2021-05-09 12:49:38,597] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-05-09 12:49:38,612] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-05-09 12:49:38,617] [DEBUG] [paperless.consumer] Parsing Scan_.pdf...
[2021-05-09 12:49:38,778] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /opt/paperless/src/../consume/Scan_.pdf
[2021-05-09 12:49:38,947] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/opt/paperless/src/../consume/Scan_.pdf', 'output_file': '/tmp/paperless/paperless-jv6q3mop/archive.pdf', 'use_threads': True, 'jobs': '4', 'language': 'eng+nld', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-jv6q3mop/sidecar.txt'}
[2021-05-09 12:50:06,748] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file
[2021-05-09 12:50:06,750] [DEBUG] [paperless.consumer] Generating thumbnail for Scan_.pdf...
[2021-05-09 12:50:06,760] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-jv6q3mop/archive.pdf[0] /tmp/paperless/paperless-jv6q3mop/convert.png
[2021-05-09 12:50:06,844] [WARNING] [paperless.parsing] Thumbnail generation with ImageMagick failed, falling back to ghostscript. Check your /etc/ImageMagick-x/policy.xml!
[2021-05-09 12:50:07,562] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-jv6q3mop/gs_out.png /tmp/paperless/paperless-jv6q3mop/convert_gs.png
[2021-05-09 12:50:07,774] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-jv6q3mop/convert_gs.png -out /tmp/paperless/paperless-jv6q3mop/thumb_optipng.png
[2021-05-09 12:50:25,501] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
[2021-05-09 12:50:25,512] [DEBUG] [paperless.consumer] Saving record to database
[2021-05-09 12:50:25,655] [DEBUG] [paperless.consumer] Deleting file /opt/paperless/src/../consume/Scan_.pdf
[2021-05-09 12:50:25,673] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-jv6q3mop
[2021-05-09 12:50:25,674] [INFO] [paperless.consumer] Document 2021-03-01 Scan_ consumption finished
[2021-05-09 16:49:14,587] [INFO] [paperless.management.consumer] Adding /opt/paperless/src/../consume/Scan_.pdf to the task queue.
[2021-05-09 16:49:14,751] [INFO] [paperless.consumer] Consuming Scan_.pdf
[2021-05-09 16:49:14,755] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-05-09 16:49:14,776] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-05-09 16:49:14,783] [DEBUG] [paperless.consumer] Parsing Scan_.pdf...
[2021-05-09 16:49:15,019] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /opt/paperless/src/../consume/Scan_.pdf
[2021-05-09 16:49:15,205] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/opt/paperless/src/../consume/Scan_.pdf', 'output_file': '/tmp/paperless/paperless-k5478k87/archive.pdf', 'use_threads': True, 'jobs': '4', 'language': 'eng+nld', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-k5478k87/sidecar.txt'}
[2021-05-09 16:49:47,627] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file
[2021-05-09 16:49:47,630] [DEBUG] [paperless.consumer] Generating thumbnail for Scan_.pdf...
[2021-05-09 16:49:47,639] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-k5478k87/archive.pdf[0] /tmp/paperless/paperless-k5478k87/convert.png
[2021-05-09 16:49:47,659] [WARNING] [paperless.parsing] Thumbnail generation with ImageMagick failed, falling back to ghostscript. Check your /etc/ImageMagick-x/policy.xml!
[2021-05-09 16:49:48,236] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-k5478k87/gs_out.png /tmp/paperless/paperless-k5478k87/convert_gs.png
[2021-05-09 16:49:48,475] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-k5478k87/convert_gs.png -out /tmp/paperless/paperless-k5478k87/thumb_optipng.png
[2021-05-09 16:50:05,768] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
[2021-05-09 16:50:05,774] [DEBUG] [paperless.consumer] Saving record to database
[2021-05-09 16:50:05,977] [DEBUG] [paperless.consumer] Deleting file /opt/paperless/src/../consume/Scan_.pdf
[2021-05-09 16:50:05,995] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-k5478k87
[2021-05-09 16:50:05,996] [INFO] [paperless.consumer] Document 2021-04-04 Scan_ consumption finished
[2021-05-09 23:21:54,711] [INFO] [paperless.management.consumer] Adding /opt/paperless/src/../consume/Scan_.pdf to the task queue.
[2021-05-09 23:21:54,839] [INFO] [paperless.consumer] Consuming Scan_.pdf
[2021-05-09 23:21:54,841] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-05-09 23:21:54,856] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-05-09 23:21:54,861] [DEBUG] [paperless.consumer] Parsing Scan_.pdf...
[2021-05-09 23:21:54,995] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /opt/paperless/src/../consume/Scan_.pdf
[2021-05-09 23:21:55,172] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/opt/paperless/src/../consume/Scan_.pdf', 'output_file': '/tmp/paperless/paperless-su7bob8g/archive.pdf', 'use_threads': True, 'jobs': '4', 'language': 'eng+nld', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-su7bob8g/sidecar.txt'}
[2021-05-09 23:22:21,652] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file
[2021-05-09 23:22:21,653] [DEBUG] [paperless.consumer] Generating thumbnail for Scan_.pdf...
[2021-05-09 23:22:21,665] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-su7bob8g/archive.pdf[0] /tmp/paperless/paperless-su7bob8g/convert.png
[2021-05-09 23:22:21,685] [WARNING] [paperless.parsing] Thumbnail generation with ImageMagick failed, falling back to ghostscript. Check your /etc/ImageMagick-x/policy.xml!
[2021-05-09 23:22:22,177] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-su7bob8g/gs_out.png /tmp/paperless/paperless-su7bob8g/convert_gs.png
[2021-05-09 23:22:22,387] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-su7bob8g/convert_gs.png -out /tmp/paperless/paperless-su7bob8g/thumb_optipng.png
[2021-05-09 23:22:39,431] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
[2021-05-09 23:22:39,439] [DEBUG] [paperless.consumer] Saving record to database
[2021-05-09 23:22:39,545] [DEBUG] [paperless.consumer] Deleting file /opt/paperless/src/../consume/Scan_.pdf
[2021-05-09 23:22:39,562] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-su7bob8g
[2021-05-09 23:22:39,564] [INFO] [paperless.consumer] Document 2021-03-01 Scan_ consumption finished
[2021-05-09 23:22:42,977] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
[2021-05-09 23:24:16,453] [INFO] [paperless.management.consumer] Adding /opt/paperless/src/../consume/Scan_.pdf to the task queue.
[2021-05-09 23:24:16,585] [INFO] [paperless.consumer] Consuming Scan_.pdf
[2021-05-09 23:24:16,588] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-05-09 23:24:16,603] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-05-09 23:24:16,608] [DEBUG] [paperless.consumer] Parsing Scan_.pdf...
[2021-05-09 23:24:16,773] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /opt/paperless/src/../consume/Scan_.pdf
[2021-05-09 23:24:16,946] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/opt/paperless/src/../consume/Scan_.pdf', 'output_file': '/tmp/paperless/paperless-axks58w4/archive.pdf', 'use_threads': True, 'jobs': '4', 'language': 'eng+nld', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-axks58w4/sidecar.txt'}
[2021-05-09 23:24:50,052] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file
[2021-05-09 23:24:50,055] [DEBUG] [paperless.consumer] Generating thumbnail for Scan_.pdf...
[2021-05-09 23:24:50,065] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-axks58w4/archive.pdf[0] /tmp/paperless/paperless-axks58w4/convert.png
[2021-05-09 23:24:50,085] [WARNING] [paperless.parsing] Thumbnail generation with ImageMagick failed, falling back to ghostscript. Check your /etc/ImageMagick-x/policy.xml!
[2021-05-09 23:24:50,697] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-axks58w4/gs_out.png /tmp/paperless/paperless-axks58w4/convert_gs.png
[2021-05-09 23:24:50,952] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-axks58w4/convert_gs.png -out /tmp/paperless/paperless-axks58w4/thumb_optipng.png
[2021-05-09 23:25:09,449] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
[2021-05-09 23:25:09,457] [DEBUG] [paperless.consumer] Saving record to database
[2021-05-09 23:25:09,538] [DEBUG] [paperless.consumer] Deleting file /opt/paperless/src/../consume/Scan_.pdf
[2021-05-09 23:25:09,556] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-axks58w4
[2021-05-09 23:25:09,557] [INFO] [paperless.consumer] Document 2021-04-04 Scan_ consumption finished
[2021-05-09 23:26:58,925] [INFO] [paperless.management.consumer] Adding /opt/paperless/src/../consume/Scan_.pdf to the task queue.
[2021-05-09 23:26:59,074] [INFO] [paperless.consumer] Consuming Scan_.pdf
[2021-05-09 23:26:59,077] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-05-09 23:26:59,092] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-05-09 23:26:59,099] [DEBUG] [paperless.consumer] Parsing Scan_.pdf...
[2021-05-09 23:26:59,337] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /opt/paperless/src/../consume/Scan_.pdf
[2021-05-09 23:26:59,522] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/opt/paperless/src/../consume/Scan_.pdf', 'output_file': '/tmp/paperless/paperless-s0bb93mj/archive.pdf', 'use_threads': True, 'jobs': '4', 'language': 'eng+nld', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-s0bb93mj/sidecar.txt'}
[2021-05-09 23:27:32,656] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file
[2021-05-09 23:27:32,658] [DEBUG] [paperless.consumer] Generating thumbnail for Scan_.pdf...
[2021-05-09 23:27:32,668] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-s0bb93mj/archive.pdf[0] /tmp/paperless/paperless-s0bb93mj/convert.png
[2021-05-09 23:27:32,701] [WARNING] [paperless.parsing] Thumbnail generation with ImageMagick failed, falling back to ghostscript. Check your /etc/ImageMagick-x/policy.xml!
[2021-05-09 23:27:33,470] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-s0bb93mj/gs_out.png /tmp/paperless/paperless-s0bb93mj/convert_gs.png
[2021-05-09 23:27:33,723] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-s0bb93mj/convert_gs.png -out /tmp/paperless/paperless-s0bb93mj/thumb_optipng.png
[2021-05-09 23:27:53,099] [DEBUG] [paperless.classifier] Document classification model does not exist (yet), not performing automatic matching.
[2021-05-09 23:27:53,105] [DEBUG] [paperless.consumer] Saving record to database
[2021-05-09 23:27:53,254] [DEBUG] [paperless.consumer] Deleting file /opt/paperless/src/../consume/Scan_.pdf
[2021-05-09 23:27:53,273] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-s0bb93mj
[2021-05-09 23:27:53,275] [INFO] [paperless.consumer] Document 2021-04-04 Scan_ consumption finished

shamoon commented 3 years ago

This has been reported a few times. Date parsing isn't amazing at the moment, but it's not exactly a bug, I think. It does need improvement; I think that's kinda "on the list". See https://github.com/jonaswinkler/paperless-ng/discussions/593, for example.

MrAlfabet commented 3 years ago

> This has been reported a few times. Date parsing isn't amazing at the moment, but it's not exactly a bug, I think. It does need improvement; I think that's kinda "on the list". See #593, for example.

I could understand it if DD/MM and MM/DD got mixed up, but these dates have nothing to do with the document creation date. How could two documents, created seconds from each other, get different creation dates? I see none of this reported in the other thread, so I thought this example would be nice to have when the whole thing gets revamped.

jonaswinkler commented 3 years ago

Paperless does not use the "created" date provided by filesystem metadata; it scans the content of the document for dates and uses that to assign the "created" field.
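To illustrate the idea of picking a date out of the document text, here is a minimal sketch. It is not Paperless's actual parser: the regular expression, the function name `first_date_in_text`, and the `day_first` flag are all assumptions made for this example, and the sample text is made up.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical pattern: numeric dates such as 01-03-2021, 1/3/2021 or 01.03.2021.
DATE_PATTERN = re.compile(r"\b(\d{1,2})[./-](\d{1,2})[./-](\d{4})\b")


def first_date_in_text(text: str, day_first: bool = True) -> Optional[date]:
    """Return the first valid calendar date found in OCR text, or None."""
    for match in DATE_PATTERN.finditer(text):
        a, b, year = (int(group) for group in match.groups())
        day, month = (a, b) if day_first else (b, a)
        try:
            return date(year, month, day)
        except ValueError:
            # Not a real calendar date (for example 31-02-2021); keep looking.
            continue
    return None


if __name__ == "__main__":
    # Made-up OCR output for a letter that mentions an older date.
    sample = "Reference 4711\nDate: 01-03-2021\nPlease reply before 04/04/2021"
    print(first_date_in_text(sample))
```

If date detection works roughly like this, it would account for the behavior in the report: the result depends only on the text of the document, so rescanning the same page gives the same date, while two different pages scanned seconds apart can give very different ones.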

MrAlfabet commented 3 years ago

> Paperless does not use the "created" date provided by filesystem metadata; it scans the content of the document for dates and uses that to assign the "created" field.

Ah, in that case I think I should add that as a feature request? That kind of explains a lot.

jonaswinkler commented 3 years ago

Most of the time (pretty much always for scanned documents), the date created on the filesystem will not match the date the document was actually created (such as the date of an invoice).

MrAlfabet commented 3 years ago

I'd have to agree. Perhaps the addition of a 'scanned' or 'file created' date would be the feature I'm looking for? The only time this feature would be useful is if paperless is consuming a folder of documents scanned (way) before the first run. Since the 'added' date for (most of) these documents seems to be incorrect, I now have no way of organizing these documents by 'scanned' date.

Right now I'm scanning and instantly processing everything that comes in the mail (that's relevant), and I can organize them by 'date added' to get the order I want, but not for my 3 year backlog of scans that was consumed on first run.
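As a rough sketch of what a 'file created' date could be based on: the snippet below reads the file's modification time, which on Linux is usually the closest thing available to a scan time, since a true creation time is not reliably exposed. The function name `file_modified_utc` is hypothetical, and nothing here is part of Paperless.

```python
import os
from datetime import datetime, timezone


def file_modified_utc(path: str) -> datetime:
    """Return the file's last-modification time as a timezone-aware UTC datetime."""
    # Assumption: mtime stands in for the scan time. Copying or re-saving
    # the file can change it, so it is only a best-effort value.
    return datetime.fromtimestamp(os.stat(path).st_mtime, tz=timezone.utc)


if __name__ == "__main__":
    # Use this script itself as a stand-in for a scanned PDF.
    print(file_modified_utc(__file__))
```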

Edit: this might be too off-topic, but I've noticed paperless-ng also uses quite a bit of disk space when processing documents. The traceback you see in the logs in the OP was triggered when a 20MB PDF needed over 800MB of disk space to be processed, something my container was not configured for. Note that the data directory is mounted elsewhere, so this space is used by some temp/processing folder. It would be nice to see a mention of this under 'resource usage' in the readme.
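For anyone who wants to check their temp directory before consuming large scans, a small sketch follows. The helper name `has_free_space` is hypothetical, and the 800MB figure is simply the one observed above, not a documented requirement. Note that `shutil.disk_usage` reports free space on the filesystem, which may not reflect a per-user or per-container quota.

```python
import shutil
import tempfile


def has_free_space(directory: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `directory` has enough free space."""
    return shutil.disk_usage(directory).free >= required_bytes


if __name__ == "__main__":
    tmp_dir = tempfile.gettempdir()
    required = 800 * 1024 ** 2  # roughly the 800MB seen for one 20MB PDF
    if has_free_space(tmp_dir, required):
        print(f"{tmp_dir} has at least {required} bytes free")
    else:
        print(f"{tmp_dir} may be too small for OCR temp files")
```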

rknightion commented 3 years ago

One potential idea: a few of the scanners I've used put the "scanned" date in the PDF metadata/EXIF. I know that's not necessarily the document production date, but an optional setting/method to use the "scanned" date from the file metadata would be hugely appreciated (in my case it'd improve the accuracy).
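A minimal sketch of how such a "scanned" date could be read, assuming the scanner writes a standard PDF date string of the form `D:YYYYMMDDHHmmSSOHH'mm'` into the document's `CreationDate` entry. The function name `parse_pdf_date` and the sample value are made up for illustration. Extracting the raw string from the PDF is left out, and a missing time zone is treated as UTC here, which is an assumption.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Everything after the year is optional; the offset is 'Z' or +HH'mm' / -HH'mm'.
PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value: str) -> Optional[datetime]:
    """Parse a PDF date string into an aware datetime, or return None."""
    match = PDF_DATE.match(value.strip())
    if not match:
        return None
    # Missing components fall back to the first month/day and to midnight.
    year, month, day, hour, minute, second = (
        int(group) if group else default
        for group, default in zip(match.groups()[:6], (0, 1, 1, 0, 0, 0))
    )
    try:
        if match.group(8):
            offset = timedelta(
                hours=int(match.group(9)), minutes=int(match.group(10) or 0)
            )
            tz = timezone(offset if match.group(8) == "+" else -offset)
        else:
            tz = timezone.utc  # 'Z' or no offset given (assumed to be UTC)
        return datetime(year, month, day, hour, minute, second, tzinfo=tz)
    except ValueError:
        # Out-of-range values such as month 13 or an offset of 99 hours.
        return None


if __name__ == "__main__":
    # Made-up example value, not taken from a real scan.
    print(parse_pdf_date("D:20210509124938+02'00'"))
```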