the-paperless-project / paperless

Scan, index, and archive all of your paper documents
GNU General Public License v3.0

TypeError: Required argument 'day' (pos 3) not found #304

Closed Findus23 closed 6 years ago

Findus23 commented 6 years ago

Probably related to #291

I have a quite complex PDF (governmental form) with input fields I'd like to add to paperless. When I add it, paperless crashes with the following exception:

Feb 13 12:45:22 standpc python[22419]: Parsers available: RasterisedDocumentParser
Feb 13 12:45:22 standpc python[22419]: Consuming /media/daten/Dokumente/Eingang/nameoffile.pdf
Feb 13 12:45:23 standpc python[22419]: Skipping OCR, using Text from PDF
Feb 13 12:45:23 standpc python[22419]: Traceback (most recent call last):
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/git/paperless/src/manage.py", line 18, in <module>
Feb 13 12:45:23 standpc python[22419]: execute_from_command_line(sys.argv)
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
Feb 13 12:45:23 standpc python[22419]: utility.execute()
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
Feb 13 12:45:23 standpc python[22419]: self.fetch_command(subcommand).run_from_argv(self.argv)
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
Feb 13 12:45:23 standpc python[22419]: self.execute(*args, **cmd_options)
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/django/core/management/base.py", line 330, in execute
Feb 13 12:45:23 standpc python[22419]: output = self.handle(*args, **options)
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/git/paperless/src/documents/management/commands/document_consumer.py", line 57, in handle
Feb 13 12:45:23 standpc python[22419]: self.loop()
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/git/paperless/src/documents/management/commands/document_consumer.py", line 67, in loop
Feb 13 12:45:23 standpc python[22419]: self.file_consumer.consume()
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/git/paperless/src/documents/consumer.py", line 121, in consume
Feb 13 12:45:23 standpc python[22419]: date = parsed_document.get_date()
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/git/paperless/src/paperless_tesseract/parsers.py", line 223, in get_date
Feb 13 12:45:23 standpc python[22419]: 'RETURN_AS_TIMEZONE_AWARE': True})
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/conf.py", line 81, in wrapper
Feb 13 12:45:23 standpc python[22419]: return f(*args, **kwargs)
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/__init__.py", line 53, in parse
Feb 13 12:45:23 standpc python[22419]: data = parser.get_date_data(date_string, date_formats)
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/date.py", line 404, in get_date_data
Feb 13 12:45:23 standpc python[22419]: locale, date_string, date_formats, settings=self._settings)
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/date.py", line 177, in parse
Feb 13 12:45:23 standpc python[22419]: return instance._parse()
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/date.py", line 187, in _parse
Feb 13 12:45:23 standpc python[22419]: date_obj = parser()
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/date.py", line 209, in _try_parser
Feb 13 12:45:23 standpc python[22419]: self._get_translated_date(), settings=self._settings)
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/conf.py", line 81, in wrapper
Feb 13 12:45:23 standpc python[22419]: return f(*args, **kwargs)
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/date_parser.py", line 26, in parse
Feb 13 12:45:23 standpc python[22419]: date_obj, period = parse(date_string, settings=settings)
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/parser.py", line 70, in parse
Feb 13 12:45:23 standpc python[22419]: raise exceptions.pop(-1)
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/parser.py", line 64, in parse
Feb 13 12:45:23 standpc python[22419]: res = parser(datestring, settings)
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/parser.py", line 438, in parse
Feb 13 12:45:23 standpc python[22419]: po = cls(tokens.tokenize(), settings)
Feb 13 12:45:23 standpc python[22419]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/parser.py", line 272, in __init__
Feb 13 12:45:23 standpc python[22419]: datetime(**params)
Feb 13 12:45:23 standpc python[22419]: TypeError: Required argument 'day' (pos 3) not found
Feb 13 12:45:23 standpc systemd[1]: paperless-consumer.service: Main process exited, code=exited, status=1/FAILURE
Feb 13 12:45:23 standpc systemd[1]: paperless-consumer.service: Failed with result 'exit-code'.

The file has dates structured like 130218, 2018, 20180213 and 13.02.2018 (all as values of input fields).

paperless doesn't have to detect those dates, but it should catch the exception and fall back to using the default date.
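
For anyone reading along, the last frame of the traceback can be reproduced without paperless or dateparser. The snippet below is only a minimal sketch: `build_datetime` is a hypothetical helper, not code from either project, and it assumes the failing call received year and month but no day. The exact wording of the TypeError message depends on the Python version.

```python
from datetime import datetime


def build_datetime(params):
    """Build a datetime from keyword parts; return None if a part is missing.

    Hypothetical helper for illustration only, not part of paperless
    or dateparser.
    """
    try:
        return datetime(**params)
    except TypeError as exc:
        # datetime() requires year, month and day; leaving one out raises
        # a TypeError that names the missing argument.
        print("incomplete date parts:", exc)
        return None


# All three required parts present: works.
complete = build_datetime({"year": 2018, "month": 2, "day": 13})

# 'day' missing (assumed to mirror the crash above): handled, returns None.
incomplete = build_datetime({"year": 2018, "month": 2})
```

Catching the TypeError at this level is one option; catching it further up, around the whole date detection, would have the same effect on the consumer.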

BastianPoe commented 6 years ago

Please try out if #302 fixes your problem.

Findus23 commented 6 years ago

@BastianPoe Yes, I still get the same error after merging bugfix/extend_regex_to_find_more_dates and restarting paperless (just with a different line number, obviously):

File "/home/lukas/git/paperless/src/paperless_tesseract/parsers.py", line 228, in get_date

BastianPoe commented 6 years ago

Please provide all the dates contained in the output of pdftotext or tesseract. I will extend the unit test collection and enhance the regex.

Findus23 commented 6 years ago

I did a lot of testing: I converted the file with pdftotext and kept removing data from the output until the error went away.

This is the most minimal example I was able to create:

Wohnort

3100

IBAN

AT87 4534

1234

1234 5678

BIC

(Not the real numbers but others that also trigger the error)

PDF: https://seafile.lw1.at/f/9b19d14037054e349d5f/?dl=1

I think it's the line starting with AT as other numbers seem to work:

Detected document date 02/12/34 based on string AT12 1234
Detected document date 02/12/34 based on string AT12 3534

Complete log:

Feb 13 16:23:31 standpc python[4302]: Parsers available: RasterisedDocumentParser
Feb 13 16:23:31 standpc python[4302]: Consuming /media/daten/Dokumente/Eingang/test.pdf
Feb 13 16:23:31 standpc python[4302]: convert-im6.q16: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `/tmp/paperless/paperless-ybflu5sq/convert-%04d.png' @ warning/png.c/MagickPNGWarningHandler/1654.
Feb 13 16:23:37 standpc python[4302]: [image2 @ 0x5609007496e0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
Feb 13 16:23:37 standpc python[4302]: [image2 @ 0x5609007496e0] Encoder did not produce proper pts, making some up.
Feb 13 16:23:37 standpc python[4302]: Processing sheet #1: /tmp/paperless/paperless-ybflu5sq/convert-0000.pnm -> /tmp/paperless/paperless-ybflu5sq/convert-0000.unpaper.pnm
Feb 13 16:23:37 standpc python[4302]: OCRing the document
Feb 13 16:23:37 standpc python[4302]: Parsing for deu
Feb 13 16:23:38 standpc python[4302]: Parsing for eng
Feb 13 16:23:39 standpc python[4302]: Traceback (most recent call last):
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/git/paperless/src/manage.py", line 18, in <module>
Feb 13 16:23:39 standpc python[4302]: execute_from_command_line(sys.argv)
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
Feb 13 16:23:39 standpc python[4302]: utility.execute()
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
Feb 13 16:23:39 standpc python[4302]: self.fetch_command(subcommand).run_from_argv(self.argv)
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
Feb 13 16:23:39 standpc python[4302]: self.execute(*args, **cmd_options)
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/django/core/management/base.py", line 330, in execute
Feb 13 16:23:39 standpc python[4302]: output = self.handle(*args, **options)
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/git/paperless/src/documents/management/commands/document_consumer.py", line 57, in handle
Feb 13 16:23:39 standpc python[4302]: self.loop()
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/git/paperless/src/documents/management/commands/document_consumer.py", line 67, in loop
Feb 13 16:23:39 standpc python[4302]: self.file_consumer.consume()
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/git/paperless/src/documents/consumer.py", line 121, in consume
Feb 13 16:23:39 standpc python[4302]: date = parsed_document.get_date()
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/git/paperless/src/paperless_tesseract/parsers.py", line 228, in get_date
Feb 13 16:23:39 standpc python[4302]: 'RETURN_AS_TIMEZONE_AWARE': True})
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/conf.py", line 81, in wrapper
Feb 13 16:23:39 standpc python[4302]: return f(*args, **kwargs)
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/__init__.py", line 53, in parse
Feb 13 16:23:39 standpc python[4302]: data = parser.get_date_data(date_string, date_formats)
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/date.py", line 404, in get_date_data
Feb 13 16:23:39 standpc python[4302]: locale, date_string, date_formats, settings=self._settings)
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/date.py", line 177, in parse
Feb 13 16:23:39 standpc python[4302]: return instance._parse()
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/date.py", line 187, in _parse
Feb 13 16:23:39 standpc python[4302]: date_obj = parser()
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/date.py", line 209, in _try_parser
Feb 13 16:23:39 standpc python[4302]: self._get_translated_date(), settings=self._settings)
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/conf.py", line 81, in wrapper
Feb 13 16:23:39 standpc python[4302]: return f(*args, **kwargs)
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/date_parser.py", line 26, in parse
Feb 13 16:23:39 standpc python[4302]: date_obj, period = parse(date_string, settings=settings)
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/parser.py", line 70, in parse
Feb 13 16:23:39 standpc python[4302]: raise exceptions.pop(-1)
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/parser.py", line 64, in parse
Feb 13 16:23:39 standpc python[4302]: res = parser(datestring, settings)
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/parser.py", line 438, in parse
Feb 13 16:23:39 standpc python[4302]: po = cls(tokens.tokenize(), settings)
Feb 13 16:23:39 standpc python[4302]: File "/home/lukas/.virtualenvs/paperless/lib/python3.6/site-packages/dateparser/parser.py", line 272, in __init__
Feb 13 16:23:39 standpc python[4302]: datetime(**params)
Feb 13 16:23:39 standpc python[4302]: TypeError: Required argument 'day' (pos 3) not found

To make sure it isn't the OCR, I added

print(text)
print("---------------------------")
print(m.group(0))

into the parser and it returns:

3100
IBAN
AT87 4534
1234
1234 5678
BIC
---------------------------
AT87 4534

ddddavidmartin commented 6 years ago

Just a thought: ideally the date parsing would be able to handle all date formats eventually. But in general, would it be worth catching all errors that the dateparser can raise and simply ignoring the parsing for that document if anything goes wrong? Maybe with a note in the log to raise it as an issue. Because having the consumer crash is pretty bad and definitely should never happen in my opinion.
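
To make the idea concrete, here is a rough sketch of such a catch-all guard. Everything in it is an assumption for illustration: `guess_date`, `fragile_parse` and the logger name are hypothetical and not taken from paperless, and the stand-in parser only mimics a date parser that can raise on unexpected input.

```python
import logging
from datetime import datetime

logger = logging.getLogger("consumer_sketch")  # hypothetical logger name


def guess_date(text, parse, default=None):
    """Return parse(text), or default if parsing raises or finds nothing."""
    try:
        result = parse(text)
    except Exception:
        # Log the full traceback so the failure can still be reported as an
        # issue, but never let it stop document consumption.
        logger.exception("Date parsing failed for %r; using default", text)
        return default
    return result if result is not None else default


def fragile_parse(text):
    """Hypothetical stand-in for a date parser that may raise."""
    day, month, year = (int(part) for part in text.split("."))
    return datetime(year, month, day)


fallback = datetime(2000, 1, 1)  # arbitrary default for the demo
ok = guess_date("13.02.2018", fragile_parse, default=fallback)
bad = guess_date("AT87 4534", fragile_parse, default=fallback)
```

Catching `Exception` this broadly is usually discouraged, but at this boundary it keeps one odd document from taking down the whole consumer, and the logged traceback preserves the information needed for a bug report.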

In any case, I much appreciate the efforts @BastianPoe!

BastianPoe commented 6 years ago

So, good news: I have a unit test to reproduce the bug (see #302).

Next steps are:
a) add proper exception handling so that document consumption does not stop whenever something goes wrong
b) fix the problem at hand

Thank you very much for your effort to reproduce the problem!

BastianPoe commented 6 years ago

Ok, @ddddavidmartin and @Findus23, could you please give #302 another try and see if this works for you? The error handling is improved and the regex is more selective.

ddddavidmartin commented 6 years ago

Thanks @BastianPoe! I just gave it a quick try. My other computer died so I just tried it locally. For some reason it did print the date in the wrong order in the log, but it stores it correctly.

Detected document date 04/05/81 based on string 05/04/1981

In the code it is date.strftime("%x"), which is meant to use the locale's appropriate date format, so this is most likely an issue on my end. Just mentioning it in case someone else stumbles across it.
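
A small sketch of why the log line can look swapped, assuming the string 05/04/1981 was read day-first (5 April 1981) and the process runs with the "C" locale for time formatting. Both are assumptions about my setup, not something confirmed in the paperless code.

```python
import locale
from datetime import date

# Assumption: 05/04/1981 was parsed day-first, i.e. 5 April 1981.
d = date(1981, 4, 5)

# Pin LC_TIME to the "C" locale so the result is reproducible; in that
# locale %x renders as month/day/two-digit year.
locale.setlocale(locale.LC_TIME, "C")
locale_dependent = d.strftime("%x")   # "04/05/81"

# Explicit formats do not depend on the locale settings.
explicit = d.strftime("%d/%m/%Y")     # "05/04/1981"
iso = d.isoformat()                   # "1981-04-05"
```

So the stored date can be right while the log shows it month-first; an explicit or ISO format in the log message would avoid the ambiguity.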

I did not try the error handling explicitly but it looks good to me. 👍

BastianPoe commented 6 years ago

Sounds like we can close this issue now after #302 has been merged. @ddddavidmartin, @Findus23 you ok with that?

Findus23 commented 6 years ago

Ah, I got confused about which issue you meant 😄 I’ll test it tomorrow and report back.

Findus23 commented 6 years ago

Okay, it took me a bit longer, but I just tested it and it worked (and even detected the correct date).

Thanks to all of you for this great software.