jgillula / paperless-ngx-postprocessor

A powerful and customizable postprocessing script for paperless-ngx
GNU Affero General Public License v3.0

regex_sub issue #20

Open cookyr opened 5 months ago

cookyr commented 5 months ago
Hi,

I want to use regex_sub to extract time information from filenames. But this function is not working for me, same for example.yml. I tried in the metadata_postprocessing section:

yy: '{{ "2099-05-4-kaese" | regex_sub(".(\d{4}).","\1") }}'
title: 'test-{{yy}}-end'

There's no error in the webserver's log, but the document arrives in paperless as 'test--end'. I also tried to use re.sub directly; as far as I understand Jinja, this should be feasible?

Any idea?

Best regards Rüdiger

Originally posted by @cookyr in https://github.com/jgillula/paperless-ngx-postprocessor/issues/3#issuecomment-2139432017

jgillula commented 5 months ago

Weird--it works for me, although your regex doesn't actually do anything for that particular input string (i.e. I get yy as 2099-05-4-kaese, since the regex doesn't match and doesn't substitute anything, and so the title is set to test-2099-05-4-kaese-end).
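
For anyone following along, here is a minimal sketch of why that pattern is a no-op on this input. It uses Python's standard re module rather than the postprocessor itself, on the assumption that regex_sub behaves like an ordinary regex substitution; the variable names are just for illustration.

```python
import re

text = "2099-05-4-kaese"
# As posted: one arbitrary character, four digits, one arbitrary character.
pattern = r".(\d{4})."

# The only run of four digits ("2099") sits at the very start of the text,
# so there is no character in front of it for the leading "." to consume.
match = re.search(pattern, text)

# With no match, sub() has nothing to replace and returns the input as-is.
result = re.sub(pattern, r"\1", text)

print(match)
print(result)
```

Since nothing matches, the substitution leaves the string untouched, which is consistent with the title coming out as test-2099-05-4-kaese-end.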

Could you set the verbose-level to DEBUG and check the logs again? You can do this one of two ways:

  1. In Paperless-ngx's docker-compose.env file, add the line PNGX_POSTPROCESSOR_VERBOSE=DEBUG, (e.g. right below the PAPERLESS_POST_CONSUME_SCRIPT=... line you added to hook the postprocessor), or
  2. If you're running the management script on the command line to test things, add the command line flag --verbose DEBUG

There's going to be a lot of logs (it's very verbose), but the interesting line will probably start with Updating 'yy' using template {{ "2099-05-4-kaese" | regex_sub(".(\d{4}).","\1") }} and metadata...

cookyr commented 5 months ago

Thanks for your answer!

There was a typo or copy/paste error in my pattern: two asterisks have disappeared. It looks like this in my script: yy: '{{ "2099-05-4-kaese" | regex_sub(".*(\d{4}).*","\1") }}' and I expected yy to yield "2099".

To me this was weird as the regex_match in the match section worked fine.

I'll try the debugging switches next time.

Best regards Rüdiger

jgillula commented 5 months ago

Aha! I tried it with your correct regex, and you're right, it matches. I think the issue is the \1 you're substituting: it's getting interpreted as an escape sequence for a single character (just like \n is a single newline character, or \t is a single tab character), rather than as a backslash followed by a 1.

If we just do this in the Python interpreter, we get:

>>> import regex
>>> regex.sub(".*(\d{4}).*", "\1", "2099-05-4-kaese")
'\x01'

In other words, it's substituting the single control character \x01, not the first captured group as intended.

The solution seems to be to write your substitution string as "\\1".

>>> regex.sub(".*(\d{4}).*", "\\1", "2099-05-4-kaese")
'2099'
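
As a side-by-side comparison, here is a small sketch of the replacement spellings in plain Python, using the standard re module. The raw-string and \g<1> forms are ordinary Python/re features; whether they carry over unchanged to a Jinja string literal inside the YAML rules is an assumption that has not been checked here, so "\\1" remains the form to try in the template.

```python
import re

text = "2099-05-4-kaese"
pattern = r".*(\d{4}).*"

# "\1" in a normal string literal is an octal escape: the character chr(1).
bad = re.sub(pattern, "\1", text)

# Escaping the backslash passes a real backslash plus "1" to the regex engine,
# which then treats it as a reference to group 1.
escaped = re.sub(pattern, "\\1", text)

# A raw string literal keeps the backslash without doubling it.
raw = re.sub(pattern, r"\1", text)

# \g<1> is the explicit group-reference syntax for replacement strings.
explicit = re.sub(pattern, r"\g<1>", text)

print(repr(bad), escaped, raw, explicit)
```

The last three all produce the captured year; only the unescaped "\1" yields the control character.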

If you can confirm that works, let me know and I'll close out this issue. 🙂