marcelbrueckner / paperless.sh

(Not only) Shell scripts around Paperless-ngx
https://paperless.sh/
MIT License

Nothing to do #4

Closed docb7 closed 3 months ago

docb7 commented 3 months ago

Hi, first of all, thanks a lot for the scripts. I am an absolute beginner with Linux and Python but was able to install everything (Paperless-ngx running in a Docker container on Synology), although it took me around 5 hours to get everything more or less working ;-)

But something is still not working, and I think the filter is the problem. Could you please tell me how I need to replace the - "$DOCUMENT_ARCHIVE_FILENAME" entry? I am not sure how to do that, and I think that is why my log says:

[2024-05-30 06:37:03,725] [INFO] [paperless.consumer] Config: "/tmp/organize.config.Rfk6SG.yml"
[2024-05-30 06:37:03,726] [INFO] [paperless.consumer]
[2024-05-30 06:37:03,726] [INFO] [paperless.consumer] Nothing to do
[2024-05-30 06:37:03,726] [INFO] [paperless.consumer]
[2024-05-30 06:37:03,726] [INFO] [paperless.consumer]

This is my organize.config.yml.tpl

# YAML ANCHORS
# This filters the exact document that has been consumed
.locations: &current_document
  - path: "{env.DOCUMENT_ARCHIVE_DIR}"
    filter:
      # Needs to be replaced with e.g. `envsubst`
      # as organize doesn't replace environment placeholders in filter
      - "$DOCUMENT_ARCHIVE_FILENAME"

# RULES
rules:
  - name: "Rechnungen"
    locations: *current_document
    filters:
      - filecontent: "Rechnung"
      - filecontent: "(?P<title>Titel ändern)"
      - filecontent: 'Amount due.*(?P<amount>\d{2}\.\d{2})'
    actions:
      - echo: "Eigenes Script hat ausgelöst"
#      - shell: "./pngx-update-document.py --url http://localhost:8000 --document-id {env.DOCUMENT_ID} --title '{filecontent.title}' --custom-field-id 1 --custom-field-value {filecontent.amount}"
      - echo: "{shell.output}"

This is my post-consumption-wrapper.sh:



# paperless-ngx post-consumption script
#
# https://docs.paperless-ngx.com/advanced_usage/#post-consume-script
#

SCRIPT_PATH=$(readlink -f "$0")
SCRIPT_DIR=$(dirname "$SCRIPT_PATH")

# Add additional information to document
# Make sure organize-tool and poppler-utils have been installed
# on your system (resp. container, via custom-cont-init.d)

# organize-tool doesn't accept full file path as argument
# but expects directory and filename pattern without extension instead
export DOCUMENT_ARCHIVE_FILENAME=$(basename "${DOCUMENT_ARCHIVE_PATH}")
export DOCUMENT_ARCHIVE_DIR=$(dirname "${DOCUMENT_ARCHIVE_PATH}")
# Have something written into Paperless-ngx logs
echo "Hello. Post-consumption script here."

# While organize supports environment variables as placeholders in its configuration,
# it's not yet supported everywhere in the configuration (e.g. filters),
# thus leveraging envsubst to replace environment placeholders
ORGANIZE_CONFIG_PATH=$(mktemp --suffix=.yml ${TMPDIR:-/tmp}/organize.config.XXXXXX)
envsubst < "${SCRIPT_DIR}/organize/organize.config.yml.tpl" > "${ORGANIZE_CONFIG_PATH}"

# Execute configured actions
# Add `--format errorsonly` to suppress most of organize's output in logs
organize run "${ORGANIZE_CONFIG_PATH}" --working-dir "${SCRIPT_DIR}/organize"

# Clean up
rm -f "${ORGANIZE_CONFIG_PATH}"
echo "

A document with an id of ${DOCUMENT_ID} was just consumed.  I know the
following additional information about it:
* Generated File Name: ${DOCUMENT_FILE_NAME}
* Archive Path: ${DOCUMENT_ARCHIVE_PATH}
* Source Path: ${DOCUMENT_SOURCE_PATH}
* Created: ${DOCUMENT_CREATED}
* Added: ${DOCUMENT_ADDED}
* Modified: ${DOCUMENT_MODIFIED}
* Thumbnail Path: ${DOCUMENT_THUMBNAIL_PATH}
* Download URL: ${DOCUMENT_DOWNLOAD_URL}
* Thumbnail URL: ${DOCUMENT_THUMBNAIL_URL}
* Correspondent: ${DOCUMENT_CORRESPONDENT}
* Tags: ${DOCUMENT_TAGS}

It was consumed with the passphrase ${PASSPHRASE}
"

Or did I put the environment variables in the wrong file? I've put them in the .env for docker-compose (the file .env in the same path as docker-compose.yml; in docker-compose.yml I have the line env_file: docker-compose.env).

And by the way: am I right that the third "filecontent" line is the regex that extracts the amount of the bill, so I would need to adapt it if I want to search for "Gesamtbetrag", "Gesamt:" or "Brutto "?
Best regards
Ben

marcelbrueckner commented 3 months ago

Hi @docb7,

$DOCUMENT_ARCHIVE_FILENAME is determined within your post-consumption script and replaced with the value of the corresponding environment variable using envsubst before calling organize, so you should be fine with this as is.
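
To illustrate what the substitution step does to the template, here is a minimal Python sketch that mimics it with `string.Template`. This is not how the wrapper script works (it calls `envsubst`), and the file name `0000102.pdf` is a made-up value. One difference to keep in mind: `envsubst` replaces unset variables with an empty string, while `safe_substitute` leaves them untouched.

```python
import string


def render_template(template_text: str, env: dict) -> str:
    """Replace $VAR and ${VAR} placeholders; unknown ones stay as-is."""
    return string.Template(template_text).safe_substitute(env)


template = (
    'path: "{env.DOCUMENT_ARCHIVE_DIR}"\n'
    'filter:\n'
    '  - "$DOCUMENT_ARCHIVE_FILENAME"\n'
)

# Hypothetical value; Paperless-ngx provides the real one at consumption time
env = {"DOCUMENT_ARCHIVE_FILENAME": "0000102.pdf"}

rendered = render_template(template, env)
print(rendered)
```

The `{env.DOCUMENT_ARCHIVE_DIR}` placeholder contains no `$`, so it passes through unchanged and is left for organize to resolve.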

For debugging purposes you could comment out the removal of the organize config file at the end of the post-consumption script and inspect its content within the container. Or just cat "${ORGANIZE_CONFIG_PATH}" to have it written to your paperless-ngx logs.

I guess the reason for organize telling you it has nothing to do is because of your filecontent filter. You have three filters defined and all of them have to match for organize to proceed with the defined actions. I doubt that your invoice contains the phrase "Titel ändern". Please see my comments in the YAML below.

    filters:
        # Looks for the word "Rechnung" literally
        - filecontent: "Rechnung"
        # Looks for the phrase "Titel ändern" literally and saves it into a variable called `title`
        - filecontent: "(?P<title>Titel ändern)"
        # Looks for the phrase "Amount due", followed by anything, followed by two two-digit numbers separated by a period and saves it into a variable called `amount`
        # The phrase needs to be adapted to the language and words printed on your invoice (like you mentioned already)
        - filecontent: 'Amount due.*(?P<amount>\d{2}\.\d{2})'
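
The "all filters have to match" behaviour described above can be mimicked with Python's `re` module. This is only a sketch of the semantics, not organize's actual implementation; `match_all` and the sample invoice text are made up for illustration.

```python
import re

FILTERS = [
    r"Rechnung",
    r"(?P<title>Titel ändern)",
    r"Amount due.*(?P<amount>\d{2}\.\d{2})",
]


def match_all(patterns, text):
    """Return the merged named groups if every pattern matches, else None."""
    captured = {}
    for pattern in patterns:
        m = re.search(pattern, text)
        if m is None:
            return None  # one failing filter is enough to skip the rule
        captured.update(m.groupdict())
    return captured


# Hypothetical OCR text of an invoice
invoice = "Rechnung Nr. 42\nAmount due: 19.99"

all_three = match_all(FILTERS, invoice)
without_title = match_all([FILTERS[0], FILTERS[2]], invoice)
print(all_three)
print(without_title)
```

With all three filters the result is `None` because "Titel ändern" does not occur in the text; dropping that filter lets the remaining two match and capture `amount`.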

docb7 commented 3 months ago

Great, thank you so much! Now I understand the system and I was able to use it (almost) correctly. I still have two errors:

  1. The amount is not read correctly. I changed the regex to \d*\.?\d+\,\d{2} (and tested it with https://regex101.com) because the amount is usually something like 1.471,60 (as in the PDF I used). But the amount is not stored correctly (the variable seems to contain just 1,60).
  2. Even this potentially wrong value cannot be stored in the custom field with ID 1, which has the name "Betrag" and is defined as a currency field.

Here the log:

[2024-05-30 09:36:18,007] [INFO] [paperless.consumer]     - (echo) Eigenes Script hat ausgelöst - Betrag: 1,60
[2024-05-30 09:36:18,007] [INFO] [paperless.consumer]     - (shell) $ ./pngx-update-document.py --url http://10.11.33.10:34343
[2024-05-30 09:36:18,008] [INFO] [paperless.consumer] --document-id 102 --custom-field-id 1 --custom-field-value 1,60
[2024-05-30 09:36:18,008] [INFO] [paperless.consumer]     - (shell) ERROR! Command './pngx-update-document.py --url
[2024-05-30 09:36:18,008] [INFO] [paperless.consumer] http://10.10.10.10:3434 --document-id 102 --custom-field-id 1
[2024-05-30 09:36:18,008] [INFO] [paperless.consumer] --custom-field-value 1,60' returned non-zero exit status 1.
[2024-05-30 09:36:18,008] [INFO] [paperless.consumer]
[2024-05-30 09:36:18,009] [INFO] [paperless.consumer] success 0 / fail 1

And here my new rule:

rules:
  - name: "Rechnungen"
    locations: *current_document
    filters:
      - filecontent: "Rechnung"
      - filecontent: 'Gesamt.*(?P<amount>\d*\.?\d+\,\d{2})'
    actions:
      - echo: "Eigenes Script hat ausgelöst - Betrag: {filecontent.amount}"
#      - shell: "./pngx-update-document.py --url http://10.11.33.10:34343 --document-id {env.DOCUMENT_ID} --title '{filecontent.title}' --custom-field-id 1 --custom-field-value {filecontent.amount}"
      - shell: "./pngx-update-document.py --document-id {env.DOCUMENT_ID} --custom-field-id 1 --custom-field-value {filecontent.amount}"
      - echo: "{shell.output}"

docb7 commented 3 months ago

Hi, just in case someone reads this: after 5 hours of trial and error, I was able to solve it:

  1. my regex was not good. This one works for me: - filecontent: '\b(Gesamt|Total|Zahlbetrag|Rechnungsbetrag|Endbetrag|Gesamtsumme|Bruttobetrag)\b(\D*)(?P<amount>(\d*\.?\d+\,\d{2}))'

  2. the pngx-update-document.py did not work because it was not able to retrieve the paperless URL and the token from the environment variables. I solved that by adding those arguments when calling the script: - shell: "./pngx-update-document.py --url http://10.10.10.10:34355 --auth-token 343mytoken890 --document-id {env.DOCUMENT_ID} --custom-field-id 1 --custom-field-value {filecontent.amount}"

  3. And something else: to ensure that the document custom field can be updated when the amount uses German currency formatting (which is like 1.1233,45), I needed to add the following lines to pngx-update-document.py:

    # Update custom field
    # Only if both --custom-field-id and --custom-field-value have been specified
    if all(param is not None for param in [args.custom_field_id, args.custom_field_value]):
        # For my ngx, the custom field with id 1 is currency, therefore I want to replace the funny German format (e.g. 1.1234,56)
        if args.custom_field_id == 1:
            # Drop the thousands separator, then turn the decimal comma into a point
            temp_value = args.custom_field_value.replace('.', '')
            temp_value = temp_value.replace(',', '.')
        else:
            temp_value = args.custom_field_value
        new_field = {
            "field": args.custom_field_id,
            "value": temp_value
        }

    Now it works like a charm ;-)

marcelbrueckner commented 3 months ago

I need to admit that I'm by no means an expert when it comes to regular expressions. I play around with a combination of regex101.com and stackoverflow.com :)

Did you check the OCR content of your file? Sometimes spaces or other characters are recognized where there are none, so the value 1.471,60 might have been recognized as 1.47 1,60 or so. If not, you should check if your regular expression is greedy enough.
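
Another possible explanation for the truncated value 1,60, offered here as an assumption that was not confirmed in the thread: the `.*` after "Gesamt" is greedy and consumes as much as it can, including the leading digits of the amount. The sketch below compares three prefixes in plain Python `re`; the sample text is made up.

```python
import re

# Hypothetical line from an invoice
text = "Gesamt: 1.471,60"
AMOUNT = r"(?P<amount>\d*\.?\d+,\d{2})"


def extract(prefix):
    """Return the captured amount for the given prefix pattern, or None."""
    m = re.search(prefix + AMOUNT, text)
    return m.group("amount") if m else None


greedy = extract(r"Gesamt.*")       # .* swallows "1.47" as well
lazy = extract(r"Gesamt.*?")        # .*? stops as early as possible
non_digits = extract(r"Gesamt\D*")  # \D* cannot consume any digit

print(greedy, lazy, non_digits)
```

Both the lazy quantifier and `\D*` keep the full amount, because neither can eat into the digits that belong to it.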

Your other issue is solved relatively easily: Paperless-ngx expects the decimal separator to be a point (.), so you need to convert your value. I'm using organize's python filter for this, so your rule could look like the following:

rules:
  - name: "Rechnungen"
    locations: *current_document
    filters:
      - filecontent: "Rechnung"
      - filecontent: 'Gesamt.*(?P<amount>\d*\.?\d+\,\d{2})'
      # Remove thousands separator, replace decimal comma with decimal point
      - python: |
          return {
              "amount": float(filecontent['amount'].replace('.','').replace(',','.')),
          }
    actions:
      - echo: "Eigenes Script hat ausgelöst - Betrag: {filecontent.amount}"
#      - shell: "./pngx-update-document.py --url http://10.11.33.10:34343 --document-id {env.DOCUMENT_ID} --title '{filecontent.title}' --custom-field-id 1 --custom-field-value {filecontent.amount}"
      - shell: "./pngx-update-document.py --document-id {env.DOCUMENT_ID} --custom-field-id 1 --custom-field-value {python.amount}"
      - echo: "{shell.output}"
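
The conversion inside the python filter can be tried out on its own. The sketch below uses `Decimal` instead of `float`, which is a swapped-in choice and not what the rule above does; the function name is made up. The point of the swap is that `Decimal` keeps both decimal places when the value is turned back into a string.

```python
from decimal import Decimal


def german_amount_to_decimal(value: str) -> Decimal:
    """Convert e.g. '1.471,60' to Decimal('1471.60')."""
    # Remove thousands separators, then turn the decimal comma into a point
    return Decimal(value.replace(".", "").replace(",", "."))


as_decimal = german_amount_to_decimal("1.471,60")
as_float = float("1.471,60".replace(".", "").replace(",", "."))

print(str(as_decimal))  # keeps the trailing zero
print(str(as_float))    # float drops the trailing zero
```

Whether the trailing zero matters depends on how the receiving field validates its input, which is not covered here.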

I'm currently in the process of programming a comprehensive Paperless-ngx CLI tool that will eventually handle such cases automatically.

docb7 commented 3 months ago

Thank you so much! It works like a charm. I posted "my" solution some minutes ago, which uses many more lines, but as I said, I am a beginner ;-) I am looking forward to your CLI! Best regards!

docb7 commented 3 months ago

Hi, sorry to reopen this. I have a very annoying invoice and I always get "Nothing to do" although there is something to do ;-) And the regex101 test works perfectly. Do you have an idea how I can figure out what goes wrong? My organize.config.yml.tpl:

# RULES
rules:
  - name: "Rechnung (Betrag mit Komma)"
    locations: *current_document
    filters:
      - filecontent: '(?s:.*\s)(Gesamt|Total|Zahlbetrag|Rechnungsbetrag|Endbetrag|Gesamtsumme|Bruttobetrag|Gesamtbetrag brutto|Zu zahlender Betrag)(\D*)(?P<amount>(\d*\.?\d+\,\d{2}))'
      - python: |
          return {
              "amount": float(filecontent['amount'].replace('.','').replace(',','.')),
          }
    actions:
      - echo: "Script -Rechnung Gesamt- hat ausgelöst - Betrag: {filecontent.amount}"
      - shell: "./pngx-update-document.py --url http://10.11.33.10:34343 --auth-token 4b6xxxx --document-id {env.DOCUMENT_ID} --custom-field-id 1 --custom-field-value {python.amount}"
      - echo: "{shell.output}"

  - name: "Rechnung (Betrag mit Punkt)"
    locations: *current_document
    filters:
      - filecontent: '(?s:.*\s)(Gesamt|Total|Zahlbetrag|Rechnungsbetrag|Endbetrag|Gesamtsumme|Bruttobetrag|Gesamtbetrag brutto|Zu zahlender Betrag)(\D*)(?P<amount>(\d*\.?\d+\.\d{2}))'
    actions:
      - echo: "Script -Rechnung Gesamt- hat ausgelöst - Betrag: {filecontent.amount}"
      - shell: "./pngx-update-document.py --url http://10.11.33.10:34343 --auth-token 4b6xxxx --document-id {env.DOCUMENT_ID} --custom-field-id 1 --custom-field-value {filecontent.amount}"
      - echo: "{shell.output}"

And the OCR Result of the Document from paperless is:

VPS Linux PA-S50 (LZ12.23)
Pos. Artikelbezeichnung EUR Brutto
===========================================================================

1 VPS Linux PA-S50 (LZ12.23): EUR 10,60
1 Monat(e) im Voraus (vom 05.06.2024 bis 04.07.2024),
Preis/Monat: EUR 10,60 in Summe EUR 10,60 USt.: 19,00%

===========================================================================
Umsatzsteuer (19,00%) EUR 1,69
Entspricht der Summe netto EUR 8,91
===========================================================================
Summe Rechnungsbetrag EUR 10,60

Der Rechnungsbetrag i.H.v. 10.60 EUR wird entsprechend der
Prenotification von Ihrem Konto abgebucht.

Regex gives me a match on 10,60 and 10.60 (rule 1 / rule 2). Best b

marcelbrueckner commented 3 months ago

As said, I'm not a regex expert. Perhaps https://github.com/tfeldmann/organize/issues/ would be a better place to address this. But I will try:

What's the point of having (?s:.*\s) at the beginning of your regex? This matches basically everything (and a single whitespace character at the end) before your list of words. But the regex should also work without it (at least it does in regex101). I like to keep my regex as "simple" as possible, especially because an overly complicated regex might exceed the available computing resources.

Additional hint for rule 1: the escape sequence before the comma (\,) isn't necessary. Not sure if organize is bothered by this.

Additional hint for rule 2: If the decimal separator is a point, the thousands separator is likely to be a comma ;)
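
For amounts that use a decimal point, a pattern with a comma as thousands separator could look like the sketch below. The pattern and the helper name `parse_point_decimal` are illustrative assumptions, not something taken from the scripts in this repository.

```python
import re

# Either grouped thousands (1,471.60) or plain digits (10.60), two decimals
POINT_DECIMAL = re.compile(r"(?P<amount>(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2})")


def parse_point_decimal(text: str):
    """Return the first point-decimal amount in text as float, or None."""
    m = POINT_DECIMAL.search(text)
    if m is None:
        return None
    # Only the thousands separator has to go; the decimal point stays
    return float(m.group("amount").replace(",", ""))


print(parse_point_decimal("Total 1,471.60"))
print(parse_point_decimal("i.H.v. 10.60 EUR"))
print(parse_point_decimal("EUR 10,60"))
```

The last call returns `None`, since a comma-decimal amount does not fit this pattern and would need the comma-based rule instead.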

docb7 commented 3 months ago

Hi, thanks, I was hoping there would be a kind of debugger ;-) The (?s:.*\s) at the beginning ensures that the last occurrence is used. In the example above, the word "Rechnungsbetrag" appears several times. That is no problem in this particular example, but often there is first "Rechnungsbetrag netto" (without taxes) and then "Rechnungsbetrag brutto" (with taxes), which is why I look for the last occurrence.
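
The effect of the `(?s:.*\s)` prefix can be reproduced in plain Python. This is a sketch with made-up sample text; it also shows `re.finditer` as an alternative way to pick the last occurrence, which is an option in a standalone script but not necessarily inside an organize filter.

```python
import re

KEYWORD = r"(Rechnungsbetrag)"
AMOUNT = r"(?P<amount>\d*\.?\d+,\d{2})"
PATTERN = KEYWORD + r"\D*" + AMOUNT

# Hypothetical OCR text with a net and a gross amount
text = "Rechnungsbetrag netto EUR 8,91\nRechnungsbetrag brutto EUR 10,60"

# Without the prefix, re.search returns the first occurrence
first = re.search(PATTERN, text).group("amount")

# The greedy, dot-matches-newline prefix pushes the match to the last occurrence
last_with_prefix = re.search(r"(?s:.*\s)" + PATTERN, text).group("amount")

# Alternative: collect all matches and take the final one
last_with_finditer = [m.group("amount") for m in re.finditer(PATTERN, text)][-1]

print(first, last_with_prefix, last_with_finditer)
```

Note that the prefix requires a whitespace character before the keyword, so a keyword at the very start of the text would not match with it.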

And thank you for the hints, but they did not work. regex101 still recognizes my search pattern, but organize does not. I'll try the link you provided. Anyway, thanks a lot!

docb7 commented 3 months ago

Hi, I think I found the problem: it is the pdf, because although paperless can read it, organize cannot.

I added a "debugger" to find the possible error, so that the whole file content will be shown in the paperless log:

rules:
  - name: "Debug"
    locations: *current_document
    filters:
      - filecontent
    actions:
      - echo: "Debug - Content of PDF: {filecontent}"

And that shows the content of all files I add, except for the one invoice I had the problem with. In the "Content" tab of paperless-ngx there is the complete text, but organize seems to use another method to read the PDF, which does not seem to work with this document. When I look at the properties of the document I find "Safety: Password encrypted" (although there is no password) and Coding-Software: "iText® 5.4.0 ©2000-2012 1T3XT BVBA (AGPL-version) (AGPL-version)", and I guess that is the problem.

Is it possible to use the text from the Content tab in paperless-ngx instead of the (I guess) poppler results? That might even speed up performance, I guess?

docb7 commented 3 months ago

So, to solve that, I took your script as a starting point and wrote my own with the help of ChatGPT, and now it works like a charm. No additional packages necessary:

post-consumption-wrapper.sh

#!/usr/bin/env bash

# paperless-ngx post-consumption script
#
# https://docs.paperless-ngx.com/advanced_usage/#post-consume-script

SCRIPT_PATH=$(readlink -f "$0")
SCRIPT_DIR=$(dirname "$SCRIPT_PATH")
echo "A document with an id of ${DOCUMENT_ID} was just consumed. Calling paperless_content.py"
# Quote the arguments so paths with spaces do not break the call
python3 "${SCRIPT_DIR}/paperless_content.py" "${DOCUMENT_ID}"

And (to put in the same folder as post-consumption-wrapper.sh): paperless_content.py:

#!/usr/bin/env python3
import re
import sys

import httpx

print("Content Scanner initialized")

# Check if an ID was given
# Usage: paperless_content.py <ID>, e.g. paperless_content.py 175
if len(sys.argv) < 2:
    sys.exit("No ID given")
doc_id = sys.argv[1]
print(f"ID found: {doc_id}")

# Define the API URL and token (placeholders, replace with your own values)
url = 'http://url:port'
api_url = f'{url}/api/documents/{doc_id}/'
token = 'your token'

# Set up the headers for authentication
headers = {
    'Authorization': f'Token {token}',
    'Accept': 'application/json'
}

# Make the request to get the document data
response = httpx.get(api_url, headers=headers)
if response.is_error:
    sys.exit(f"Failed to retrieve document: {response.status_code}, {response.text}")
doc_data = response.json()
# The content may be missing or empty, so fall back to an empty string
doc_content = doc_data.get('content') or ''

# Check if it is an invoice
if "Rechnung" not in doc_content:
    print("No invoice")
    sys.exit()

if "Amazon" in doc_content:
    # Regex pattern to search for "Gesamtpreis"
    pattern = r"(Gesamtpreis)(\D*)(?P<amount>\d*\.?\d+,\d{2})"
# elif "Strato" in doc_content:
#     pattern = r"(Total)(\D*)(?P<amount>\d*\.?\d+,\d{2})"
else:
    # Regex pattern to search for a lot of other possibilities
    pattern = r"(Gesamt|Total|Zahlbetrag|Rechnungsbetrag|Endbetrag|Gesamtsumme|Bruttobetrag|Gesamtbetrag brutto|Zu zahlender Betrag)(\D*)(?P<amount>\d*\.?\d+,\d{2})"

# Extract the amount if the pattern matches
match = re.search(pattern, doc_content)
if not match:
    print("Price not found in document content.")
    sys.exit()
# Convert German formatting (thousands point, decimal comma) to a decimal point
total_price = match.group('amount').replace('.', '').replace(',', '.')
print(f"Price: {total_price}")

# Update custom field no. 1
new_field = {
    "field": 1,
    "value": total_price
}
# Even when patching a single custom field, we need to include all of the document's existing custom fields
# Otherwise, other custom fields will be removed from the document
custom_fields = doc_data.get('custom_fields', [])
# Update custom field value "in-place" if already attached to document (to keep custom field order)
if any(custom_field['field'] == 1 for custom_field in custom_fields):
    custom_fields = [(new_field if custom_field['field'] == 1 else custom_field) for custom_field in custom_fields]
# Otherwise, simply append to the list (append() returns None, so its result must not be assigned)
else:
    custom_fields.append(new_field)

response = httpx.patch(api_url, headers=headers, json={'custom_fields': custom_fields})
if response.is_error:
    sys.exit(f"HTTP error {response.status_code} while trying to update document via REST API at {api_url}.")

print(f"Document with ID {doc_id} successfully updated")
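
The replace-or-append step for custom fields is easy to get wrong, because `list.append` returns `None` and its result must not be assigned back to the list. One way to make that step testable without a running Paperless-ngx instance is to isolate it in a pure function. The sketch below is an illustration; `merge_custom_field` is a made-up helper name and the field values are sample data.

```python
def merge_custom_field(existing: list, new_field: dict) -> list:
    """Return a new list with new_field replacing its match or appended."""
    field_id = new_field["field"]
    if any(cf["field"] == field_id for cf in existing):
        # Replace in place to keep the order of the custom fields
        return [new_field if cf["field"] == field_id else cf for cf in existing]
    # Build a new list instead of assigning the result of append()
    return existing + [new_field]


existing = [{"field": 2, "value": "abc"}, {"field": 1, "value": "0.00"}]
replaced = merge_custom_field(existing, {"field": 1, "value": "10.60"})
appended = merge_custom_field([{"field": 2, "value": "abc"}], {"field": 1, "value": "10.60"})

print(replaced)
print(appended)
```

The returned list can then be sent as the `custom_fields` value of the PATCH request, as the script above does.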

Hope that helps ;-)