invoice-x / invoice2data

Extract structured data from PDF invoices
MIT License

ERROR:root: No template for Invoice.pdf #512

Open nozprod opened 1 year ago

nozprod commented 1 year ago

Hey,

I always get the same error. I tried both custom and predefined templates (using a French Google bill as an example).

Any idea?

My template:

issuer: Google Commerce Limited
fields:
  date: Récapitulatif pour la période suivante\s+:\s+(\d{1,2}\s+\w+\.\s+\d{4})
  ttc: Total en EUR\s+([\d,]+) €
  ht: Sous-total en EUR\s+([\d,]+) €
  tva_rate: TVA \((\d{1,3})%\)
  tva_amount: TVA \(\d{1,3}%\)\s+([\d,]+) €
keywords:
  - Google Commerce Limited
  - IE9825613N
options:
  currency: EUR
  date_formats:
    - '%d .%b .%Y'
  languages:
    - fr
  decimal_separator: ','

An example bill is attached: Invoice.pdf
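One way to check the field patterns independently of invoice2data is to run them against a line of text with Python's re module. This is only a sketch: the patterns are copied from the template above, but the sample strings are made up for illustration and are not taken from the attached invoice. The helper name first_match is hypothetical.

```python
import re

# Patterns copied from the template above.
PATTERNS = {
    "date": r"Récapitulatif pour la période suivante\s+:\s+(\d{1,2}\s+\w+\.\s+\d{4})",
    "ttc": r"Total en EUR\s+([\d,]+) €",
    "tva_rate": r"TVA \((\d{1,3})%\)",
}


def first_match(field, text):
    """Return the first captured group for a field, or None if nothing matches."""
    found = re.search(PATTERNS[field], text)
    return found.group(1) if found else None


# Hypothetical sample lines (assumptions, not real invoice content).
print(first_match("ttc", "Total en EUR    12,34 €"))
print(first_match("tva_rate", "TVA (20%)    2,06 €"))
print(first_match("date", "Récapitulatif pour la période suivante : 1 avr. 2023"))
# The date pattern requires a dot after the month name, so "mai" does not match.
print(first_match("date", "Récapitulatif pour la période suivante : 1 mai 2023"))
```

The last call returns None because the pattern expects a dot after the month, which is worth keeping in mind for months that are not abbreviated.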

bosd commented 1 year ago

I don't see the error. Which message do you get? Which OS are you using?

Did you run it with the --debug flag to get more detailed feedback?

nozprod commented 1 year ago

Hi, here is the full message

[Screenshot: Capture d’écran 2023-05-23 à 20 50 34]

And I'm running macOS 12.6.5.

The --debug flag doesn't help, or I may be doing something wrong... In case it helps, here is my script:

import os
import pytesseract
import argparse
import logging.config
import logging
from pdf2image import convert_from_path
from invoice2data import extract_data
from invoice2data.extract.loader import read_templates
import google.auth
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

# Parse the command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument("--debug", action="store_true",
                    help="enable debug mode")
args = parser.parse_args()

# Configure the logger
if args.debug:
    logging.basicConfig(level=logging.DEBUG)
else:
    logging.basicConfig(level=logging.INFO)

def read_templates_from_folder(folder):
    result = []
    for path, subdirs, files in os.walk(folder):
        for name in files:
            if name.endswith(('.yml', '.yaml')):
                file_path = os.path.join(path, name)
                result.extend(read_templates(file_path))
                print(f"Template loaded from {file_path}")
    return result

def extract_data_from_invoice(pdf_path):
    print("Extracting text from the PDF file...")
    templates = read_templates_from_folder('Templates/')
    try:
        data = extract_data(pdf_path, templates=templates)
        print(f"Data extracted: {data}")
        return data
    except Exception as e:
        print(f"Error during extraction: {e}")
        return None

def extract_invoice_data(data):
    if not data:
        return None

    invoice_data = {}

    date = data.get('date')
    if date:
        invoice_data['date'] = date.strftime("%d/%m/%Y")

    invoice_data['ht'] = data.get('ht')
    invoice_data['tva_rate'] = data.get('tva_rate')
    invoice_data['tva_amount'] = data.get('tva_amount')
    invoice_data['ttc'] = data.get('ttc')

    return invoice_data

def authenticate_google_sheets():
    print("Authenticating and creating the Google Sheets service...")
    creds = None
    SCOPES = ['https://www.googleapis.com/auth/spreadsheets']
    token_path = 'token.json'
    credentials_path = 'credentials.json'

    if creds and creds.expired and creds.refresh_token:
        creds.refresh(Request())
    else:
        flow = InstalledAppFlow.from_client_secrets_file(credentials_path, SCOPES)
        creds = flow.run_local_server(port=0)

    return build('sheets', 'v4', credentials=creds)

def update_google_sheet(service, sheet_id, data):
    print("Updating the Google Sheets spreadsheet...")
    range_name = 'Sheet1!A1:E1'
    values = [[data['date'], data['ht'], data['tva_rate'], data['tva_amount'], data['ttc']]]
    body = {'values': values}

    try:
        result = service.spreadsheets().values().append(
            spreadsheetId=sheet_id, range=range_name,
            valueInputOption='USER_ENTERED', insertDataOption='INSERT_ROWS', body=body).execute()
        print('{0} cells appended.'.format(result.get('updates').get('updatedCells')))
    except HttpError as error:
        print('An error occurred: {0}'.format(error))
        return None

if __name__ == '__main__':
    pdf_path = 'Bills/Invoice.pdf'
    data = extract_data_from_invoice(pdf_path)
    invoice_data = extract_invoice_data(data)
    print("Extracted invoice data:", invoice_data)

    if invoice_data:
        sheets_service = authenticate_google_sheets()
        sheet_id = '1A9UPxQ7uR6znZmZ96xJ7Klg2ycXKdFtGrSHmkVND2hs'
        update_google_sheet(sheets_service, sheet_id, invoice_data)
    else:
        print("No invoice data extracted.")

nozprod commented 1 year ago

Hey @bosd, any idea?

bosd commented 1 year ago

> Hey @bosd, any idea?

Not yet. At first sight the code looks good. Maybe add a try/except block around the template loading, to see whether any errors occur while loading the templates.
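A minimal sketch of that suggestion, assuming the loader is passed in as a callable (in the script above that would be read_templates). The helper name load_with_report is hypothetical and not part of invoice2data. It logs any exception raised while loading and also warns when the loader returns nothing, which would otherwise go unnoticed.

```python
import logging

logger = logging.getLogger(__name__)


def load_with_report(loader, location):
    """Call loader(location), logging failures and empty results.

    `loader` is any callable that returns a list of templates.
    This helper is hypothetical and not part of invoice2data.
    """
    try:
        templates = loader(location)
    except Exception:
        # logger.exception records the full traceback for later inspection.
        logger.exception("Loading templates from %s failed", location)
        return []
    if not templates:
        logger.warning("No templates were loaded from %s", location)
    return list(templates)
```

In the script above, the direct call could then become result.extend(load_with_report(read_templates, file_path)), so that an empty result shows up in the log.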

Did you verify your installation is working by using one of the supplied examples and running it from the command line?

nozprod commented 1 year ago

It doesn't work with the supplied templates either, so it seems it's my installation... I'll go with the try/except block.

[Screenshot: Capture d’écran 2023-06-02 à 16 51 40]

bosd commented 1 year ago

I actually meant that you should check whether your template and input file are correct, by first testing them from the command line, thus bypassing your custom code.

invoice2data Invoice.pdf --input-reader=pdftotext --template-folder=/home/templates --debug

Also, kindly make sure your template includes an exclude_keywords and a priority tag.
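As a quick check of that last point, the sketch below reports which of the two tags are absent from a template. It assumes the template is available as a plain dictionary (for example after parsing the YAML file); the helper name missing_keys is hypothetical, and the dictionary only mirrors the top-level keys of the template posted earlier in this thread.

```python
def missing_keys(template, expected=("exclude_keywords", "priority")):
    """Return the expected keys that are absent from a template mapping."""
    return [key for key in expected if key not in template]


# Top-level keys of the template posted earlier in this thread
# (field patterns and most options are omitted here for brevity).
posted_template = {
    "issuer": "Google Commerce Limited",
    "fields": {},
    "keywords": ["Google Commerce Limited", "IE9825613N"],
    "options": {"currency": "EUR"},
}

print(missing_keys(posted_template))
```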

nozprod commented 1 year ago

I'm sorry, I'm not very familiar with all of this, but I'm trying hard, and this is the new issue I get...

[Screenshot: Capture d’écran 2023-06-09 à 17 07 15]

I tried to find any .DS_Store files and remove them, without success. I also uninstalled and reinstalled invoice2data, and the same issue happens.

bosd commented 1 year ago

Now we are getting somewhere 😁 While adding the JSON support I assumed people were only storing .yml or .json files. Having any other file in your template directory results in an error.

This is actually fixed by #509 in the source code of this repo, but the fix has not yet been released on PyPI.

So to resolve this you can download the source code straight from the master branch and use that. It should work for you.

Or delete all the .DS_Store files. They can be annoying to get rid of and keep popping up, but once they are all gone you should not get this error anymore.
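For the second option, a small standard-library sketch that lists every file in a template folder that does not look like a template and removes the ones named .DS_Store. The function names are made up for this example, and the extension list (.yml, .yaml, .json) is an assumption based on this thread.

```python
import os

# Extensions treated as templates; an assumption based on this thread.
TEMPLATE_EXTENSIONS = (".yml", ".yaml", ".json")


def find_stray_files(folder):
    """Return paths of files under folder that do not have a template extension."""
    stray = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if not name.lower().endswith(TEMPLATE_EXTENSIONS):
                stray.append(os.path.join(root, name))
    return sorted(stray)


def remove_ds_store(folder):
    """Delete only the stray files named .DS_Store and return their paths."""
    removed = []
    for path in find_stray_files(folder):
        if os.path.basename(path) == ".DS_Store":
            os.remove(path)
            removed.append(path)
    return removed
```

Calling find_stray_files('Templates/') first shows what is in the way without deleting anything; remove_ds_store('Templates/') then removes only the .DS_Store files and leaves any other stray file for manual review.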

nozprod commented 1 year ago

Thanks a lot, I'll try today 😉

nozprod commented 1 year ago

So it worked: my templates are working and I'm able to extract data by using Invoice2Data directly \o/ But for some reason, I can't make it work through my custom script 😢 I still get the "No template" error...

bosd commented 1 year ago

> But for some reason, I can't make it work through my custom script 😢

It might still be related to the .DS_Store files or any other file that makes the template directory dirty.

Did you update your installed code from the source in this repo? You can download the zip from this repo and copy the contents of src/invoice2data to the location where the library is installed. For example, on my machine it is: /home/user/.local/lib/python3.11/site-packages/invoice2data/
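The install location differs between machines. A small sketch, using only the standard library, to find where Python would import a package from; the helper name package_directory is made up for this example.

```python
import importlib.util
import os


def package_directory(package_name):
    """Return the directory a package would be imported from, or None if not found."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


# Prints None when the package is not installed for this interpreter.
print(package_directory("invoice2data"))
```

Running this with the same interpreter that runs the custom script also shows whether the script and the command-line tool use the same installation.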

Try to debug your custom template reading, by printing the results.

def read_templates_from_folder(folder):
    result = []
    for path, subdirs, files in os.walk(folder):
        for name in files:
            if name.endswith(('.yml', '.yaml')):
                file_path = os.path.join(path, name)
                result.extend(read_templates(file_path))
                print(f"Template loaded from {file_path}")
    print(result)
    return result
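One possible cause worth ruling out, stated as an assumption rather than a confirmed diagnosis: if read_templates walks the path it receives as a folder (as the --template-folder option of the command line suggests), then calling it with the path of a single .yml file would load nothing, because os.walk yields no entries for a path that is not a directory. The sketch below demonstrates that os.walk behaviour with the standard library only; the helper name and the temporary file are illustrative.

```python
import os
import tempfile


def files_seen_by_walk(path):
    """Return the file names os.walk reports when started at path."""
    seen = []
    for _root, _dirs, files in os.walk(path):
        seen.extend(files)
    return sorted(seen)


with tempfile.TemporaryDirectory() as folder:
    template_file = os.path.join(folder, "google.yml")
    with open(template_file, "w", encoding="utf-8") as handle:
        handle.write("issuer: Google Commerce Limited\n")

    from_folder = files_seen_by_walk(folder)       # walk starts at the folder
    from_file = files_seen_by_walk(template_file)  # walk starts at the file itself

print(from_folder)  # the template file is found
print(from_file)    # nothing is found
```

If that assumption holds, the printed result in the function above would be an empty list, and passing the folder once (templates = read_templates('Templates/')) instead of one call per file would be the thing to try.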

changtraisitinh commented 1 year ago

> So it worked: my templates are working and I'm able to extract data by using Invoice2Data directly \o/ But for some reason, I can't make it work through my custom script 😢 I still get the "No template" error...

I customized the source and installed it with python setup.py install. Hope this helps.