Open abhigyanasatpathy opened 2 weeks ago
Hi, Your steps for adding a template are correct.
Did you verify that your installation of invoice2data is running properly by testing it on one of the example files?
Yes, it is running properly. Thank you for your help. By the way, can you please explain the process again? I have created templates/myinvoice and, inside it, in.myinvoice.yml with regexes matching my PDF. Is that enough to convert my PDF to CSV output, or is there any other step or code I need to add? Please explain it simply. I have already run your existing templates and they work fine.
Your invoked command seems ok.
Some debugging steps:
[x] Verify your installation and the parsing of a sample file.
[ ] Run with the --debug flag to check the output of the invoice-xx.pdf file.
This is likely the problem: invoice2data tries to fall back on ocrmypdf, most likely because it cannot detect any characters with pdftotext.
Is your PDF file a text-based file, or does it need OCR?
[ ] Try your PDF with a different input parser using --input-reader=, passing pdftotext or ocrmypdf.
[ ] Check your template for syntax errors.
My PDF file is a text-based file. As step 1, I have created only one file, in.invoicedemo.yml (path: D:\invoice2data-master\src\invoice2data\extract\templates\in\in.invoicedemo.yml). Is this step enough, or are there other steps I should follow? Is there anywhere else I need to write code?
In the in.invoicedemo.yml file I have worked on the regex expressions and keywords according to my PDF.
When you run invoice2data on the pdf file with the --debug flag, do you see the contents of the file in your logger/terminal?
No, I cannot see the contents of the file. With the --debug flag I can see only the pdftotext data in the logger, but I cannot see any data in the CSV file. I am getting this error in the logger:
DEBUG:root: END pdftotext result =============================
DEBUG:invoice2data.extract.invoice_template: Template: au.com.opal.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: au.com.telstra.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.accor.invest.ibis.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.accor.invest.novotel.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.boucherie.pochet.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.cebeo.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.eg_retail.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.lampiris.facture-dacompte.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.lampiris.factuur.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.lampiris.regularisation.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.melchior-vins.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.proximus.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.scarlet.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.securex.social.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: ch.pcengines.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invo . . .
DEBUG:invoice2data.extract.invoice_template: Template: pl.bmw-fs.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: pl.insert.subiekt-gt.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: pl.insert.subiekt-nexo.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: pl.orlen.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: pl.p4.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: pl.paypro.yml | Failed to match all keywords.
INFO:pikepdf._core: pikepdf C++ to Python logger bridge initialized
DEBUG:root: Text extraction failed, falling back to ocrmypdf
DEBUG:root: Text extraction failed, falling back to ocrmypdf
DEBUG:invoice2data.input.ocrmypdf: input_reader_config received from main are, {}
DEBUG:invoice2data.input.ocrmypdf: ocrmypdf config settings are: {'redo_ocr': True, 'optimize': 0, 'output_type': 'pdf', 'fast_web_view': 0}
WARNING:ocrmypdf._pipeline: This PDF is marked as a Tagged PDF. This often indicates that the PDF was generated from an office document and does not need OCR. PDF pages processed by OCRmyPDF may not be tagged correctly.
OCR ---------------------------------------- 0% 0/1 -:--:--
WARNING:ocrmypdf._pipeline: Weighted average image DPI is 152.1, max DPI is 247.7. The discrepancy may indicate a high detail region on this page, but could also indicate a problem with the input PDF file. Page image will be rendered at 400.0 DPI.
OCR ---------------------------------------- 100% 1/1 0:00:00
Linearizing ---------------------------------------- 100% 100/100 0:00:00
INFO:invoice2data.input.ocrmypdf: Text extraction made with ocrmypdf
DEBUG:
The result from pdftotext is empty.
So you're likely running into dependency issues with pdftotext / poppler-utils on Windows. Currently Windows is not well supported or tested.
There is an open PR to enhance support, but its tests are failing: https://github.com/invoice-x/invoice2data/pull/565
I'm a Linux user, so I cannot give you a lot of support on Windows.
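A quick way to test the dependency theory is to check whether the pdftotext binary is reachable from the same environment that runs invoice2data. The sketch below is only an illustration using the Python standard library; the helper name pdftotext_status is made up here, and it assumes the reader looks the binary up on PATH.

```python
import shutil
import subprocess


def pdftotext_status(binary="pdftotext"):
    """Return (path, banner) for the binary, or (None, None) if it is not on PATH.

    Hypothetical helper for this thread; it is not part of invoice2data.
    """
    path = shutil.which(binary)
    if path is None:
        return None, None
    # Poppler's pdftotext prints its version banner when called with -v
    # (usually on stderr, so fall back to stdout just in case).
    result = subprocess.run([path, "-v"], capture_output=True, text=True)
    banner = (result.stderr or result.stdout).strip()
    return path, banner


path, banner = pdftotext_status()
if path is None:
    print("pdftotext was not found on PATH; install Poppler and add its bin folder to PATH")
else:
    print(f"pdftotext found at {path}")
    print(banner)
```

If the binary is missing, an empty pdftotext result followed by the ocrmypdf fallback would be consistent with the log above.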
But the existing templates are working fine. I am just not able to extract the data from my PDF.
There is one file at D:\invoice2data-master\invoice2data-env\Lib\site-packages\invoice2data-0.4.5.dist-info\RECORD. Do I need to do anything with this file for new templates, or do I just need to create the templates?
Just creating the templates should be fine.
Let's check if the template you have created has been loaded.
Do you see your template in the list of loaded templates?
What do you mean by loaded templates? I can see this one: D:\invoice2data-master\src\invoice2data\extract\templates\in\in.demovoice.yml
But I cannot see it here:
DEBUG:root: END pdftotext result =============================
DEBUG:invoice2data.extract.invoice_template: Template: au.com.opal.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: au.com.telstra.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.accor.invest.ibis.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.accor.invest.novotel.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.boucherie.pochet.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.cebeo.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.eg_retail.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.lampiris.facture-dacompte.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.lampiris.factuur.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.lampiris.regularisation.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.melchior-vins.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.proximus.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.scarlet.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: be.securex.social.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: ch.pcengines.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.AzureInterior.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.amazon.aws.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.apple.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.apps4rent.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.binarylife.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.bloomberg.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.cloudns.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.datadoghq.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.digitalocean.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.envato.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.expressvpn.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.expressvpn_prio6.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.ftserussell.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.github.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.globalsign.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.google.adwords.hk.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.hobohost.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.jamiepro.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.linode.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.microsoftonline.hk-v2017.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.microsoftonline.hk.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.mongodb.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.namecheap.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.namesilo.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.newrelic.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.nl.lenovo.digitalriver.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.nmmn.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.nodisto.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.nyse.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.oyo.invoice.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.packtpub.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.pixartprinting.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.sammymaystone.yml | Keywords matched. No exclude keywords found.
DEBUG:invoice2data.extract.invoice_template: Template: com.scaleway.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.textmaster.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.tmx.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.travis-ci.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.twitter.de.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.twitter.uk.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.twitter.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.upwork.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.usersnap.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: de.amazon.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: de.bettina-kast.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: de.digikey.com.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: de.hosteurope.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: de.notebooksbilligerBillPay.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: de.ovh.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: de.qualityhosting.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: de.united-domains.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.pepephone.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: es.supplies24.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: co.mooncard.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.adobe.ie.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.akretion.fr.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.amazon.aws.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.ateliercopieservice.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.chauffeur-prive.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.coriolis.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.easyjet.fr.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.eaudugrandlyon.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.godaddy.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.google.ie.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.hootsuite.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.jeanbesson.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.ldlc.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.linkedin.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.mention.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.microsoft.ie.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.myflyingbox.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.officetimeline.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.orange-business.mobile.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.ovh.fr.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.rs-online.fr.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.saur.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.soyoustart.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: com.vinci-autoroutes.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: dolibarr.generique.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: eu.trainline.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.actn.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.airfrance.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.also.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.amazon.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.assurance-epargne-pension.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.bouyguestelecom.adsl-fiber.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.bouyguestelecom.mobile.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.butagaz.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.chronopost.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.dirafi.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.domaine-achat.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.easytrip.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.edf.entreprises.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.edf.pme.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.finagaz.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.fountain.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.free.adsl-fiber.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.free.mobile.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.free.mobile2.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.futur.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.ge-iroise.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.greffe-tc-lyon.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.hiscox.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.internetsatellite.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.jpg.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.kubii.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.laposte.boutique.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.laposte.coliposte.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.lecab.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.leroymerlin.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.maaf.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.mediapart.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.moneo-resto.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.mouser.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.mycelium-roulement.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.napsis.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.nexity.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.orange.fibre.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.orange.fixedline.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.prestaclic.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.publicationannoncelegale.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.sfr.adsl-fiber.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.sfr.mobile.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.sosh.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.teledec.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: fr.topoffice.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: net.online.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: net.scaleway.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.action.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.albron.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.anwb.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.be.coolblue.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.begra.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.blokker.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.bouwmans.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.bp.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.bunq.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.cpe.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.esso_eg_services.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.esso_eg_services_v2.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.farnell.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.ferbox.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.gamma.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.goos.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.gulf.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.ipparking.paleiskwartier.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.karwei.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.kav.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.koffiehenk.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.momentsenmore.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.ns.invoice.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.ok.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.parkmobile.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.praxis.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.reclameland.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.saeco.philips.eluscious.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.shell_nederland.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.shell_schellenkens.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.simpel.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.total_express.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.total_ototol.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.transip.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.tuynder.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.vistaprint.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.vodafone.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.wasco.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.weid.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.yezzer.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: nl.zinkunie.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: pl.bmw-fs.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: pl.insert.subiekt-gt.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: pl.insert.subiekt-nexo.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: pl.orlen.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: pl.p4.yml | Failed to match all keywords.
DEBUG:invoice2data.extract.invoice_template: Template: pl.paypro.yml | Failed to match all keywords.
Why did you ask? I have only created the yml file with my regex inside the template folder.
So is there anything else I need to follow up on?
Why?
Because you need to check if the template you have created is properly loaded.
Check if you're pointing to the correct folder.
(You can disable the built-in templates with the following flag to reduce the noise: --exclude-built-in-templates)
You should see your template in that list.
If your template is correct, it should say that the keywords have matched,
followed by: using template <your template file>
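To make the advice above concrete, here is a small sketch that assembles a debug run pointing at a custom template folder with the built-in templates switched off. The flags are the ones mentioned in this thread; the folder name my_templates and the helper build_debug_command are placeholders invented for illustration.

```python
import shlex


def build_debug_command(pdf_path, template_folder, exclude_built_in=True):
    """Assemble an invoice2data call that points at a custom template folder.

    Hypothetical helper; it only builds the argument list and runs nothing.
    """
    cmd = ["invoice2data", "--template-folder", template_folder]
    if exclude_built_in:
        # Hides the built-in templates so the debug log only lists your own.
        cmd.append("--exclude-built-in-templates")
    cmd += ["--debug", pdf_path]
    return cmd


# "my_templates" is an assumed folder name; use the folder holding your .yml file.
cmd = build_debug_command("input/demoinvoice.pdf", "my_templates")
print(shlex.join(cmd))
```

The printed line can be pasted into the terminal, or the list can be handed to subprocess.run once invoice2data is installed in the active environment.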
Even after I deleted my templates, it is still parsing the existing PDF. How is that possible? I deactivated it and activated it again, though.
You have to verify if your template is being loaded.
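Reading a few hundred debug lines by eye is error-prone, so one option is to let a short script do the lookup. The sketch below is an illustration, not part of invoice2data: it strips the colour codes from pasted --debug output and reports the status logged for each template. The two sample lines are copied from the log in this thread, and the function name template_statuses is made up.

```python
import re

# Colour codes, either as real escape bytes or as they appear when pasted ("←[0m").
ANSI = re.compile(r"(?:\x1b|←)\[[0-9;]*m")
TEMPLATE_LINE = re.compile(
    r"Template:\s+(?P<name>\S+)\s+\|\s+"
    r"(?P<status>Failed to match all keywords|Keywords matched)"
)


def template_statuses(log_text):
    """Map each template file named in the debug log to the status reported for it."""
    clean = ANSI.sub("", log_text)
    return {m.group("name"): m.group("status") for m in TEMPLATE_LINE.finditer(clean)}


sample = (
    "DEBUG:←[0minvoice2data.extract.invoice_template: Template: au.com.opal.yml "
    "| Failed to match all keywords.←[0m "
    "DEBUG:←[0minvoice2data.extract.invoice_template: Template: com.sammymaystone.yml "
    "| Keywords matched. No exclude keywords found.←[0m"
)
statuses = template_statuses(sample)
print("in.invoicedemo.yml" in statuses)  # False
print([name for name, status in statuses.items() if status == "Keywords matched"])
```

If your own file name never shows up in the result, the template was not loaded at all, which points to a folder problem rather than a regex problem.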
Are you pointing to the correct folder? -- yes
Is your custom template loaded? Or does the debugger show that there is an error in your template? -- yes, an error is showing
Is your template selected? Do the keywords match? -- yes, checking
But I do not understand: when I deleted the existing templates for test purposes, it still works. How is that possible? Where is it matching the keywords from? It should show that the yml file is not available, but it still matches after I deleted it.
That sounds like a folder issue.
Maybe it is installed in different versions or locations.
What path is shown when you run 'pip show invoice2data'?
Is that the same location as where you were deleting the files?
My template location path is: D:\invoice2data-master\src\invoice2data\extract\templates. Is that okay?
No, because your standard templates are loaded from the directory in the screenshot.
For easy testing, go to that location and delete the standard templates there, or add your own custom ones there.
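If the screenshot is not at hand, the folder the running interpreter actually uses can also be found from Python. This is a rough sketch with a made-up helper name, builtin_template_dir; it assumes the built-in templates sit under extract/templates inside the installed package, which matches the "Loaded 189 templates from ..." line in the log.

```python
import importlib.util
from pathlib import Path


def builtin_template_dir(package="invoice2data"):
    """Return <package dir>/extract/templates for an importable package, else None.

    The extract/templates layout is taken from the loader message in this
    thread; treat it as an assumption for other versions.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent / "extract" / "templates"


folder = builtin_template_dir()
if folder is None:
    print("invoice2data is not importable from this interpreter")
else:
    print(f"built-in templates are read from: {folder} (exists: {folder.is_dir()})")
```

Run it inside the activated virtual environment. If the printed path differs from the folder where the yml file was created, that would explain why deleting or adding files there has no effect.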
Steps to add new template
To add a new template, we recommend this workflow:
1. Copy existing template to new file
Find a template that is roughly similar to what you need and copy it to a new file. It's good practice to use reverse domain notation, e.g. country.company.division.language.yml or fr.mobile.enterprise.french.yml. Language is not always needed. Template folders are searched recursively for files ending in .yml.
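The recursive search described above can be pictured with a few lines of standard-library Python. This only mirrors the described behaviour and is not the library's own loader; the function name find_template_files and the demo files are made up.

```python
import tempfile
from pathlib import Path


def find_template_files(folder):
    """Return every *.yml file below folder, sorted, searching subfolders too."""
    return sorted(Path(folder).rglob("*.yml"))


# Demo on a throwaway folder; the file names are invented for illustration.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "in").mkdir()
    (root / "in" / "in.invoicedemo.yml").write_text("issuer: Demo\n")
    (root / "notes.txt").write_text("not a template\n")
    found = [p.relative_to(root).as_posix() for p in find_template_files(root)]
    print(found)  # ['in/in.invoicedemo.yml']
```

Pointing find_template_files at your own template folder shows which files such a search would pick up, including files in subfolders.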
2. Change invoice issuer
Just used in the output. Best to use the company name.
3. Set keyword
Look at the invoice and find the best identifying string. Tax number + company name are good options. Remember, all keywords need to be found for the template to be used.
Keywords are compared before processing the extracted text.
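The "all keywords need to be found" rule can be illustrated as follows. This is a simplified sketch that assumes each keyword is treated as a regular expression searched in the extracted text; the invoice text and keywords are invented sample data and the function is not the library's code.

```python
import re


def matches_all_keywords(text, keywords):
    """True only if every keyword pattern is found somewhere in the text."""
    return all(re.search(keyword, text) for keyword in keywords)


# Made-up extracted text standing in for the --debug output of a real invoice.
invoice_text = "Demo Trading Pvt Ltd\nGSTIN: 22AAAAA0000A1Z5\nInvoice No: 1001"

print(matches_all_keywords(invoice_text, ["Demo Trading", "GSTIN"]))   # True
print(matches_all_keywords(invoice_text, ["Demo Trading", "VAT ID"]))  # False
```

A single keyword that does not appear in the extracted text is enough for a "Failed to match all keywords" result, so keywords are best copied straight from the debug output.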
4. First test run
Now we're ready to see how far we are off. Run invoice2data with the following debug command to see if your keywords match and how much work is needed for dates, etc.
invoice2data --template-folder tpl --debug invoice-XXX.pdf
This test run shows you how the program will "see" the text in the invoice. Parsing PDFs is sometimes a bit unpredictable. Also make sure your template is used. You should already receive some data from static fields or currencies.
5. Add regular expressions
Now you can use the debugging text to add regex fields for the information you need. It's a good idea to copy parts of the text directly from the debug output and then replace the dynamic parts with regex. Keep in mind that some characters need escaping. To test, re-run the above command.
date field: First capture the date. Then see if dateparser handles it correctly. If not, add your format or language under options.
amount: Capture the number without currency code. If you expect high amounts, replace the thousand separator. Currently we don't parse numbers via locales (TODO).
6. Done
Now you're ready to commit and push your template, so others get a chance to use and improve it.
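For step 5, trying the regexes on a snippet of the debug text outside of invoice2data can save a few round trips. The sketch below uses invented sample text and patterns; datetime.strptime stands in for dateparser here so that the example needs only the standard library.

```python
import re
from datetime import datetime

# Made-up lines, as they might appear in the --debug text of an invoice.
debug_text = "Invoice Date: 05/03/2024\nTotal Amount: INR 1,25,000.50"

# Keep the static label, replace only the dynamic part with a capture group.
date_match = re.search(r"Invoice Date:\s+(\d{2}/\d{2}/\d{4})", debug_text)
amount_match = re.search(r"Total Amount:\s+[A-Z]{3}\s+([\d,]+\.\d{2})", debug_text)

# Parse the captured date with an explicit day/month/year format.
invoice_date = datetime.strptime(date_match.group(1), "%d/%m/%Y").date()
# Drop the thousand separators before converting the amount to a number.
amount = float(amount_match.group(1).replace(",", ""))

print(invoice_date)  # 2024-03-05
print(amount)        # 125000.5
```

Once a pattern captures the right text here, the same regex can go into the template's date or amount field.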
My question: I have added a new template in yml with regexes for my invoice, but when I parse the invoice PDF it is not parsed and an error is shown.
Error message:
(invoice2data-env) D:\invoice2data-master\src\invoice2data>invoice2data --output-format csv --output-name output/invoices.csv input/demoinvoice.pdf
INFO:invoice2data.extract.loader: Loaded 189 templates from D:\invoice2data-master\invoice2data-env\Lib\site-packages\invoice2data\extract\templates
INFO:pikepdf._core: pikepdf C++ to Python logger bridge initialized
Scanning contents ---------------------------------------- 100% 1/1 0:00:00
WARNING:ocrmypdf._pipeline: This PDF is marked as a Tagged PDF. This often indicates that the PDF was generated from an office document and does not need OCR. PDF pages processed by OCRmyPDF may not be tagged correctly.
OCR ---------------------------------------- 0% 0/1 -:--:--
WARNING:ocrmypdf._pipeline: Weighted average image DPI is 152.1, max DPI is 247.7. The discrepancy may indicate a high detail region on this page, but could also indicate a problem with the input PDF file. Page image will be rendered at 400.0 DPI.
OCR ---------------------------------------- 100% 1/1 0:00:00
Linearizing ---------------------------------------- 100% 100/100 0:00:00
INFO:invoice2data.input.ocrmypdf: Text extraction made with ocrmypdf
ERROR:root: No template for input/demoinvoice.pdf