Closed by m3nu 8 years ago
Yes, sure.
But, for particular invoices, a single regexp is not enough to extract the needed information. For example, for invoices from BinaryLife Inc. (www.browserstack.com), the date appears in the text as follows:
Dated: 14t h Dec, 2015
If I extract "14t h Dec, 2015", invoice2data cannot parse the date, and I don't want to patch the date parsing to support "14t h" (with a space between the t and the h!). Extracting the string '14 Dec, 2015' would work, but AFAIK a single regexp cannot do that, because a capture group can only return contiguous text. So, for this particular invoice, I would like to execute a few lines of Python code to get the invoice date.
So, for me, the evolution is towards a structure in which each field can have either a regexp OR a few lines of code (which could include one or more regexps). YAML could certainly do the job, because it can hold both a structure and some lines of code (at least that's what we do in Odoo with the YAML tests).
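A minimal sketch of the "regexp OR code" idea, assuming a template is a plain dict whose values are either a regexp string or a Python callable. The names `extract_field`, `browserstack_date` and the `template` layout are hypothetical and are not part of invoice2data. Note that `%b` in `strptime` depends on the locale; an English or C locale is assumed.

```python
import re
from datetime import datetime


def extract_field(rule, text):
    """Apply one field rule: a regexp string (group 1 is the value) or a callable."""
    if callable(rule):
        return rule(text)
    match = re.search(rule, text)
    return match.group(1) if match else None


def browserstack_date(text):
    # Hypothetical per-template code: capture day, month and year separately,
    # skipping the split ordinal suffix ("t h"), then parse the pieces.
    match = re.search(
        r"Dated:\s*(\d{1,2})\s*[a-z]\s*[a-z]\s+(\w+),\s*(\d{4})", text
    )
    if match is None:
        return None
    day, month, year = match.groups()
    return datetime.strptime("%s %s %s" % (day, month, year), "%d %b %Y")


# Hypothetical template: one field uses a regexp, the other uses code.
template = {
    "invoice_number": r"Invoice\s+#(\d+)",
    "date": browserstack_date,
}

text = "Invoice #1234\nDated: 14t h Dec, 2015"
result = {name: extract_field(rule, text) for name, rule in template.items()}
```

The dispatch on `callable(rule)` keeps simple templates as simple as they are today, while the few awkward invoices get an escape hatch.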
There are a few more things to take into consideration:
1) A single keyword is sometimes not enough to match the right template. For example, a large telecom operator such as Orange has many different invoice layouts for its different business units, and all these business units share the same VAT number because they are the same company. So I can't use the VAT number alone to identify the template; I would like to use both the VAT number (because it is very accurate) and one or more additional strings.
2) Multi-language invoices: for example, if I am a French Dropbox customer, I will receive an invoice in French with the VAT number "IE 9852817J". But if I am an English Dropbox customer, I will receive an invoice in English with the same VAT number. So if I use the VAT number as the keyword, that's a problem, because one template cannot handle both the French and the English Dropbox invoices. Again, I think the right solution is to use several keywords to match the right template for Dropbox: the first keyword would be the VAT number (to be sure that we are dealing with a Dropbox invoice) and the second keyword would be a string in the language of the invoice, such as "Facture" for the Dropbox French template and "Invoice" for the Dropbox English template.
I think the "several keywords" system can be implemented very quickly; we just have to convert keyword from a string to a list of strings.
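As a rough illustration of the multi-keyword idea, the sketch below treats a template as a match only when every one of its keywords occurs in the invoice text. The function names and the dict layout are assumptions made for this example, not the actual invoice2data implementation.

```python
def matches_template(template, text):
    """Return True only if every keyword of the template occurs in the text."""
    keywords = template["keywords"]
    if isinstance(keywords, str):
        # Stay compatible with old templates that have a single keyword string.
        keywords = [keywords]
    return all(keyword in text for keyword in keywords)


def find_template(templates, text):
    """Return the first template whose keywords all match, or None."""
    for template in templates:
        if matches_template(template, text):
            return template
    return None


# Hypothetical templates: same VAT number, different language keyword.
templates = [
    {"issuer": "Dropbox (FR)", "keywords": ["IE 9852817J", "Facture"]},
    {"issuer": "Dropbox (EN)", "keywords": ["IE 9852817J", "Invoice"]},
]

french = find_template(templates, "Facture Dropbox, TVA IE 9852817J")
english = find_template(templates, "Invoice Dropbox, VAT IE 9852817J")
```

Accepting a bare string as well as a list means existing single-keyword templates would not need to be rewritten.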
OK, I added support for multi-keywords, which I think solves both the multi-language problem and the problem of large suppliers that have many different invoice templates.
@manuelRiel By the way, could you update the Amazon template and add a keyword that is specific to the German version of the invoice, so that I can add a template for Amazon France?
Now that we have multi-keyword, we can continue the discussion on a possible new format and the issue of the few invoices in which a single regexp is not enough and we require real code.
Implemented a new template system based on YAML files. By default it uses our existing templates, but you can also point it to an external folder, for example:
python3 -m invoice2data.main --debug --template-folder invoice2data/templates invoice2data/test/pdfs/*
Other changes:
invoice2data/test/pdfs/*. Better separation. @alexis-via Commit: 86785c9
And:
Up next:
Thanks for all these new features! They look really great, in particular the new options, the new YAML files with one file per supplier/template, and the ability to define a date format and several regexps (I need to test that!)
I am not an expert in the Python 2 vs Python 3 stuff, but I guess that your move to Python 3 will oblige me to maintain a Python 2 branch in order to continue to use invoice2data with Odoo (Odoo uses Python 2.7)... unless there is a way to use a Python 3 lib from a Python 2 program without too many headaches (but I don't think so). In this case, we need to:
I am running tests for several py2 and py3 versions. The same code is compatible with 2 and 3. So no need for an extra version.
If you run into errors in py2, just open an issue.
OK, I didn't know it was possible to use the same lib for Python 2 and Python 3!
I haven't managed to use the new version yet. First, extract_data takes one more argument (not mentioned in the README); I think it should be an optional argument (if this second argument is not set, it defaults to using all templates). Then, I have a crash:
>>> extract_data('/home/alexis/myinvoice.pdf', '/home/alexis/new_boite/dev/invoice2data/invoice2data/templates/fr')
DEBUG:invoice2data.main:number of char in pdf2text extract: 2826
DEBUG:invoice2data.main:Testing 83 template files
'/'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/invoice2data-0.2.0-py2.7.egg/invoice2data/main.py", line 59, in extract_data
logger.debug("keywords=%s", t['keywords'])
TypeError: string indices must be integers, not str
But I need more time to investigate it; I'll have a look this week.
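The thread does not confirm the cause of this crash. The `'/'` printed just before the traceback would be consistent with the code iterating over the characters of a path string instead of over a list of template dicts, but that is only a guess. The sketch below reproduces that class of error and shows one way the second argument could be made optional. `extract_data_sketch` and `load_all_templates` are hypothetical names, not the real invoice2data API.

```python
def load_all_templates():
    # Hypothetical stand-in for reading every bundled template file.
    return [{"issuer": "Example", "keywords": ["IE 9852817J", "Invoice"]}]


def extract_data_sketch(invoice_text, templates=None):
    """Hypothetical sketch; not the real invoice2data signature."""
    if templates is None:
        # Optional argument: fall back to all bundled templates.
        templates = load_all_templates()
    if isinstance(templates, str):
        raise TypeError("templates must be a list of dicts, not a path string")
    for t in templates:
        if all(keyword in invoice_text for keyword in t["keywords"]):
            return t
    return None


# Iterating over a path string yields single characters, and indexing a
# character with a string key raises TypeError.
folder = "/some/template/folder"  # hypothetical path
first_item = next(iter(folder))
try:
    first_item["keywords"]
    raised = False
except TypeError:
    raised = True
```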
I had a look at the dateparser lib; it seems good, but it's not magic either. For example, see my comment in https://github.com/scrapinghub/dateparser/issues/52
Another example, with the latest version of dateparser:
>>> dateparser.parse(u'14th Dec, 2015', languages=['en'])
datetime.datetime(2015, 12, 14, 0, 0)
>>> dateparser.parse(u'14thDec,2015', languages=['en'])
>>> dateparser.parse(u'14t h Dec, 2015', languages=['en'])
datetime.datetime(2015, 12, 27, 14, 0)
So for my example of the BinaryLife Inc. invoices (www.browserstack.com), I still need to execute some Python code, or at least do a replace before sending the date to the parser.
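A sketch of the "replace before parsing" step, assuming a per-template list of (pattern, replacement) pairs. The `REPLACEMENTS` list and `clean_date` are hypothetical. The standard library `datetime.strptime` is used here instead of dateparser so that the example stays self-contained; `%b` assumes an English or C locale.

```python
import re
from datetime import datetime

# Hypothetical clean-up rules applied to the raw date string before parsing.
REPLACEMENTS = [
    # Drop an ordinal suffix after a digit, even when it is split by spaces
    # ("14th", "14t h", "1s t", ...).
    (r"(\d)\s*(s\s*t|n\s*d|r\s*d|t\s*h)\b", r"\1"),
]


def clean_date(raw):
    """Apply every (pattern, replacement) pair in order."""
    for pattern, replacement in REPLACEMENTS:
        raw = re.sub(pattern, replacement, raw)
    return raw


cleaned = clean_date("14t h Dec, 2015")
parsed = datetime.strptime(cleaned, "%d %b, %Y")
```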
I have a similar problem with the invoices of Free Mobile: they use a special character for the "é" in Décembre (December in French), so it doesn't work with dateparser. I implemented a remove_accents option (via unidecode) so that it becomes "Decembre", but that doesn't work with dateparser either... it needs Décembre with the accent to work! Again, a replace would be enough.
Another big problem that I just discovered: it seems dateparser 0.3.2, released a few days ago, is broken:
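The thread does not preserve which special character Free Mobile uses. Purely as an assumption, if it is the decomposed form (a plain "e" followed by a combining acute accent), then Unicode NFC normalization from the standard library acts as the needed replace and keeps the accent:

```python
import unicodedata

# Assumed input: "e" followed by U+0301 COMBINING ACUTE ACCENT.
raw = "De\u0301cembre"

# NFC composes the pair into the single character U+00E9 ("é").
normalized = unicodedata.normalize("NFC", raw)
```

If the character turns out to be something else, an explicit `str.replace` for that character would be the equivalent step.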
1:11 alexis@silence ~ % python
Python 2.7.10 (default, Oct 14 2015, 16:09:02)
[GCC 5.2.1 20151010] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from invoice2data import extract_data
ou
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/invoice2data-0.2.0-py2.7.egg/invoice2data/__init__.py", line 1, in <module>
from .main import extract_data
File "/usr/local/lib/python2.7/dist-packages/invoice2data-0.2.0-py2.7.egg/invoice2data/main.py", line 10, in <module>
import dateparser
File "/usr/local/lib/python2.7/dist-packages/dateparser-0.3.2-py2.7.egg/dateparser/__init__.py", line 4, in <module>
from .date import DateDataParser
File "/usr/local/lib/python2.7/dist-packages/dateparser-0.3.2-py2.7.egg/dateparser/date.py", line 11, in <module>
from dateparser.date_parser import date_parser
File "/usr/local/lib/python2.7/dist-packages/dateparser-0.3.2-py2.7.egg/dateparser/date_parser.py", line 16, in <module>
from .conf import apply_settings
File "/usr/local/lib/python2.7/dist-packages/dateparser-0.3.2-py2.7.egg/dateparser/conf.py", line 62, in <module>
settings = Settings()
File "/usr/local/lib/python2.7/dist-packages/dateparser-0.3.2-py2.7.egg/dateparser/conf.py", line 31, in __init__
self._updateall(self._get_settings_from_yaml().items())
AttributeError: 'NoneType' object has no attribute 'items'
EDIT: Hmm, I can't reproduce this issue any more, but there is an issue that I can always reproduce: I can't use dateparser from Odoo (I mean, Odoo uses invoice2data, which uses dateparser); it always fails in dateparser/conf.py on this line: https://github.com/scrapinghub/dateparser/blob/master/dateparser/conf.py#L44
2016-01-28 08:23:21,731 3849 INFO o8_test1 openerp.modules.loading: loading 1 modules...
2016-01-28 08:23:21,738 3849 INFO o8_test1 openerp.modules.loading: 1 modules loaded in 0.01s, 0 queries
2016-01-28 08:23:22,149 3849 INFO o8_test1 openerp.modules.loading: loading 100 modules...
2016-01-28 08:23:22,261 3849 INFO o8_test1 openerp.modules.loading: 100 modules loaded in 0.11s, 0 queries
2016-01-28 08:23:22,270 3849 INFO o8_test1 openerp.modules.loading: loading 102 modules...
2016-01-28 08:23:22,357 3849 CRITICAL o8_test1 openerp.modules.module: Couldn't load module account_invoice_import
2016-01-28 08:23:22,358 3849 CRITICAL o8_test1 openerp.modules.module: find_module() takes exactly 3 arguments (2 given)
> /usr/lib/python2.7/pkgutil.py(475)find_loader()
-> loader = importer.find_module(fullname)
(Pdb) fullname
'data'
(Pdb) up
> /usr/lib/python2.7/pkgutil.py(464)get_loader()
-> return find_loader(fullname)
(Pdb) up
> /usr/lib/python2.7/pkgutil.py(578)get_data()
-> loader = get_loader(package)
(Pdb) up
> /usr/local/lib/python2.7/dist-packages/dateparser-0.3.2-py2.7.egg/dateparser/conf.py(44)_get_settings_from_yaml()
-> data = get_data('data', 'settings.yaml')
(Pdb) data
*** NameError: name 'data' is not defined
Just don't remove the whitespace if date parsing doesn't work without it.
@m3nu If I keep the whitespace, dateparser returns a wrong result:
>>> dateparser.parse(u'14t h Dec, 2015', languages=['en'])
datetime.datetime(2015, 12, 27, 14, 0)
Then you can use the yml-template option to give the date format. That will work for sure.
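To illustrate why an explicit date format sidesteps the guessing, here is a sketch using the standard library `datetime.strptime` (the thread does not show how the template option is implemented, so this is only an analogy). The stray "t h" is written literally into the format string; `%b` assumes an English or C locale.

```python
from datetime import datetime

raw = "14t h Dec, 2015"

# Literal characters in the format must match the input exactly,
# so the broken ordinal suffix "t h" is spelled out here.
fmt = "%dt h %b, %Y"
parsed = datetime.strptime(raw, fmt)
```

The limitation is that the literal suffix only covers days ending in "th"; a date such as "1s t Jan, 2016" would need a different format or a clean-up step first.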
We can add another option to set the language if that improves it or auto-detect the invoice language.
FYI, I spent more time investigating tonight and I confirm that Odoo is currently incompatible with dateparser (and it has nothing to do with invoice2data). So I opened a full bug report at Odoo (https://github.com/odoo/odoo/issues/10670) and I also opened a bug report at dateparser (https://github.com/scrapinghub/dateparser/issues/141). This bug is not easy and I am not able to fix it by myself for the moment.
I'll wait a few days to see if someone comes up with a solution to this incompatibility. If not, I'm afraid I'll have to develop a branch of invoice2data that doesn't use dateparser... but I hope it won't happen, because it would split our efforts across two separate branches (as I would only be motivated to develop in the branch that works with Odoo).
The bug has been fixed in Odoo and the patch is already in Odoo v8 (incredible!!!).
Unfortunately, we are not done with the issues caused by the recent changes... because I still have a crash when trying to install the Odoo module account_invoice_import. Here is the cause of this "next" issue (I hope it is the last one!): when I install invoice2data with the command "sudo python ./setup.py install", it reads requirements.txt, which lists pdfminer3k (for Python 3), so it installs pdfminer3k. But after that, if I import "from pdfminer.pdfparser import PDFParser" (the Odoo module account_invoice_import needs that), I have a crash:
% python
Python 2.7.10 (default, Oct 14 2015, 16:09:02)
[GCC 5.2.1 20151010] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pdfminer.pdfparser import PDFParser
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pdfminer3k-1.3.0-py2.7.egg/pdfminer/pdfparser.py", line 7, in <module>
from .psparser import PSStackParser, PSSyntaxError, PSEOF, literal_name, LIT, KWD, handle_error
File "/usr/local/lib/python2.7/dist-packages/pdfminer3k-1.3.0-py2.7.egg/pdfminer/psparser.py", line 4, in <module>
from .utils import choplist
File "/usr/local/lib/python2.7/dist-packages/pdfminer3k-1.3.0-py2.7.egg/pdfminer/utils.py", line 212, in <module>
0x00f8, 0x00f9, 0x00fa, 0x00fb, 0x00fc, 0x00fd, 0x00fe, 0x00ff,
File "/usr/local/lib/python2.7/dist-packages/pdfminer3k-1.3.0-py2.7.egg/pdfminer/utils.py", line 180, in <genexpr>
PDFDocEncoding = ''.join( chr(x) for x in (
ValueError: chr() arg not in range(256)
The details of what is installed:
% pip freeze|grep pdfm
pdfminer==20140328
pdfminer3k==1.3.0
The only solution I have found so far is to manually uninstall pdfminer3k. @sebastienbeau Do you have a recommendation about how to handle this?
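For context on the traceback above: in Python 2, `chr()` only accepts values in `range(256)`, while in Python 3 it accepts any Unicode code point, which is why code written for Python 3 can fail this way under Python 2.7. The sketch below shows a common compatibility idiom; it is an illustration only, not a patch for pdfminer3k, and `to_char` is a hypothetical name.

```python
import sys

if sys.version_info[0] >= 3:
    # Python 3: chr() accepts any Unicode code point.
    to_char = chr
else:
    # Python 2: chr() is limited to range(256); unichr() handles the rest.
    to_char = unichr  # noqa: F821 (only defined on Python 2)

# A few code points, two of them above 255.
code_points = (0x0041, 0x02D8, 0x20AC)
decoded = u"".join(to_char(x) for x in code_points)
```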
It seems the solution is there:
I made a PR for that here https://github.com/m3nu/invoice2data/pull/14
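The thread does not show what the PR changes. One common way to handle this kind of conflict, given here only as an assumption, is to choose the PDF dependency according to the running interpreter, so that Python 2 installs get pdfminer and Python 3 installs get pdfminer3k. `pdfminer_requirement` is a hypothetical helper.

```python
import sys


def pdfminer_requirement(version_info=sys.version_info):
    """Return the PDF library name to require for the given interpreter version."""
    return "pdfminer3k" if version_info[0] >= 3 else "pdfminer"


# This list could then be passed to setup() as install_requires.
install_requires = [pdfminer_requirement()]
```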
Thanks @bguillot. So this is working for everyone now?
@m3nu At least it is working fine on Python 2 with Odoo without any hack. I have already updated the installation procedure of the Odoo module that uses invoice2data... it's now much simpler to install!
Hello guys, instead of using YAML files as the template format, can't we use JSON files? I have a project and I want the templates to be in JSON format. How can I do it? Thanks
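A sketch of what a JSON template could look like, assuming a template is just a mapping with a keyword list and one regexp per field. The layout and the `extract` function are hypothetical and do not reflect the actual invoice2data schema; the point is only that the same data structure can be loaded with the standard `json` module.

```python
import json
import re

# Hypothetical template. Backslashes must be doubled inside JSON strings.
TEMPLATE_JSON = r"""
{
  "issuer": "Example Supplier",
  "keywords": ["IE 9852817J", "Invoice"],
  "fields": {
    "invoice_number": "Invoice\\s+#(\\d+)"
  }
}
"""

template = json.loads(TEMPLATE_JSON)


def extract(template, text):
    """Return the extracted fields, or None if the keywords do not all match."""
    if not all(keyword in text for keyword in template["keywords"]):
        return None
    fields = {}
    for name, pattern in template["fields"].items():
        match = re.search(pattern, text)
        fields[name] = match.group(1) if match else None
    return fields


data = extract(template, "Invoice #42, VAT IE 9852817J")
```

One trade-off to keep in mind is the doubled backslashes, which make regexps harder to read in JSON.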
First, the current template.py is really a data structure, rather than Python code. It should be in JSON or, even better, YAML. Next, it should be possible to add separate templates, depending on the current project.