Closed m3nu closed 6 years ago
Should I add single line docstring (telling what it does and returns) or a multi-line (giving info about all variables and return values)?
Complex public-facing functions will need multiple lines with all details.
Simple internal helper functions may only need a single line.
In the end, anyone who never used the library needs to know enough about arguments and return values to get started quickly. This is true for invoice2data and even more so for facturx.
How does this look for extract_data()
"""Function to extract data from e-invoices
:param invoicefile: path of invoice
:param templates: load templates (template name)
:param input_module: input module to use out of input_mapping dict. (default pdftotext)
:return: dict of extracted and matched fields or False if no template matches
"""
Like in maths this needs precision because those things make a difference.
Like this?
"""extract data from e-invoices
* reads template if no template assigned
* text is extracted from invoice using input module and stored in extracted_str
* extracted text is sent for optimization in prepare_input()
* required fields are matches from templates
Note: Import all required module when using as a library
Example: If you want to use pdfminer
>>> from invoice2data.main import pdfminer
>>> extract_data("/home/duskybomb/pdf/invoice.pdf", "com.aws-invoice.yaml", pdfminer)
:param invoicefile: path of invoice (example: "/home/duskybomb/pdf/invoice.pdf")
:param templates: load templates (template name)
:param input_module: input module to use out of input_mapping dict. (default pdftotext)
:var extracted_str: saves extracted text from input module
:var optimized_str: stores optimized string after passing extracted_str to prepare_input()
:return: dict of extracted and matched fields, False if no template matches
"""
Made new commits to branch issue/119. Do check.
I like the idea of giving an example. That's good. PEP doesn't define a sample for an example, but Google does.
For the first sentence, let's avoid the word "e-invoice" since it's not defined or used anywhere. Maybe like this:
Extracts structured elements (like data, amount, etc) from PDF invoices using different extraction backends and pre-defined templates.
The list (reads template...) is really an extension of input and output params. So let's move it there and extend the input- and output arguments a bit.
And regarding params:
invoicefile
needs to be a path, more precisely a str
with the path of the PDF invoice to extract data from.templates
: You need to mention that this is an optional param and what happens when it's not defined. You also need to tell the reader that templates are loaded using the helper class we have and best give an example for it. Or add a good Docstring to the InvoiceTemplate
class.input_module
: Here you need to mention that the module needs to be imported first and maybe 2-3 words on what it does. And a note that people can define their own module.Then the part of extracted- vs optimized string is an implementation detail and doesn't belong in the docstring. It's not important for using the function.
The return value could use an example again. Otherwise it's correct.
I was using e-invoice and avoiding PDF invoice because invoicefile
doesn't necessarily be a pdf. Can also be an image, right?
Valid point. It can be in image if the right input-module is chosen. But that's not documented or tested anywhere. You can add this feature when improving the OCR module.
Until then you could use "path to electronic invoice file in PDF, JPEG or [whatever is supported] format". "E-invoice" alone is more like a brand name and not clearly defined.
Done.
"""Extracts structured elements (like data, amount, etc) from PDF/image invoices using different extraction backends
and pre-defined templates.
* reads template if no template assigned
* required fields are matches from templates
:param invoicefile:
path of electronic invoice file in PDF,JPEG,PNG (example: "/home/duskybomb/pdf/invoice.pdf")
:type invoicefile: str
:param templates: This parameter is optional
load templates as .yml file (template name). Templates are loaded using read_template function in loader.py
:param input_module: This parameter is optional
library to be used to extract text from given invoicefile,
Currently supported libraries are: pdftotext, pdfminer, tesseract (default pdftotext)
Note:
Import required input_module when using invoice2data as a library
Example: If you want to use pdfminer
>>> from invoice2data.main import pdfminer
>>> extract_data("/home/duskybomb/pdf/invoice.pdf", "com.aws-invoice.yaml", pdfminer)
:return: extracted and matched fields, False if no template matches
:rtype: dict
Example: {'issuer': 'OYO', 'amount': 1939.0, 'date': datetime.datetime(2017, 12, 31, 0, 0),
'invoice_number': 'IBZY2087', 'currency': 'INR', 'desc': 'Invoice IBZY2087
from OYO'}
"""
Should I make a PR now?
Looks good. Let me just go over all the changes. Then I probably push a small commit. Then you can merge the branch. I'll also be on IRC today and working on factur-x for most of the day.
"""Extracts structured data from PDF/image invoices.
This function uses the text extracted from a PDF file or image and
pre-defined regex templates to find structured data.
Reads template if no template assigned
Required fields are matches from templates
Parameters
----------
invoicefile : str
path of electronic invoice file in PDF,JPEG,PNG (example: "/home/duskybomb/pdf/invoice.pdf")
templates : list of instances of class `InvoiceTemplate`, optional
Templates are loaded using `read_template` function in `loader.py`
input_module : {'pdftotext', 'pdfminer', 'tesseract'}, optional
library to be used to extract text from given `invoicefile`,
Returns
-------
dict or False
extracted and matched fields or False if no template matches
Notes
-----
Import required `input_module` when using invoice2data as a library
See Also
--------
read_template : Function where templates are loaded
InvoiceTemplate : Class representing single template files that live as .yml files on the disk
Examples
--------
When using `invoice2data` as an library
>>> from invoice2data.input import pdftotext
>>> extract_data("invoice2data/test/pdfs/oyo.pdf", None, pdftotext)
{'issuer': 'OYO', 'amount': 1939.0, 'date': datetime.datetime(2017, 12, 31, 0, 0), 'invoice_number': 'IBZY2087',
'currency': 'INR', 'desc': 'Invoice IBZY2087 from OYO'}
"""
based on Numpydoc. Please check IRC
Done and Closing! 🥂
Hooray!
To help users get started quickly, let's add some docstrings. I linked to some style guides in the wiki that deal with the practice.