invoice-x / invoice2data

Extract structured data from PDF invoices
MIT License
1.85k stars 482 forks source link

Add proper Docstrings to all public functions and classes #119

Closed m3nu closed 6 years ago

m3nu commented 6 years ago

To help users get started quickly, let's add some docstrings. I linked to some style guides in the wiki that deal with the practice.

duskybomb commented 6 years ago

Should I add single line docstring (telling what it does and returns) or a multi-line (giving info about all variables and return values)?

m3nu commented 6 years ago

Complex public-facing functions will need multiple lines with all details.

Simple internal helper functions may only need a single line.

m3nu commented 6 years ago

In the end, anyone who never used the library needs to know enough about arguments and return values to get started quickly. This is true for invoice2data and even more so for facturx.

duskybomb commented 6 years ago

How does this look for extract_data()

"""Function to extract data from e-invoices

:param invoicefile: path of invoice
:param templates: load templates (template name)
:param input_module: input module to use out of input_mapping dict. (default pdftotext)
:return: dict of extracted and matched fields or False if no template matches
"""
m3nu commented 6 years ago

Like in maths this needs precision because those things make a difference.

duskybomb commented 6 years ago

Like this?

 """extract data from e-invoices
     * reads template if no template assigned
     * text is extracted from invoice using input module and stored in extracted_str
     * extracted text is sent for optimization in prepare_input()
     * required fields are matches from templates

    Note: Import all required module when using as a library
        Example: If you want to use pdfminer
            >>> from invoice2data.main import pdfminer
            >>> extract_data("/home/duskybomb/pdf/invoice.pdf", "com.aws-invoice.yaml", pdfminer)

    :param invoicefile: path of invoice (example: "/home/duskybomb/pdf/invoice.pdf")
    :param templates: load templates (template name)
    :param input_module: input module to use out of input_mapping dict. (default pdftotext)

    :var extracted_str: saves extracted text from input module
    :var optimized_str: stores optimized string after passing extracted_str to prepare_input()

    :return: dict of extracted and matched fields, False if no template matches
 """

Made new commits to branch issue/119. Do check.

m3nu commented 6 years ago

I like the idea of giving an example. That's good. PEP doesn't define a sample for an example, but Google does.

For the first sentence, let's avoid the word "e-invoice" since it's not defined or used anywhere. Maybe like this:

Extracts structured elements (like data, amount, etc) from PDF invoices using different extraction backends and pre-defined templates.

The list (reads template...) is really an extension of input and output params. So let's move it there and extend the input- and output arguments a bit.

And regarding params:

Then the part of extracted- vs optimized string is an implementation detail and doesn't belong in the docstring. It's not important for using the function.

The return value could use an example again. Otherwise it's correct.

duskybomb commented 6 years ago

I was using e-invoice and avoiding PDF invoice because invoicefile doesn't necessarily be a pdf. Can also be an image, right?

m3nu commented 6 years ago

Valid point. It can be in image if the right input-module is chosen. But that's not documented or tested anywhere. You can add this feature when improving the OCR module.

Until then you could use "path to electronic invoice file in PDF, JPEG or [whatever is supported] format". "E-invoice" alone is more like a brand name and not clearly defined.

duskybomb commented 6 years ago

Done.

"""Extracts structured elements (like data, amount, etc) from PDF/image invoices using different extraction backends
        and pre-defined templates.
     * reads template if no template assigned
     * required fields are matches from templates

    :param invoicefile:
        path of electronic invoice file in PDF,JPEG,PNG (example: "/home/duskybomb/pdf/invoice.pdf")
        :type invoicefile: str
    :param templates: This parameter is optional
        load templates as .yml file (template name). Templates are loaded using read_template function in loader.py
    :param input_module: This parameter is optional
        library to be used to extract text from given invoicefile,
        Currently supported libraries are: pdftotext, pdfminer, tesseract (default pdftotext)

        Note:
            Import required input_module when using invoice2data as a library
            Example: If you want to use pdfminer
                >>> from invoice2data.main import pdfminer
                >>> extract_data("/home/duskybomb/pdf/invoice.pdf", "com.aws-invoice.yaml", pdfminer)

    :return: extracted and matched fields, False if no template matches
    :rtype: dict
        Example:    {'issuer': 'OYO', 'amount': 1939.0, 'date': datetime.datetime(2017, 12, 31, 0, 0),
                    'invoice_number': 'IBZY2087', 'currency': 'INR', 'desc': 'Invoice IBZY2087
                    from OYO'}

"""

Should I make a PR now?

m3nu commented 6 years ago

Looks good. Let me just go over all the changes. Then I probably push a small commit. Then you can merge the branch. I'll also be on IRC today and working on factur-x for most of the day.

duskybomb commented 6 years ago
"""Extracts structured data from PDF/image invoices.

    This function uses the text extracted from a PDF file or image and
    pre-defined regex templates to find structured data.

    Reads template if no template assigned
    Required fields are matches from templates

    Parameters
    ----------
    invoicefile : str
        path of electronic invoice file in PDF,JPEG,PNG (example: "/home/duskybomb/pdf/invoice.pdf")
    templates : list of instances of class `InvoiceTemplate`, optional
        Templates are loaded using `read_template` function in `loader.py`
    input_module : {'pdftotext', 'pdfminer', 'tesseract'}, optional
        library to be used to extract text from given `invoicefile`,

    Returns
    -------
    dict or False
        extracted and matched fields or False if no template matches

    Notes
    -----
    Import required `input_module` when using invoice2data as a library

    See Also
    --------
    read_template : Function where templates are loaded
    InvoiceTemplate : Class representing single template files that live as .yml files on the disk

    Examples
    --------
    When using `invoice2data` as an library

    >>> from invoice2data.input import pdftotext
    >>> extract_data("invoice2data/test/pdfs/oyo.pdf", None, pdftotext)
    {'issuer': 'OYO', 'amount': 1939.0, 'date': datetime.datetime(2017, 12, 31, 0, 0), 'invoice_number': 'IBZY2087', 
    'currency': 'INR', 'desc': 'Invoice IBZY2087 from OYO'}

"""

based on Numpydoc. Please check IRC

duskybomb commented 6 years ago

Done and Closing! 🥂

m3nu commented 6 years ago

Hooray!