katanaml / sparrow

Data processing with ML, LLM and Vision LLM
https://katanaml.io
GNU General Public License v3.0

How can I make a sub group? #27

Closed by Valeriia1993 1 year ago

Valeriia1993 commented 1 year ago

I am trying to get a result like the following JSON:

{
  "INVOICE_HEADER": {
    "INVOICE_INFO": {
      "INVOICE_DATE": "2023-07-28",
      "INVOICE_ID": "L05-4254515",
      "INVOICE_ISSUER_IDREF": 442985,
      "INVOICE_RECIPIENT_IDREF": {
        "content": "41420-0000411428-89",
        "type": "netcomID"
      },
      "HEADER_UDX": {
        "RE_PK": 1010,
        "INV_REFERENCE_NUMBER": "N/A",
        "INV_NET_AMOUNT2": "N/A",
        "INV_NET_AMOUNT3": "N/A",
        "INV_TAX_RATE2": "N/A",
        "QR_IBAN": "CH1130778010700502202",
        "QR_REFERENCE": 4.017800000004255e+25,
        "INV_TAX_RATE3": "N/A",
        "DocumentNr": 60124433,
        "PON": "N/A",
        "INV_TAX_AMOUNT2": "N/A",
        "INV_TAX_AMOUNT3": "N/A",
        "QR_INFORMATION": "N/A",
        "INV_IS_MM": 0,
        "RE_ILN": 7610227000016,
        "INV_DELIVERY_DATE": "26.07.2023",
        "RE_RECIPIENT_NO": 8,
        "ESR_ROW": "N/A",
        "INV_CREDIT_NOTE": 0
      },
      "CURRENCY": "CHF",
      "PARTIES": {
        "PARTY": [
          {
            "PARTY_ROLE": "invoice_issuer",
            "ADDRESS": {
              "TAX_NUMBER": "CHE-104.537.601",
              "CITY": "Weiningen ZH",
              "VAT_ID": "N/A",
              "NAME": "Auto AG Truck",
              "STREET": "Im Gewerbepark 1",
              "NAME2": "N/A",
              "COUNTRY": "CH",
              "ZIP": 8104
            },
            "PARTY_ID": [
              442985,
              {
                "content": "41001-0000415300-14",
                "type": "netcomID"
              }
            ]
          },
          {
            "PARTY_ROLE": "invoice_recipient",
            "ADDRESS": {
              "CITY": "Schaan",
              "VAT_ID": "LI50552",
              "NAME": "Hilcona AG",
              "STREET": "Bendererstrasse 21",
              "COUNTRY": "LI",
              "ZIP": 9494
            },
            "PARTY_ID": "41420-0000411428-89"
          }
        ]
      }
    }
  },
  "INVOICE_ITEM_LIST": {
    "INVOICE_ITEM": {
      "ITEM_UDX": {
        "INVI_ORI_ARTICLE_NO": "206-555-04",
        "OR_ORDER_NO": "206-555-0144",
        "INVI_ORDER_NO": "206-555-0144",
        "OR_DELIVERY_DATE": "2019-11-11",
        "OR_DELIVERY_NO": 1,
        "OR_TOTAL_NET_PRICE": 100
      },
      "QUANTITY": 1,
      "LINE_ITEM_ID": 1,
      "PRICE_LINE_AMOUNT": 110,
      "PRODUCT_ID": {
        "DESCRIPTION_SHORT": "iom_dummy"
      },
      "ORDER_UNIT": "C62",
      "PRODUCT_PRICE_FIX": {
        "PRICE_AMOUNT": 110
      }
    }
  },
  "version": 2.1,
  "INVOICE_SUMMARY": {
    "TOTAL_TAX": {
      "TAX_DETAILS_FIX": {
        "TAX": 7.7,
        "TAX_AMOUNT": 41.75
      }
    },
    "NET_VALUE_GOODS": 542.29,
    "TOTAL_ITEM_NUM": 1,
    "TOTAL_AMOUNT": 584.05
  }
}

To do that, I think sparrow-data needs to be changed.
The PDFs are in German, but the invoice is in English.

abaranovskis-redsamurai commented 1 year ago

If you want to annotate this data with Sparrow UI and export it in this JSON format, you need to update the export_labels function in Sparrow UI with this logic.
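To illustrate the kind of grouping logic involved, here is a minimal sketch that builds nested sub groups from flat labels. It is not the actual export_labels code from Sparrow UI: the function name nest_labels, the dotted-path label convention, and the sample labels are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, Tuple


def nest_labels(flat_labels: Iterable[Tuple[str, Any]], sep: str = ".") -> Dict[str, Any]:
    """Build a nested dict from (path, value) pairs.

    Hypothetical convention: each label carries its full group path,
    e.g. "INVOICE_HEADER.INVOICE_INFO.INVOICE_ID".
    """
    root: Dict[str, Any] = {}
    for path, value in flat_labels:
        keys = path.split(sep)
        node = root
        for key in keys[:-1]:
            # Create intermediate groups on demand.
            node = node.setdefault(key, {})
        node[keys[-1]] = value
    return root


# Sample labels taken from the JSON in this issue.
labels = [
    ("INVOICE_HEADER.INVOICE_INFO.INVOICE_DATE", "2023-07-28"),
    ("INVOICE_HEADER.INVOICE_INFO.INVOICE_ID", "L05-4254515"),
    ("INVOICE_HEADER.INVOICE_INFO.CURRENCY", "CHF"),
    ("INVOICE_SUMMARY.TOTAL_AMOUNT", 584.05),
]
nested = nest_labels(labels)
```

The sketch assumes a path never uses the same key both as a value and as a group; repeated groups such as the PARTY list would need extra handling.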

If you already have this data in JSON format, no changes to Sparrow Data are required. Just run:

python run_donut.py
python run_donut_upload.py
python run_donut_test.py

See the Sparrow Data README file.

Valeriia1993 commented 1 year ago

I have many multi-page PDFs and JSON files like the one mentioned above. I plan to use this data as a dataset. Should I convert the PDFs to images?

abaranovskis-redsamurai commented 1 year ago

The current Sparrow code here on GitHub works with single-page PDFs only. I asked in the Donut GitHub repo whether Donut works with multi-page documents, but there was no answer. Out of the box, it doesn't. The solution I'm currently implementing is to convert a multi-page PDF into a single image and feed it into Donut, but this code is not ready to be published yet.
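Until that code is published, the idea can be sketched roughly as follows. This is not the Sparrow implementation: it assumes the third-party packages pdf2image (which needs Poppler installed) and Pillow, and the function names stack_layout and pdf_to_single_image are made up for this example.

```python
from typing import List, Sequence, Tuple


def stack_layout(sizes: Sequence[Tuple[int, int]]) -> Tuple[Tuple[int, int], List[int]]:
    """Given (width, height) per page, return the canvas size and the
    y-offset at which each page is pasted when stacked top to bottom."""
    if not sizes:
        raise ValueError("no pages to stack")
    width = max(w for w, _ in sizes)
    offsets: List[int] = []
    y = 0
    for _, h in sizes:
        offsets.append(y)
        y += h
    return (width, y), offsets


def pdf_to_single_image(pdf_path: str, out_path: str, dpi: int = 150) -> None:
    """Render every page of a PDF and paste the pages onto one white canvas."""
    # Third-party imports stay local so stack_layout works without them.
    from pdf2image import convert_from_path  # requires Poppler
    from PIL import Image

    pages = convert_from_path(pdf_path, dpi=dpi)
    canvas_size, offsets = stack_layout([page.size for page in pages])
    canvas = Image.new("RGB", canvas_size, "white")
    for page, y in zip(pages, offsets):
        canvas.paste(page, (0, y))
    canvas.save(out_path)
```

One caveat with this approach: a tall stitched image will most likely be downscaled to the model's input size, so small text may become harder to read as the page count grows.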

Yes, PDFs should be converted to images, because Donut works with images.
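For reference, the Donut repository describes a dataset layout in which each image is paired with its target JSON through a metadata.jsonl file. The sketch below shows what one such line could look like. The helper name donut_metadata_line and the image file name are hypothetical, and the Sparrow Data scripts may already produce this file for you, so treat it only as an illustration of the format.

```python
import json
from typing import Any, Dict


def donut_metadata_line(image_name: str, parsed: Dict[str, Any]) -> str:
    """Return one metadata.jsonl line pairing an image with its target JSON."""
    # The target is stored as a JSON *string* under "ground_truth",
    # with the structured fields nested under "gt_parse".
    ground_truth = json.dumps({"gt_parse": parsed}, ensure_ascii=False)
    return json.dumps(
        {"file_name": image_name, "ground_truth": ground_truth},
        ensure_ascii=False,
    )


line = donut_metadata_line(
    "invoice_0001.jpg",  # hypothetical image file name
    {"INVOICE_HEADER": {"INVOICE_INFO": {"INVOICE_ID": "L05-4254515", "CURRENCY": "CHF"}}},
)
```

Nested groups in the target JSON are kept as they are, so the sub group structure from the original question carries over unchanged.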

Regarding the German language: check the Donut GitHub repo, which describes how to fine-tune the model for additional languages.