Closed Valeriia1993 closed 1 year ago
If you want to annotate this data with Sparrow UI and export it in this JSON format, you need to update the export_labels function in Sparrow UI with this logic.
If you already have this data in JSON format, no changes to Sparrow Data are required. You should run:
python run_donut.py
python run_donut_upload.py
python run_donut_test.py
See the Sparrow Data README file.
I have many multi-page PDFs and the JSON mentioned above. I plan to use this data as a dataset. Should I convert the PDFs to images?
The current Sparrow code here on GitHub works with single-page PDFs only. I asked in the Donut GitHub repo whether Donut works with multi-page documents, but there was no answer. Out of the box, it doesn't work with multiple pages. The solution I'm currently implementing is to convert a multi-page PDF into a single image and feed it into Donut. But this code is not yet ready to be published.
Yes, PDFs should be converted to images, because Donut works with images.
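For anyone looking for a starting point, here is a rough sketch of the PDF-to-image step. It assumes the third-party pdf2image package (which in turn needs Poppler installed) and writes one JPEG per page. The helper names page_image_name and pdf_to_images, the file naming scheme, and the 200 DPI default are illustrative choices, not part of Sparrow or Donut.

```python
from pathlib import Path


def page_image_name(pdf_path, page_number):
    """Build an output file name such as 'invoice_page_001.jpg' (hypothetical scheme)."""
    return f"{Path(pdf_path).stem}_page_{page_number:03d}.jpg"


def pdf_to_images(pdf_path, out_dir, dpi=200):
    """Render every page of a PDF to its own JPEG and return the saved paths."""
    # Imported here so the naming helper above works without pdf2image installed.
    from pdf2image import convert_from_path  # third-party, requires Poppler

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    saved = []
    pages = convert_from_path(str(pdf_path), dpi=dpi)  # list of PIL images
    for number, page in enumerate(pages, start=1):
        target = out_dir / page_image_name(pdf_path, number)
        page.save(target, "JPEG")
        saved.append(target)
    return saved
```

Calling pdf_to_images("invoice.pdf", "images") would leave one image per page in the images folder. Since the current Sparrow code expects single-page input, each page would still need its own labels.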
Regarding the German language: you should check the Donut GitHub repo, which describes how to fine-tune the model for additional languages.
I am trying to get a result like the following JSON:
{
  "INVOICE_HEADER": {
    "INVOICE_INFO": {
      "INVOICE_DATE": "2023-07-28",
      "INVOICE_ID": "L05-4254515",
      "INVOICE_ISSUER_IDREF": 442985,
      "INVOICE_RECIPIENT_IDREF": { "content": "41420-0000411428-89", "type": "netcomID" },
      "HEADER_UDX": {
        "RE_PK": 1010,
        "INV_REFERENCE_NUMBER": "N/A",
        "INV_NET_AMOUNT2": "N/A",
        "INV_NET_AMOUNT3": "N/A",
        "INV_TAX_RATE2": "N/A",
        "QR_IBAN": "CH1130778010700502202",
        "QR_REFERENCE": 4.017800000004255e+25,
        "INV_TAX_RATE3": "N/A",
        "DocumentNr": 60124433,
        "PON": "N/A",
        "INV_TAX_AMOUNT2": "N/A",
        "INV_TAX_AMOUNT3": "N/A",
        "QR_INFORMATION": "N/A",
        "INV_IS_MM": 0,
        "RE_ILN": 7610227000016,
        "INV_DELIVERY_DATE": "26.07.2023",
        "RE_RECIPIENT_NO": 8,
        "ESR_ROW": "N/A",
        "INV_CREDIT_NOTE": 0
      },
      "CURRENCY": "CHF",
      "PARTIES": {
        "PARTY": [
          {
            "PARTY_ROLE": "invoice_issuer",
            "ADDRESS": {
              "TAX_NUMBER": "CHE-104.537.601",
              "CITY": "Weiningen ZH",
              "VAT_ID": "N/A",
              "NAME": "Auto AG Truck",
              "STREET": "Im Gewerbepark 1",
              "NAME2": "N/A",
              "COUNTRY": "CH",
              "ZIP": 8104
            },
            "PARTY_ID": [ 442985, { "content": "41001-0000415300-14", "type": "netcomID" } ]
          },
          {
            "PARTY_ROLE": "invoice_recipient",
            "ADDRESS": {
              "CITY": "Schaan",
              "VAT_ID": "LI50552",
              "NAME": "Hilcona AG",
              "STREET": "Bendererstrasse 21",
              "COUNTRY": "LI",
              "ZIP": 9494
            },
            "PARTY_ID": "41420-0000411428-89"
          }
        ]
      }
    }
  },
  "INVOICE_ITEM_LIST": {
    "INVOICE_ITEM": {
      "ITEM_UDX": {
        "INVI_ORI_ARTICLE_NO": "206-555-04",
        "OR_ORDER_NO": "206-555-0144",
        "INVI_ORDER_NO": "206-555-0144",
        "OR_DELIVERY_DATE": "2019-11-11",
        "OR_DELIVERY_NO": 1,
        "OR_TOTAL_NET_PRICE": 100
      },
      "QUANTITY": 1,
      "LINE_ITEM_ID": 1,
      "PRICE_LINE_AMOUNT": 110,
      "PRODUCT_ID": { "DESCRIPTION_SHORT": "iom_dummy" },
      "ORDER_UNIT": "C62",
      "PRODUCT_PRICE_FIX": { "PRICE_AMOUNT": 110 }
    }
  },
  "version": 2.1,
  "INVOICE_SUMMARY": {
    "TOTAL_TAX": { "TAX_DETAILS_FIX": { "TAX": 7.7, "TAX_AMOUNT": 41.75 } },
    "NET_VALUE_GOODS": 542.29,
    "TOTAL_ITEM_NUM": 1,
    "TOTAL_AMOUNT": 584.05
  }
}
To do that, I think sparrow-data needs to be changed.
The PDFs are in German, but the invoice is in English.
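To make the target format more concrete, here is a minimal sketch of how a nested label dictionary like the one above could be turned into one line of a Donut-style metadata.jsonl file. As far as I understand the Donut README, each line pairs a file_name with a ground_truth string that wraps the labels in a gt_parse key. The helpers stringify_leaves and to_metadata_line and the image name invoice_001.jpg are hypothetical, and this thread does not confirm whether run_donut.py already performs this step.

```python
import json


def stringify_leaves(node):
    """Recursively turn every scalar leaf into a string, keeping dicts and lists."""
    if isinstance(node, dict):
        return {key: stringify_leaves(value) for key, value in node.items()}
    if isinstance(node, list):
        return [stringify_leaves(item) for item in node]
    return str(node)


def to_metadata_line(image_name, labels):
    """Build one metadata.jsonl line; ground_truth is itself a JSON string."""
    ground_truth = json.dumps({"gt_parse": stringify_leaves(labels)})
    return json.dumps({"file_name": image_name, "ground_truth": ground_truth})


# A small excerpt of the invoice fields, used only for illustration.
sample = {"INVOICE_ID": "L05-4254515", "CURRENCY": "CHF", "TOTAL_AMOUNT": 584.05}
line = to_metadata_line("invoice_001.jpg", sample)
print(line)
```

Stringifying the leaves is a design choice here, made because the model is trained to emit text. Note that QR_REFERENCE in the sample is already a float in scientific notation, so its original digits are lost before this step and cannot be recovered by it. Long identifiers like that are safer stored as strings in the source JSON.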