how to create the dataframe out of the JSON Output?

mindee / doctr

docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.

https://mindee.github.io/doctr/

Apache License 2.0


how to create the dataframe out of the JSON Output? #509

Closed (sriprad closed this issue 2 years ago)

sriprad commented 2 years ago

Hi Team, great work on this library. Is there a way to extract information from a PDF? For example, I would like to extract Invoice No ; 23994 Date : 23/01/2021 etc. Is it possible to get the information from the invoice/PDF into a pandas DataFrame? Please let me know if there is code to extract this information. Thanks

charlesmindee commented 2 years ago

Hi @sriprad,

Thanks for your question. For the moment the lib does not include Key Information Extraction (if that is what you are talking about); we only have a "generic" OCR. We may consider adding "smarter" features with NLP, visual context, etc. in a few months, but for now we do not support those tasks.

In the meantime, you are free to use our "raw" OCR predictions and train another model on top of them to extract your information, using the geometrical/textual context provided by our detection and recognition models.

I hope this answers your question.

Have a nice day! :smile:

sriprad commented 2 years ago

Hi @charlesmindee ,

Thank you :). Yes, I meant key information extraction. Great, I will look forward to it. Can you please share the link to the "raw" OCR predictions? I am not sure where to look.

Have a nice day too :)

Thanks, Sri

fg-mindee commented 2 years ago

Hi @sriprad :wave:

Here is a short snippet to parse your pdf (cf. README):

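from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Load the pretrained detection + recognition pipeline
model = ocr_predictor(pretrained=True)
# Read the PDF and convert each page to an image
# (recent docTR releases return the page images directly from from_pdf, so drop .as_images() there)
doc = DocumentFile.from_pdf("path/to/your/doc.pdf").as_images()
# Run OCR on every page
result = model(doc)
# Export the result as a nested Python dict (JSON-like: pages > blocks > lines > words)
json_export = result.export()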

Let me know if that's unclear :) Cheers!

sriprad commented 2 years ago

Hi @fg-mindee ,

Thank you very much :). I replicated what you shared and got the output. It looks great, but it is a JSON output and I don't know how to convert it into a DataFrame. I tried read_json, but it does not give me a usable result.

Please let me know if you have any solution :)

Regards, Sri
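A note on why read_json does not help here: result.export() returns a nested Python dict (pages > blocks > lines > words), not a JSON file or a flat table. Below is a minimal sketch of flattening that structure into one row per word. The helper name export_to_rows is hypothetical, the sample dict is written by hand in the same shape as the export (it is not real OCR output), and the geometry unpacking assumes straight boxes stored as ((xmin, ymin), (xmax, ymax)).

```python
from typing import Any, Dict, List


def export_to_rows(export: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten a pages > blocks > lines > words dict into one row per word."""
    rows: List[Dict[str, Any]] = []
    for page in export.get("pages", []):
        for block_idx, block in enumerate(page.get("blocks", [])):
            for line_idx, line in enumerate(block.get("lines", [])):
                for word in line.get("words", []):
                    # Assumption: straight boxes given as two relative corner points
                    (xmin, ymin), (xmax, ymax) = word["geometry"]
                    rows.append(
                        {
                            "page_idx": page.get("page_idx"),
                            "block_idx": block_idx,
                            "line_idx": line_idx,
                            "value": word["value"],
                            "confidence": word["confidence"],
                            "xmin": xmin,
                            "ymin": ymin,
                            "xmax": xmax,
                            "ymax": ymax,
                        }
                    )
    return rows


# Hand-written sample in the shape of the export, for illustration only
sample_export = {
    "pages": [
        {
            "page_idx": 0,
            "blocks": [
                {
                    "lines": [
                        {
                            "words": [
                                {"value": "Invoice", "confidence": 0.99, "geometry": ((0.10, 0.10), (0.20, 0.12))},
                                {"value": "No", "confidence": 0.98, "geometry": ((0.21, 0.10), (0.25, 0.12))},
                                {"value": "23994", "confidence": 0.97, "geometry": ((0.28, 0.10), (0.36, 0.12))},
                            ]
                        },
                        {
                            "words": [
                                {"value": "Date", "confidence": 0.99, "geometry": ((0.10, 0.14), (0.16, 0.16))},
                                {"value": "23/01/2021", "confidence": 0.95, "geometry": ((0.19, 0.14), (0.33, 0.16))},
                            ]
                        },
                    ]
                }
            ],
        }
    ]
}

rows = export_to_rows(sample_export)
for row in rows:
    print(row["page_idx"], row["block_idx"], row["line_idx"], row["value"])
```

If pandas is installed, pandas.DataFrame(rows) turns this list into a DataFrame with one column per key. This gives a table of words and positions only; it does not pair labels such as "Invoice No" with their values.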

fg-mindee commented 2 years ago

@sriprad oh my bad, would you mind specifying what you mean by "dataframe" please? If you could give a short example, that would help a lot :pray:

sriprad commented 2 years ago

Hi @fg-mindee,

Sorry if I was not clear. I mean translating the JSON output into the format below.

fg-mindee commented 2 years ago

@sriprad oh I see, the snippet I showed you was for raw OCR predictions! For now, the KIE output (the dataframe you're referring to) cannot be obtained through docTR, so you'll have to post-process the raw predictions yourself (try some regex, for instance) to get it!
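As a rough illustration of the regex route, here is a minimal sketch that pulls two fields out of recognized text. The field names, the patterns, and the helper extract_fields are all hypothetical and tuned to the example from the original question; real invoices will need their own patterns, and OCR misreads can break a match.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns: a label, an optional ":" or ";" separator, then the value
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice\s*No\.?\s*[:;]?\s*(\d+)", re.IGNORECASE),
    "date": re.compile(r"Date\s*[:;]?\s*(\d{2}/\d{2}/\d{4})", re.IGNORECASE),
}


def extract_fields(text: str) -> Dict[str, Optional[str]]:
    """Return the first match for each field, or None when a pattern finds nothing."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


# Text as it might look after joining the recognized words of a page (illustrative)
ocr_text = "Invoice No ; 23994 Date : 23/01/2021"
fields = extract_fields(ocr_text)
print(fields)
```

Each call returns one dict per document, so a list of these dicts (one per invoice) can be passed to pandas.DataFrame to get one row per invoice.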

sriprad commented 2 years ago

Thank you @fg-mindee :)

VpkPrasanna commented 2 years ago

> @sriprad oh I see, the snippet I showed you was for raw OCR predictions! For now, the KIE (dataframe you're referring to) cannot be obtained through DocTR, so you'll have to post-process those results (try some regex for instance) to get these results!

Why don't we have a common method for this functionality? I will give it a try, and if it works fine, I will raise a PR.

frgfm commented 2 years ago

@VpkPrasanna well, the post-processing is specific to the type of fields that you want. Considering there is a large number of options, it's hard to know which to support! So we welcome external contributions on that part :)

ariefwijaya commented 8 months ago

@VpkPrasanna Any update on the PR?