sriprad closed this issue 2 years ago
Hi @sriprad,
Thanks for your question. For the moment, the library does not include Key Information Extraction (if that is what you are referring to); we only provide "generic" OCR. We may consider adding "smarter" features using NLP, visual context, etc. in a few months, but for now we do not support those tasks.
In the meantime, you are free to use our "raw" OCR predictions and train another model on top of them to extract your information, using the geometrical and textual context provided by our detection and recognition models.
I hope this answers your question.
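To make "geometrical/textual context" a bit more concrete, here is a minimal sketch of one simple way to use it. This is a rule-based stand-in, not a trained model and not something the library provides: the word records, the relative `((xmin, ymin), (xmax, ymax))` box layout, and the helper names `vertical_overlap` and `right_neighbor` are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical word records (made-up values): recognized text plus a relative
# bounding box ((xmin, ymin), (xmax, ymax)) with coordinates in [0, 1].
WORDS = [
    {"value": "Invoice", "geometry": ((0.10, 0.10), (0.20, 0.12))},
    {"value": "No", "geometry": ((0.21, 0.10), (0.25, 0.12))},
    {"value": "23994", "geometry": ((0.30, 0.10), (0.38, 0.12))},
    {"value": "Date", "geometry": ((0.10, 0.15), (0.16, 0.17))},
    {"value": "23/01/2021", "geometry": ((0.30, 0.15), (0.45, 0.17))},
]


def vertical_overlap(a, b) -> float:
    """Length of the shared vertical extent of two boxes (0.0 if none)."""
    (_, a_top), (_, a_bottom) = a
    (_, b_top), (_, b_bottom) = b
    return max(0.0, min(a_bottom, b_bottom) - max(a_top, b_top))


def right_neighbor(label: dict, words: list) -> Optional[dict]:
    """Return the closest word to the right of `label` on the same text row."""
    label_xmax = label["geometry"][1][0]
    candidates = [
        w
        for w in words
        if w is not label
        and w["geometry"][0][0] >= label_xmax  # starts to the right of the label
        and vertical_overlap(label["geometry"], w["geometry"]) > 0  # same row
    ]
    # Among the candidates, pick the one whose left edge is nearest.
    return min(candidates, key=lambda w: w["geometry"][0][0], default=None)
```

With the sample above, `right_neighbor(WORDS[1], WORDS)` returns the record for "23994". A learned model could consume the same text and box features instead of this hand-written rule.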
Have a nice day! :smile:
Hi @charlesmindee,
Thank you :). Yes, I meant key information extraction. Great, I will look forward to it. Can you please share the link to the "raw" OCR predictions? I am not sure where to look.
Have a nice day too :)
Thanks, Sri
Hi @sriprad :wave:
Here is a short snippet to parse your PDF (cf. README):
from doctr.io import DocumentFile
from doctr.models import ocr_predictor
model = ocr_predictor(pretrained=True)
# PDF
doc = DocumentFile.from_pdf("path/to/your/doc.pdf").as_images()
# Analyze
result = model(doc)
# If you want the JSON version
json_export = result.export()
Let me know if that's unclear :) Cheers!
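In case it helps with reading the export: the dictionary is nested rather than tabular. The sketch below assumes a pages > blocks > lines > words layout, where each word carries `value`, `confidence` and a relative `geometry` box. Please check those keys against your own output, since they are an assumption here, and the sample dictionary is hand-written rather than real model output. `export_to_rows` is a hypothetical helper, not part of the library.

```python
def export_to_rows(export: dict) -> list:
    """Flatten a nested OCR export into one flat dict per recognized word."""
    rows = []
    for page in export.get("pages", []):
        for block_idx, block in enumerate(page.get("blocks", [])):
            for line_idx, line in enumerate(block.get("lines", [])):
                for word in line.get("words", []):
                    # Assumed box layout: ((xmin, ymin), (xmax, ymax)).
                    (xmin, ymin), (xmax, ymax) = word["geometry"]
                    rows.append(
                        {
                            "page": page.get("page_idx"),
                            "block": block_idx,
                            "line": line_idx,
                            "text": word["value"],
                            "confidence": word.get("confidence"),
                            "xmin": xmin,
                            "ymin": ymin,
                            "xmax": xmax,
                            "ymax": ymax,
                        }
                    )
    return rows


# Hand-written sample that mimics the assumed structure (values are made up).
SAMPLE_EXPORT = {
    "pages": [
        {
            "page_idx": 0,
            "blocks": [
                {
                    "lines": [
                        {
                            "words": [
                                {"value": "Invoice", "confidence": 0.99,
                                 "geometry": ((0.10, 0.10), (0.20, 0.12))},
                                {"value": "No", "confidence": 0.98,
                                 "geometry": ((0.21, 0.10), (0.25, 0.12))},
                                {"value": "23994", "confidence": 0.97,
                                 "geometry": ((0.30, 0.10), (0.38, 0.12))},
                            ]
                        }
                    ]
                }
            ],
        }
    ]
}

rows = export_to_rows(SAMPLE_EXPORT)
```

If pandas is installed, `pandas.DataFrame(export_to_rows(json_export))` should then give one row per word, which is usually easier to filter and sort than the nested dictionary.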
Hi @fg-mindee,
Thank you very much :). I replicated what you shared and got the output. It looks great. But it is a JSON output and I don't know how to convert it into a dataframe. I tried read_json, but it is not giving me a result.
Please let me know if you have any solution :)
Regards, Sri
@sriprad oh my bad, would you mind specifying what you mean by "dataframe" please? If you could give a short example, that would help a lot :pray:
Hi @fg-mindee,
Sorry if I was not clear. I mean translating the JSON output to the format below
@sriprad oh I see, the snippet I showed you was for raw OCR predictions! For now, the KIE output (the dataframe you're referring to) cannot be obtained through DocTR, so you'll have to post-process the raw predictions (try some regex, for instance) to get it!
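To illustrate the regex idea, here is a minimal sketch using the two fields from the original question ("Invoice No ; 23994" and "Date : 23/01/2021"). The patterns, the `FIELD_PATTERNS` table and the `extract_fields` helper are assumptions written for this example only; real invoices will need patterns adapted to their own labels and layouts.

```python
import re

# One pattern per field; each captures the value that follows the label.
# Hypothetical patterns, tuned only to the sample text below.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice\s*No\.?\s*[:;]?\s*(\d+)", re.IGNORECASE),
    "date": re.compile(r"Date\s*[:;]?\s*(\d{2}/\d{2}/\d{4})", re.IGNORECASE),
}


def extract_fields(text: str) -> dict:
    """Return the first match for each field, or None when it is not found."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


# Text as it might look after joining the recognized words of a page.
SAMPLE_TEXT = "Invoice No ; 23994 Date : 23/01/2021"
fields = extract_fields(SAMPLE_TEXT)
```

Each document then yields one dictionary of fields, and a list of such dictionaries can be passed to `pandas.DataFrame` to get one row per invoice.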
Thank you @fg-mindee :)
Why don't we have a common method to provide this functionality? I will give it a try, and if it works fine, I will raise a PR.
@VpkPrasanna well, the post-processing is specific to the type of fields that you want. Considering there is a large number of options, it's hard to know which ones to support! So we welcome external contributions on that part :)
@VpkPrasanna Any update for the PR?
Hi Team, great work on this library. Is there a way to extract information from a PDF? For example, I would like to extract Invoice No ; 23994, Date : 23/01/2021, etc. Is it possible to get the information in the invoice/PDF extracted into a pandas dataframe? Please let me know if there is code to extract this information. Thanks