Open elnazsn1988 opened 4 years ago
I adapted the code from preprocessing.py to work with our data, so that it gets the same output preprocessing.py produces for FUNSD. My data comes from pytesseract.image_to_data.
@r000bin Thanks very much for your response. The issue is that when I run pytesseract on a document image, I don't always get the correct text, and definitely not if it's in a format similar to FUNSD. So how could we get such a high-accuracy set of texts with coordinates and bounding boxes? An example is attached. Unless there's something wrong with my hOCR extraction and output from Tesseract, could you possibly confirm whether you get the same output? Input:
Output from the command above: print(pytesseract.image_to_data(Image.open('1a.png')))
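For reference, image_to_data returns a TSV whose rows include word-level boxes (left, top, width, height) and a confidence value; rows that carry no recognized text have conf -1. A minimal sketch of parsing that TSV string into word/box pairs (the helper name and the sample rows are made up for illustration, not part of this repo):

```python
def parse_image_to_data(tsv: str):
    """Parse the TSV string from pytesseract.image_to_data into word dicts."""
    lines = tsv.strip().split("\n")
    header = lines[0].split("\t")
    rows = []
    for line in lines[1:]:
        cells = line.split("\t")
        if len(cells) != len(header):
            continue  # skip malformed rows
        row = dict(zip(header, cells))
        if row.get("text", "").strip():  # keep only rows with actual text
            x, y = int(row["left"]), int(row["top"])
            w, h = int(row["width"]), int(row["height"])
            rows.append({
                "text": row["text"],
                "box": (x, y, x + w, y + h),  # (x0, y0, x1, y1)
                "conf": float(row["conf"]),
            })
    return rows

# Hypothetical sample mimicking tesseract's TSV layout:
sample = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "5\t1\t1\t1\t1\t1\t10\t20\t50\t12\t96.0\tInvoice\n"
    "5\t1\t1\t1\t1\t2\t70\t20\t30\t12\t-1\t\n"
)
words = parse_image_to_data(sample)
```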
The resolution is so bad I can't even read it myself. How should Tesseract do better than that? I'm no Tesseract expert, but the best way to improve its output seems to be better image quality. Were you starting with a PDF and converting it poorly?
@r000bin We use the original images in the IIT-CDIP dataset, which has higher resolution than those in the RVL-CDIP dataset.
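Following up on the resolution point: if the source is a low-DPI render of a PDF, re-rendering at a higher DPI is best, but a simple upscale before OCR sometimes helps too. A minimal sketch with Pillow (the `upscale_for_ocr` helper and the DPI values are illustrative assumptions, not code from this repo):

```python
from PIL import Image

def upscale_for_ocr(img, target_dpi=300, source_dpi=100):
    """Resize an image as if re-rendered at target_dpi instead of source_dpi."""
    scale = target_dpi / source_dpi
    new_size = (round(img.width * scale), round(img.height * scale))
    # LANCZOS keeps glyph edges reasonably sharp when enlarging
    return img.resize(new_size, Image.LANCZOS)

# Dummy blank page just to demonstrate the size change:
img = Image.new("L", (100, 150), color=255)
big = upscale_for_ocr(img)
```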
@wolfshow @r000bin Thanks for the comments, that makes sense; I needed higher image resolution. I've removed the label annotations from the test.txt file to test the prediction function with --do_predict, but because of the missing labels I get stuck in the evaluation stage with this error:
File "run_seq_labeling.py", line 815, in <module>
main()
File "run_seq_labeling.py", line 776, in main
args, model, tokenizer, labels, pad_token_label_id, mode="test"
File "run_seq_labeling.py", line 311, in evaluate
eval_dataset = FunsdDataset(args, tokenizer, labels, pad_token_label_id, mode=mode)
File "/opt/conda/lib/python3.7/site-packages/layoutlm-0.0-py3.7.egg/layoutlm/data/funsd.py", line 29, in __init__
File "/opt/conda/lib/python3.7/site-packages/layoutlm-0.0-py3.7.egg/layoutlm/data/funsd.py", line 174, in read_examples_from_file
AssertionError
Have you been able to skip this for prediction?
@elnazsn1988 Now I get your point. It seems the code does not support that at the moment. I will soon be at the same point myself; as a workaround for predicting new samples, I would just set an O label for everything and keep going.
@wolfshow @r000bin @elnazsn1988 @MohitTuli Hey, has anyone figured out how to run prediction on new images? I am also having a lot of trouble understanding how the prediction flow works.
I managed to predict on new images now. I prepared them the same way I prepared my train/test data and set a label for every word, which I ignore at the end; I just look at the prediction.
@r000bin So when predicting on a new image, are you passing only the image, or are you creating an annotation file too? If you don't mind, please share some of your code showing how you feed in the new image.
@r000bin I am also confused about mapping questions to answers.
You need to run OCR on the image first and then get the output into the right shape for LayoutLM. layoutlm/examples/seq_labeling/preprocess.py is a pretty good example of how to get there. Have you already fine-tuned your LayoutLM?
I am using the FUNSD dataset. I have already fine-tuned and evaluated on it. But the question now is what to do when a new image comes in: how will I make the annotation file? Are you saying this can be achieved with OCR, and if so, which OCR are you using? I have tried Tesseract, but sometimes it fails.
Thanks, I understand.
@kbrajwani Are you able to predict on any images? If so, can you help me out as well? I have been struggling with this for a long time now.
@devildani Hey, it's simple: use any OCR system that gives you word-level bounding boxes and transcriptions, like Tesseract. Then write a script to format your OCR output like FUNSD; you can see the format here: https://guillaumejaume.github.io/FUNSD/description/ After that you can preprocess with funsd_preprocess.py and then use --do_predict.
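The conversion step above can be sketched as follows, wrapping word-level OCR output in the FUNSD annotation schema described at the link, with every entity given the placeholder label "other" and empty linking since there are no annotations. The `words_to_funsd` helper is made up for illustration; check the schema at the FUNSD page before relying on it:

```python
import json

def words_to_funsd(words_with_boxes):
    """Wrap (text, box) pairs in the FUNSD-style annotation dict."""
    form = []
    for i, (text, box) in enumerate(words_with_boxes):
        form.append({
            "id": i,
            "text": text,
            "box": list(box),            # [x0, y0, x1, y1]
            "label": "other",            # placeholder; real labels unknown
            "words": [{"text": text, "box": list(box)}],
            "linking": [],               # no question-answer links available
        })
    return {"form": form}

doc = words_to_funsd([("Invoice", (10, 20, 60, 32))])
print(json.dumps(doc, indent=2))
```

Each OCR word becomes its own single-word entity here; grouping words into multi-word entities would need line or block information from the OCR output.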
@kbrajwani Thanks for replying. I was going through the format at the link you shared. Do you know what "linking" means in that format? Also, while predicting we won't have any labels for the words; will that be okay?
@devildani Sorry for the delay. You can leave "linking" empty, and for labels you can give the "other" label to every word; at prediction time the model takes only the words and coordinates from the file, nothing else.
Sorry to trouble you. When we want to predict on a new image, must there be a corresponding test.txt file?
Hi @r000bin, I did as you mentioned but I am getting the following error: File "/usr/local/lib/python3.7/dist-packages/numpy/lib/function_base.py", line 410, in average "Weights sum to zero, can't be normalized" ZeroDivisionError: Weights sum to zero, can't be normalized
Any idea how to solve it?
Thank you
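For what it's worth, that ZeroDivisionError is raised by numpy.average whenever all the weights are zero; in this evaluation it plausibly happens because the placeholder labels contain no real entities, so every class ends up with zero support. A minimal sketch of the failure mode and a guard (this works around the symptom in your own metric code; it is not a fix inside the repo):

```python
import numpy as np

def safe_average(values, weights):
    """np.average, but return 0.0 instead of raising when weights sum to 0."""
    weights = np.asarray(weights, dtype=float)
    if weights.sum() == 0:
        return 0.0  # no support at all: nothing meaningful to average
    return float(np.average(values, weights=weights))
```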
I am getting the same error when trying to predict on a new image. I was able to find a workaround: I take the cache file of the testing dataset (FUNSD) and put it in the same folder as the new image's test.txt file before predicting. But the results I get cover only a part of the tokens I give as input; the rest of the tokens are truncated from the output. Changing max_seq_length does not help.
Hi - apologies for the avalanche of questions posted. I have read your paper https://arxiv.org/pdf/1912.13318.pdf and also studied FUNSD, previously using it to train a Faster R-CNN and predict questions and answers. One thing I cannot understand from the paper and this repo is how we ingest new images for prediction. The paper says:
"To utilize the layout information of each document, we need to obtain the location of each token. However, the pre-training dataset (IIT-CDIP Test Collection) only contains pure texts while missing their corresponding bounding boxes. In this case, we re-process the scanned document images to obtain the necessary layout information. Like the original pre-processing in IIT-CDIP Test Collection, we similarly process the dataset by applying OCR to document images. The difference is that we obtain both the recognized words and their corresponding locations in the document image. Thanks to Tesseract, an open-source OCR engine, we can easily obtain the recognition as well as the 2-D positions. We store the OCR results in hOCR format, a standard specification format which clearly defines the OCR results of one single document image using a hierarchical representation"
However, although with Tesseract I can get the words, bounding boxes and hierarchies, it does not provide an annotated document as input (as shown in the repo example for test):
https://github.com/microsoft/unilm/pull/155
How do we go from a base input document in hOCR format, with words, hierarchies and coordinates, to the annotation format used in the example here? The --do_predict run reads test.txt from the data folder, which already has questions and answers next to the words.
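As a starting point for the hOCR side of that pipeline: hOCR stores each word's box in the title attribute ('bbox x0 y0 x1 y1') of spans with class ocrx_word, so the words and boxes can be pulled out with the standard library alone. A minimal sketch (the parser class is made up for illustration, and real hOCR nests pages, areas, and lines that this flat extraction ignores):

```python
import re
from html.parser import HTMLParser

class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) pairs from hOCR ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._box = None  # box of the span currently being read

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "ocrx_word" in a.get("class", ""):
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", a.get("title", ""))
            self._box = tuple(map(int, m.groups())) if m else None

    def handle_data(self, data):
        if self._box and data.strip():
            self.words.append((data.strip(), self._box))
            self._box = None

# Hypothetical one-word hOCR fragment:
sample = ('<span class="ocrx_word" title="bbox 10 20 60 32; x_wconf 96">'
          'Invoice</span>')
p = HocrWords()
p.feed(sample)
```

From there the word/box pairs still need to be written into the FUNSD-style annotation files before funsd_preprocess.py and --do_predict can consume them.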