microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

How to use new image for prediction #213

Open elnazsn1988 opened 4 years ago

elnazsn1988 commented 4 years ago

Hi - Apologies for the avalanche of questions posted. I have read your paper https://arxiv.org/pdf/1912.13318.pdf and also studied FUNSD, previously using it to train a Faster R-CNN and predict questions and answers. One thing I cannot understand from the paper and this repo is: how do we ingest new images for prediction? The paper says:

"To utilize the layout information of each document, we need toobtain the location of each token. However, the pre-training dataset(IIT-CDIP Test Collection) only contains pure texts while missing their corresponding bounding boxes. In this case, we re-process thescanned document images to obtain the necessary layout informa-tion. Like the original pre-processing in IIT-CDIP Test Collection,we similarly process the dataset by applying OCR to documentimages. The difference is that we obtain both the recognized wordsand their corresponding locations in the document image. Thanksto Tesseract6, an open-source OCR engine, we can easily obtain therecognition as well as the 2-D positions. We store the OCR results inhOCR format, a standard specification format which clearly definesthe OCR results of one single document image using a hierarchical representation"

However, although with Tesseract I can get the words, bounding boxes, and hierarchies, it does not produce an annotated document of the kind used as test input in the repo example:

https://github.com/microsoft/unilm/pull/155

pred_dir = "predictions"

!python run_seq_labeling.py  --do_predict \
                            --data_dir data \
                            --model_type layoutlm \
                            --model_name_or_path {out_dir} \
                            --do_lower_case \
                            --output_dir predictions \
                            --labels data/labels.txt \
                            --fp16

How do we go from a base input document in hOCR format (words, hierarchies, and coordinates) to the annotation format used in the example here? The --do_predict run reads test.txt from the data folder, which already has question and answer labels next to the words.

r000bin commented 4 years ago

I adapted the code from preprocess.py to work with our data, so that it produces the same output as preprocess.py does for FUNSD. My data comes from pytesseract.image_to_data.
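Roughly, the conversion looks like this (a minimal sketch, not my exact code; it assumes pytesseract's standard TSV columns and the 0-1000 box normalization that preprocess.py applies to FUNSD):

import pytesseract
from PIL import Image

def ocr_words_and_boxes(image_path):
    # image_to_data returns one entry per detected item; entries with
    # empty text are layout elements (blocks, paragraphs, lines), not words.
    image = Image.open(image_path)
    width, height = image.size
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

    words, boxes = [], []
    for i in range(len(data["text"])):
        word = data["text"][i].strip()
        if not word:
            continue
        x, y, w, h = data["left"][i], data["top"][i], data["width"][i], data["height"][i]
        # Normalize pixel coordinates to the 0-1000 range LayoutLM expects.
        boxes.append([
            int(1000 * x / width),
            int(1000 * y / height),
            int(1000 * (x + w) / width),
            int(1000 * (y + h) / height),
        ])
        words.append(word)
    return words, boxes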

elnazsn1988 commented 4 years ago

@r000bin Thanks very much for your response. The issue is that when I run pytesseract on a document image, I don't always get the correct text, and definitely not if the document is in a format similar to FUNSD's. So how could we get texts with coordinates and bounding boxes at such a high accuracy level? An example is attached; unless there's something wrong with my hOCR extraction and output from Tesseract, could you possibly confirm whether you get the same output? Input:

[attached image: scanned input document]

Output from the command print(pytesseract.image_to_data(Image.open('1a.png'))): [attached image: pytesseract output]

r000bin commented 4 years ago

The resolution is so bad I can't even read it myself; how should Tesseract do better than that? I'm no Tesseract expert, but the best way to improve Tesseract's results seems to be better image quality. Were you starting with a PDF and converting it at low quality?

wolfshow commented 4 years ago

@r000bin We use the original images in the IIT-CDIP dataset, which have higher resolution than those in the RVL-CDIP dataset.

elnazsn1988 commented 4 years ago

@wolfshow @r000bin Thanks for the comments, that makes sense; I needed higher image resolution. I've removed the label annotations from the test.txt file to test the prediction function with --do_predict, but due to the lack of labels I get stuck in the evaluation stage with this error:

File "run_seq_labeling.py", line 815, in <module>
    main()
  File "run_seq_labeling.py", line 776, in main
    args, model, tokenizer, labels, pad_token_label_id, mode="test"
  File "run_seq_labeling.py", line 311, in evaluate
    eval_dataset = FunsdDataset(args, tokenizer, labels, pad_token_label_id, mode=mode)
  File "/opt/conda/lib/python3.7/site-packages/layoutlm-0.0-py3.7.egg/layoutlm/data/funsd.py", line 29, in __init__
  File "/opt/conda/lib/python3.7/site-packages/layoutlm-0.0-py3.7.egg/layoutlm/data/funsd.py", line 174, in read_examples_from_file
AssertionError

Have you been able to skip this for prediction?

r000bin commented 4 years ago

@elnazsn1988 Now I get your point. It seems the code does not support that at the moment. I will soon be at the same point, and as a workaround I would just set an O label on everything when predicting new samples, to keep going.

kbrajwani commented 4 years ago

@wolfshow @r000bin @elnazsn1988 @MohitTuli Hey, has anyone figured out how to run prediction on new images? I am also having a lot of trouble understanding how the prediction flow works.

r000bin commented 4 years ago

I have now managed to predict on new images. I prepared them the same way I prepared my train/test data, and I set a label for every word, which I ignore at the end and just look at the prediction.
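For reference, writing the input files for a single new image might look something like this (a sketch only; every word gets the dummy label O, the boxes are assumed already normalized to 0-1000, and the exact columns should be double-checked against what preprocess.py emits for FUNSD):

def write_prediction_files(words, boxes, image_name, image_size, out_dir="data"):
    # Mirrors the three files preprocess.py writes for FUNSD:
    #   test.txt       -> "WORD<TAB>LABEL"
    #   test_box.txt   -> "WORD<TAB>x0 y0 x1 y1" (normalized box)
    #   test_image.txt -> "WORD<TAB>box<TAB>page size<TAB>file name"
    width, height = image_size
    with open(f"{out_dir}/test.txt", "w") as f_txt, \
         open(f"{out_dir}/test_box.txt", "w") as f_box, \
         open(f"{out_dir}/test_image.txt", "w") as f_img:
        for word, (x0, y0, x1, y1) in zip(words, boxes):
            f_txt.write(f"{word}\tO\n")  # dummy label, ignored after prediction
            f_box.write(f"{word}\t{x0} {y0} {x1} {y1}\n")
            f_img.write(f"{word}\t{x0} {y0} {x1} {y1}\t{width} {height}\t{image_name}\n")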

kbrajwani commented 4 years ago

@r000bin So when predicting on a new image, are you passing only the image, or are you creating an annotation file too? If you don't mind, please share some portion of your code showing how you feed in the new image.

kbrajwani commented 4 years ago

@r000bin I am also confused about mapping questions to answers.

r000bin commented 4 years ago

You need to do OCR on the image first and then get the output into the right shape for LayoutLM. The layoutlm/examples/seq_labeling/preprocess.py is a pretty good example of how to get there. Have you already fine-tuned your LayoutLM?

kbrajwani commented 4 years ago

I am using the FUNSD dataset. I have already fine-tuned and evaluated on that. But now the question is what to do when a new image comes in:

  1. How will I make the annotation file? Are you saying this can be achieved with OCR? Then which OCR are you using? I have tried Tesseract, but sometimes it fails.
  2. If I run prediction on one image, the model will give me all the question and answer entities, so how can I map which answer belongs to which question?

r000bin commented 4 years ago

  1. Use Tesseract and get all the words together with their bounding boxes. Adapt the code in layoutlm/examples/seq_labeling/preprocess.py to get the Tesseract output into the right format.
  2. I'm not working with question/answer pairs. In this issue it was mentioned to use another binary classifier: https://github.com/microsoft/unilm/issues/161

kbrajwani commented 4 years ago

Thanks, I understand.

pratik-dani commented 3 years ago

@kbrajwani Were you able to run prediction on new images? If so, can you help me out as well? I have been struggling with this for a long time now.

kbrajwani commented 3 years ago

@devildani Hey, it's simple: you have to use any OCR system that gives you word-level bounding boxes and transcriptions, like Tesseract. Then you can write a script to format your OCR output like FUNSD; you can see the format here: https://guillaumejaume.github.io/FUNSD/description/ After that you can preprocess with funsd_preprocess.py and then use --do_predict.
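For illustration, a FUNSD-style annotation for a single OCR'd block could be built like this (a hand-rolled sketch of the format described on the FUNSD page; the label is a placeholder and linking is left empty, since neither is needed for prediction):

import json

# One entry in "form" per text block; each word carries its own box.
annotation = {
    "form": [
        {
            "id": 0,
            "text": "Date:",
            "box": [70, 40, 120, 55],  # block-level box in pixels
            "label": "other",          # placeholder label for prediction
            "linking": [],             # no question/answer links required
            "words": [
                {"text": "Date:", "box": [70, 40, 120, 55]},
            ],
        },
    ]
}

with open("annotations/new_image.json", "w") as f:
    json.dump(annotation, f, indent=2)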

pratik-dani commented 3 years ago

@kbrajwani Thanks for replying. I was going through the format at the link you shared. Do you know what "linking" in that format is? Also, while predicting we won't have any labels for the words; will that be okay?

kbrajwani commented 3 years ago

@devildani Sorry for the delay. You can leave linking empty, and for labels you can give the "other" label to every word. At prediction time the model takes only the words and coordinates from the file, nothing else.

lvbohui commented 3 years ago

Sorry to trouble you. When we want to predict on a new image, must there be a corresponding test.txt file?

ssherlins commented 3 years ago

I have now managed to predict on new images. I prepared them the same way I prepared my train/test data, and I set a label for every word, which I ignore at the end and just look at the prediction.

Hi @r000bin, I did as you mentioned but I am getting the following error:

  File "/usr/local/lib/python3.7/dist-packages/numpy/lib/function_base.py", line 410, in average
    "Weights sum to zero, can't be normalized")
ZeroDivisionError: Weights sum to zero, can't be normalized

Any idea how to solve it?

Thank you

prachiarya15 commented 3 years ago

I have now managed to predict on new images. I prepared them the same way I prepared my train/test data, and I set a label for every word, which I ignore at the end and just look at the prediction.

Hi @r000bin, I did as you mentioned but I am getting the following error:

  File "/usr/local/lib/python3.7/dist-packages/numpy/lib/function_base.py", line 410, in average
    "Weights sum to zero, can't be normalized")
ZeroDivisionError: Weights sum to zero, can't be normalized

I am getting the same error when I try to predict on a new image. I found a workaround: I take the cache file of the test dataset (FUNSD) and put it in the same folder as the new image's test.txt file, then predict.

But the results I get cover only part of the tokens I pass as input; the rest of the tokens are truncated from the output. Changing max_seq_length does not help.
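One workaround that might help with the truncation (not something the repo provides; just a sketch) is to split the document's words into windows that fit within max_seq_length before writing the test files, so that no window is longer than what the model accepts:

def split_into_windows(words, boxes, tokenizer, max_seq_length=512):
    # Reserve two positions for [CLS] and [SEP]; start a new window when
    # the next word's subtokens would overflow the remaining budget.
    budget = max_seq_length - 2
    windows, current, used = [], [], 0
    for word, box in zip(words, boxes):
        n_subtokens = len(tokenizer.tokenize(word))
        if used + n_subtokens > budget and current:
            windows.append(current)
            current, used = [], 0
        current.append((word, box))
        used += n_subtokens
    if current:
        windows.append(current)
    return windows  # write each window as its own "document" in test.txt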