microsoft / table-transformer

Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.
MIT License

How do I generate a dataframe after identifying the table structure? #112

Open hyshandler opened 1 year ago

hyshandler commented 1 year ago

I'm trying to generate a dataframe from a table in an image. After running the following code, I was able to return the original image with (mostly) correct grid lines drawn on the table. How can I turn the coordinates of those grid lines and boxes into a pandas dataframe or CSV? Thanks in advance!

model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-structure-recognition")
feature_extractor = DetrFeatureExtractor()
encoding = feature_extractor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoding)
target_sizes = [image.size[::-1]]
results = feature_extractor.post_process_object_detection(outputs, threshold=0.8, target_sizes=target_sizes)[0]

WalidHadri-Iron commented 1 year ago

@hyshandler I guess either you use the implementation in this repo directly, because you have all the post-processing steps available until the construction of the dataframe, or you combine the output of that feature_extractor from huggingface with the post-processing function that you will find in this repo. Unfortunately, the post-processing steps to build the dataframe do not come with the huggingface transformers.

hyshandler commented 1 year ago

Thanks @WalidHadri-Iron for the response. Can you point me to the specific functions to construct that process? I'm having trouble finding the right ones in this repo. Thanks!

WalidHadri-Iron commented 1 year ago

@hyshandler the post-processing code is here https://github.com/microsoft/table-transformer/blob/main/src/postprocess.py and the steps are grouped here https://github.com/microsoft/table-transformer/blob/main/src/inference.py .

What you have in "results" is a set of bounding boxes with labels and scores; you need to post-process those bounding boxes based on their score, their label, and the position of the text in the image.

The function objects_to_structures groups almost the whole post-processing: https://github.com/microsoft/table-transformer/blob/235ad51dbef25d4165e6e1adff23453cc2ee490a/src/inference.py#L295

Then you have three functions to get the output format you want: https://github.com/microsoft/table-transformer/blob/235ad51dbef25d4165e6e1adff23453cc2ee490a/src/inference.py#L359 https://github.com/microsoft/table-transformer/blob/235ad51dbef25d4165e6e1adff23453cc2ee490a/src/inference.py#L540 https://github.com/microsoft/table-transformer/blob/235ad51dbef25d4165e6e1adff23453cc2ee490a/src/inference.py#L512
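
To make that concrete, here is a rough sketch (my own, so double-check function names and thresholds against src/inference.py) of feeding the Hugging Face "results" into those functions, assuming an empty token list and illustrative per-class thresholds:

# Convert the Hugging Face detections into the list-of-dicts "objects" format
# that the functions in src/inference.py work with (labels must be class name strings)
objects = []
for label, score, bbox in zip(results["labels"], results["scores"], results["boxes"]):
    objects.append({
        "label": model.config.id2label[label.item()],
        "score": score.item(),
        "bbox": bbox.tolist(),  # [xmin, ymin, xmax, ymax] in image pixels
    })

tokens = []  # word-level text boxes from OCR; empty if you only need the grid
class_thresholds = {
    "table": 0.5, "table column": 0.5, "table row": 0.5,
    "table column header": 0.5, "table projected row header": 0.5,
    "table spanning cell": 0.5,
}

structures = objects_to_structures(objects, tokens, class_thresholds)
cells = structure_to_cells(structures[0], tokens)[0]  # structure_to_cells also returns a confidence score
csv_text = cells_to_csv(cells)  # or cells_to_html(cells) for HTML output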

Ashwani-Dangwal commented 1 year ago

Can the results from the Hugging Face model be passed into structure_to_cells? If yes, how can I find the tokens that need to be passed as parameters, and how can I get them into the desired data structure?

amish1706 commented 1 year ago

Can the results from the Hugging Face model be passed into structure_to_cells? If yes, how can I find the tokens that need to be passed as parameters, and how can I get them into the desired data structure?

You can keep the tokens argument as None if you don't have the text data of the cells. It works without tokens too.

Ashwani-Dangwal commented 1 year ago

@amish1706 I did keep the tokens as None, but it threw an error that tokens cannot be None. What is the workaround for that? Also, if I don't pass in the tokens, how can the text be extracted on its own? I can't see any OCR in the code that would do the text extraction.

amish1706 commented 1 year ago
import torch
from transformers import TableTransformerForObjectDetection
from transformers import DetrImageProcessor
from inference import *  # src/inference.py from this repo
from PIL import Image
import matplotlib.pyplot as plt

# Load the structure recognition model and its image processor
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-structure-recognition")
feature_extractor = DetrImageProcessor()

img_path = "path/to/table_image.png"  # replace with your cropped table image
image = Image.open(img_path).convert("RGB")
encoding = feature_extractor(image, return_tensors="pt")
tokens = []  # no OCR text available, so pass an empty token list

with torch.no_grad():
    outputs = model(**encoding)
# inference.py expects the key 'pred_logits'; the Hugging Face output calls it 'logits'
outputs['pred_logits'] = outputs['logits']

class_thresholds = {model.config.id2label[i]: 0.5 for i in range(6)}

objects = outputs_to_objects(outputs, image.size, model.config.id2label)
crops = objects_to_crops(image, tokens=tokens, objects=objects, class_thresholds=class_thresholds)
structures = objects_to_structures(objects, tokens, class_thresholds=class_thresholds)[0]
cells = structure_to_cells(structures, tokens)
visualize_cells(image, cells[0], f"outputs/{img_path.split('/')[-1]}")

Change in original inference code: [screenshot]
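
As a follow-up to the thread's original question, one way to turn those cells into a dataframe / CSV file, assuming cells_to_csv in src/inference.py returns a CSV string (a sketch, not the repo's documented path):

import io
import pandas as pd

# structure_to_cells returns (cells, confidence), so cells[0] is the actual cell list
csv_text = cells_to_csv(cells[0])
df = pd.read_csv(io.StringIO(csv_text))
df.to_csv(f"outputs/{img_path.split('/')[-1]}.csv", index=False)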

Ashwani-Dangwal commented 1 year ago

@amish1706 Maybe I was not able to make my question clear. I wanted to know how I can convert the table in an image into a dataframe / CSV output file without passing tokens, because tokens are used everywhere in the code and it's not feasible to change the entire codebase accordingly.

I managed to get the text and its bounding boxes using pytesseract, and I am now able to get the results into a CSV file, but sometimes the entries from two rows are concatenated into a single row entry spanning 2 or 3 columns. I manually looked up the bounding boxes of that text and there was a significant difference in the location of the text, so I am not able to understand why this happened.

If there is a way to do the entire work of creating the CSV file without passing the tokens, that would be very helpful. Otherwise, I wanted to know a workaround for the concatenation of entries described above.

amish1706 commented 1 year ago

I managed to get the text and its bounding boxes using pytesseract, and I am now able to get the results into a CSV file, but sometimes the entries from two rows are concatenated into a single row entry spanning 2 or 3 columns. I manually looked up the bounding boxes of that text and there was a significant difference in the location of the text, so I am not able to understand why this happened.

From what I understand, and correct me if I'm wrong, you are saying that you have the bounding boxes of the text and the OCR results from pytesseract, and you now want to use them with the inference code of the table-transformer.

I have only used it to get the table cells and the structure, so I might not be able to help properly. [screenshot of the relevant code]

I found this inside the code; you'll have to provide the tokens file in this format. Also check the PubTables-1M dataset that was used in the paper: look for samples and convert your results into that format.

Ashwani-Dangwal commented 1 year ago

I managed to get the text and its bounding boxes using pytesseract, and I am now able to get the results into a CSV file, but sometimes the entries from two rows are concatenated into a single row entry spanning 2 or 3 columns. I manually looked up the bounding boxes of that text and there was a significant difference in the location of the text, so I am not able to understand why this happened.

From what I understand, and correct me if I'm wrong, you are saying that you have the bounding boxes of the text and the OCR results from pytesseract, and you now want to use them with the inference code of the table-transformer.

I have only used it to get the table cells and the structure, so I might not be able to help properly. [screenshot of the relevant code]

I found this inside the code; you'll have to provide the tokens file in this format. Also check the PubTables-1M dataset that was used in the paper: look for samples and convert your results into that format.

I did manage to pass in the tokens, too, in the sorted format mentioned by the author. I am sending you some pictures and their output, as I cannot share them here, so you can understand it more clearly.

lionely commented 1 year ago

@Ashwani-Dangwal Hi, I'm trying to generate the tokens list in the format described in the Inference.md doc.

There it says we only need the bounding box and text, not the span. Were you able to figure this out? I generated tokens in the suggested format, but I end up with empty cells.

Ashwani-Dangwal commented 1 year ago

@lionely Other than the bounding box and the text, you also need span_num, block_num and line_num. If you have scanned images, you can use pytesseract's image_to_data function to get all the values; if you have a document-format file, you can simply use PyMuPDF to get the details. Hope this helps.

lionely commented 1 year ago

@Ashwani-Dangwal Thank you so much for your reply. I see how to get the block_num, line_num using pytesseract. But which one would be the span_num? Thanks again!

Ashwani-Dangwal commented 1 year ago

@lionely span_num would be the word_num in pytesseract. You also need to put the bounding-box coordinates into xmin, ymin, xmax, ymax format, since pytesseract gives bounding boxes in x, y, w, h format.
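
Putting that together, a rough sketch of building the tokens list from pytesseract's image_to_data output (the helper build_tokens is hypothetical, and the dict keys bbox / text / span_num / block_num / line_num follow the format discussed in this thread; double-check against the repo's inference docs):

import pytesseract
from PIL import Image

def build_tokens(image):
    # Word-level OCR results: text, block/line/word numbers and x, y, w, h boxes
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    tokens = []
    for i in range(len(data["text"])):
        text = data["text"][i].strip()
        if not text:
            continue  # skip empty OCR entries
        x, y, w, h = data["left"][i], data["top"][i], data["width"][i], data["height"][i]
        tokens.append({
            "bbox": [x, y, x + w, y + h],     # convert x, y, w, h -> xmin, ymin, xmax, ymax
            "text": text,
            "span_num": data["word_num"][i],  # word_num stands in for span_num
            "block_num": data["block_num"][i],
            "line_num": data["line_num"][i],
        })
    # sorted reading order, as mentioned earlier in the thread
    tokens.sort(key=lambda t: (t["block_num"], t["line_num"], t["span_num"]))
    return tokens

tokens = build_tokens(Image.open("path/to/table_image.png").convert("RGB"))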

mahmoudshaddad commented 7 months ago

I am also trying to use this, but what is class_thresholds? I have this:

results_list = []
for i in range(len(results['scores'])):
    label = results['labels'][i].item()
    score = results['scores'][i].item()
    bbox = results['boxes'][i].tolist()
    result_dict = {'label': label, 'score': score, 'bbox': bbox}
    results_list.append(result_dict)

table_structures = objects_to_structures(results_list, results['boxes'], class_thresholds)
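
For reference, class_thresholds in the snippets above is just a dict mapping each predicted class name to a minimum detection score (boxes below it are dropped); note also that the second argument to objects_to_structures is the tokens list, not results['boxes']. A sketch mirroring what was used earlier in the thread (class names are illustrative of the structure model's id2label):

# Minimum detection score per structure class; the earlier snippets used 0.5 for every class
class_thresholds = {model.config.id2label[i]: 0.5 for i in range(6)}

# or spelled out explicitly
class_thresholds = {
    "table": 0.5,
    "table column": 0.5,
    "table row": 0.5,
    "table column header": 0.5,
    "table projected row header": 0.5,
    "table spanning cell": 0.5,
}

tokens = []  # OCR word boxes, or an empty list if none
table_structures = objects_to_structures(results_list, tokens, class_thresholds)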