microsoft / table-transformer

Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.
MIT License
2.2k stars, 247 forks

Colab Notebook TSR: functional analysis and obtain final dataframe #84

Open emigomez opened 1 year ago

emigomez commented 1 year ago

Hi!

I have been working with the TD and TSR notebooks at https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Table%20Transformer, and they work properly for me. However, the last step of the TSR pipeline, obtaining a data frame, is not implemented in these notebooks (I think this process is called functional analysis in this repo). The TSR postprocessing steps go from the recognized structure to grid cells.
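For context, the structure-to-grid step can be sketched roughly as below. This is only a minimal illustration, not the code from this repo or from the notebooks: it assumes the detected rows and columns are available as `[x0, y0, x1, y1]` boxes, and the function name `grid_cells` is hypothetical.

```python
def grid_cells(rows, columns):
    """Intersect row and column boxes ([x0, y0, x1, y1]) into grid cells."""
    # Sort rows top to bottom and columns left to right.
    rows = sorted(rows, key=lambda b: b[1])
    columns = sorted(columns, key=lambda b: b[0])
    cells = []
    for r, row in enumerate(rows):
        for c, col in enumerate(columns):
            # A cell takes its x extent from the column and its y extent from the row.
            cells.append({"row": r, "column": c,
                          "bbox": [col[0], row[1], col[2], row[3]]})
    return cells


# Toy input: two rows and two columns of a 100 x 40 table (made-up coordinates).
rows = [[0, 0, 100, 20], [0, 20, 100, 40]]
columns = [[0, 0, 50, 40], [50, 0, 100, 40]]
cells = grid_cells(rows, columns)
```

Each entry in `cells` carries its row and column index, so the text found inside each box can later be placed in a table.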

Has anyone managed to obtain the final data frame for TSR in Colab, taking into account spanning cells and titles?

Regards

JaMe76 commented 1 year ago

Check this Space on Hugging Face, where you can find a clean implementation of the steps you were missing. The main part can be found in app.py.

emigomez commented 1 year ago

Thank you for your response @JaMe76 !!

I have worked with this postprocessing script before, but I think some parts are missing. From the results I obtained using the notebooks and the functions in this app.py, I believe that the final data frame produced by app.py doesn't take into account the TSR detections labeled 'spanning cell'. I show one example below.

TSR results: [5 images]

app.py postprocessing result: [image]

As this example shows, because the postprocessing does not take the 'spanning cell' detections into account, the resulting data frame comes out wrong.

Please let me know if I'm doing something wrong with this app.py postprocessing, or whether you see the same problem.
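To make the missing step concrete, here is a minimal sketch of one way spanning cells could be folded into the grid before building the data frame. It is not taken from app.py or from this repo. The helper names, the 0.5 overlap threshold, and the choice to repeat the merged text in every covered cell are all assumptions made for illustration.

```python
def overlap_fraction(cell, span):
    """Fraction of the cell's area that lies inside the spanning box."""
    x0, y0 = max(cell[0], span[0]), max(cell[1], span[1])
    x1, y1 = min(cell[2], span[2]), min(cell[3], span[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area = (cell[2] - cell[0]) * (cell[3] - cell[1])
    return inter / area if area > 0 else 0.0


def apply_spanning_cells(cells, texts, spanning_boxes, n_rows, n_cols, threshold=0.5):
    """Build a text grid in which each spanning cell's text is shared by the cells it covers.

    cells: list of {"row", "column", "bbox"}; texts: dict mapping (row, column) to str.
    """
    grid = [[texts.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]
    for span in spanning_boxes:
        covered = [(cell["row"], cell["column"]) for cell in cells
                   if overlap_fraction(cell["bbox"], span) >= threshold]
        # Join the non-empty text fragments of the covered cells in reading order.
        merged = " ".join(t for t in (grid[r][c] for r, c in sorted(covered)) if t)
        for r, c in covered:
            grid[r][c] = merged
    return grid


# Toy 2 x 2 table whose first row is a single header spanning both columns.
cells = [
    {"row": 0, "column": 0, "bbox": [0, 0, 50, 20]},
    {"row": 0, "column": 1, "bbox": [50, 0, 100, 20]},
    {"row": 1, "column": 0, "bbox": [0, 20, 50, 40]},
    {"row": 1, "column": 1, "bbox": [50, 20, 100, 40]},
]
texts = {(0, 0): "Results", (1, 0): "a", (1, 1): "b"}  # hypothetical OCR output
grid = apply_spanning_cells(cells, texts, [[0, 0, 100, 20]], n_rows=2, n_cols=2)
```

The resulting list of rows can then be passed to `pandas.DataFrame` to get the final data frame. Repeating the merged text across the covered cells is only one convention; leaving all but the first covered cell empty would be another.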