huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[New Model] DocFormer: End-to-End Transformer for Document Understanding #14456

Open athewsey opened 2 years ago

athewsey commented 2 years ago

🌟 New model addition

Model description

See "DocFormer: End-to-End Transformer for Document Understanding", Appalaraju et al (ICCV 2021) on CVF and arXiv

DocFormer is a multi-modal transformer model for 2D/visual documents from Amazon (where, fair disclosure, I also currently work, but not in research). At a high level I would characterize it as targeting broadly the same use cases as LayoutLMv2 (already in transformers), but achieving better (state-of-the-art) results with smaller datasets, per the benchmarks in the paper.

I've found this kind of multi-modal, spatial/linguistic model very useful in the past (I actually released an AWS sample and blog post with Hugging Face LayoutLMv1 earlier this year) and would love for the improvements from DocFormer to be available through HF Transformers.
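For context, models in this family consume OCR'd words, their normalized bounding boxes, and the page image together. Here is a minimal sketch of that input shape using the LayoutLMv2 processor already in transformers as a stand-in; the checkpoint name and file path are illustrative, and DocFormer's final inputs may differ:

```python
from PIL import Image
from transformers import LayoutLMv2Processor

# LayoutLMv2 is the closest model already in transformers; DocFormer would
# take a similar (words, boxes, image) triple. Checkpoint name is illustrative.
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("invoice.png").convert("RGB")  # hypothetical document page
# With apply_ocr=True (the default), the processor runs pytesseract itself.
encoding = processor(image, return_tensors="pt")

print(encoding["input_ids"].shape)  # token ids for the OCR'd words
print(encoding["bbox"].shape)       # one 0-1000 normalized box per token
print(encoding["image"].shape)      # resized page image for the visual branch
```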

Open source status

shabie commented 2 years ago

Haha thank you for this issue! Tagging @uakarsh since both of us have managed to get the architecture largely down (we think!)

It would be awesome to get this integrated with some help :)

Directly inspired by the journey of @NielsRogge

uakarsh commented 2 years ago

@shabie Thanks for the tag. @athewsey, as far as the weights are concerned, I have tried implementing their MLM task (described in the repo) as well as the image reconstruction part (for the unsupervised case), and based on the performance I can say it works nearly as well as reported in the paper. So we are hoping to release it as soon as possible. I am quite excited to share the model with the community, since this is my first transformer implementation (along with @shabie), and nothing could be more exciting than that. However, there are some approximations in the model which may affect performance; we will try to get the results as close to the paper as possible. Cheers,
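For readers unfamiliar with the objective: multi-modal MLM masks a fraction of the text tokens and trains the model to reconstruct them while the visual and spatial features stay visible. A generic sketch of the masking step, assuming standard BERT-style ratios rather than the authors' exact recipe (the function name and 15% ratio are assumptions):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style masking: labels are -100 everywhere except masked positions.

    A real implementation would also exclude special tokens from masking.
    """
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~masked] = -100  # loss is only computed on masked tokens

    # Of the masked positions: 80% become [MASK], 10% a random token, 10% unchanged.
    replace = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    input_ids[replace] = mask_token_id
    randomize = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replace
    random_words = torch.randint(vocab_size, labels.shape)
    input_ids[randomize] = random_words[randomize]
    return input_ids, labels
```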

NielsRogge commented 2 years ago

Hi,

DocFormer would indeed be a great addition to the library. Note that pretrained weights are required for a model to be added.

Looking forward to this!

shabie commented 2 years ago

> Hi,
>
> DocFormer would indeed be a great addition to the library. Note that pretrained weights are required for a model to be added.
>
> Looking forward to this!

@NielsRogge Thank you for the quick reply!

It's very clear to us that weights are needed; that's the reason we hadn't created this new-model issue so far. That's not to say it wasn't a good idea, @athewsey!

So the two challenges in getting weights are compute and data.

Compute may be manageable, but the main problem right now is the OCR that has to be performed to extract words and their bounding boxes from the RVL-CDIP dataset. The thing is, pytesseract is ridiculously slow; I think pytesseract is just generally a poor implementation, given its disk-bound operations.

I didn't get the chance earlier, but I was about to ask whether you have the dataset with the OCR step already completed, and if so, whether it could be made available. That would speed things up a lot. If not, we'd first have to overcome this hurdle, which is basically where we are. We'd need some kind of distributed computation (like a Spark cluster job) to complete this task in a manageable time.
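In the meantime, a single machine can be pushed further by running pytesseract across processes. A minimal sketch, assuming a local directory of page images (the paths, glob pattern, and worker count are illustrative; this only parallelizes the work, it doesn't fix pytesseract's per-image overhead):

```python
from multiprocessing import Pool
from pathlib import Path

import pytesseract
from PIL import Image

def ocr_page(path):
    """Return (path, words, boxes) for one page image."""
    data = pytesseract.image_to_data(Image.open(path),
                                     output_type=pytesseract.Output.DICT)
    words, boxes = [], []
    for text, left, top, w, h in zip(data["text"], data["left"], data["top"],
                                     data["width"], data["height"]):
        if text.strip():
            words.append(text)
            boxes.append((left, top, left + w, top + h))
    return str(path), words, boxes

if __name__ == "__main__":
    pages = sorted(Path("rvl-cdip/images").glob("**/*.tif"))  # illustrative layout
    with Pool(processes=8) as pool:  # tune to your core count
        for path, words, boxes in pool.imap_unordered(ocr_page, pages, chunksize=16):
            ...  # persist words/boxes, e.g. one JSON file per page
```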

uakarsh commented 2 years ago

As an update, the authors will be sharing the Textract OCR for the RVL-CDIP dataset, and as soon as they release it, we will try to reach the benchmark performance mentioned in the paper. However, we are also trying, from our end, to build our own OCR step and then perform pre-training and fine-tuning.

pzdkn commented 2 years ago

Any updates on this?

uakarsh commented 2 years ago

I have completed the scripts for pre-training on MLM and for using DocFormer for document image classification. Check them out here: DocFormer Examples with PyTorch Lightning.
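For anyone following along, the overall shape of such a pre-training script is roughly as follows: a minimal PyTorch Lightning sketch, assuming a DocFormer-for-MLM-style module and a dataloader that yields already-masked batches (both names are placeholders for whatever the linked examples actually define):

```python
import pytorch_lightning as pl
import torch

class DocFormerMLMPretrainer(pl.LightningModule):
    def __init__(self, model, lr=5e-5):
        super().__init__()
        self.model = model  # placeholder for a DocFormer-for-MLM module
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # batch carries token ids, normalized boxes, the page image, and MLM labels
        outputs = self.model(input_ids=batch["input_ids"],
                             bbox=batch["bbox"],
                             image=batch["image"],
                             labels=batch["labels"])
        self.log("train_loss", outputs.loss)
        return outputs.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

# Usage sketch:
# trainer = pl.Trainer(max_epochs=5, accelerator="gpu", devices=1)
# trainer.fit(DocFormerMLMPretrainer(model), train_dataloader)
```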

WaterKnight1998 commented 1 year ago

Any updates on this? It would be very useful, @uakarsh @shabie @athewsey @NielsRogge. LayoutLMv3 is cool, but its license doesn't allow commercial usage.

uakarsh commented 1 year ago

Hi @WaterKnight1998, we have been able to train the model; you can find it here.

The list of things done till now:

Due to limited resources, I have so far only been able to complete the first two points, and I have tried to show a demo of the same here. If @NielsRogge suggests, we can indeed integrate it with Hugging Face, since it would be easy to do so.

Thanks,

WaterKnight1998 commented 1 year ago

@uakarsh I can help if needed. Can this model be used for token classification?

uakarsh commented 1 year ago

Sure, with some modifications to the document image classification script and the pre-processing, we would definitely be able to use it for token classification.
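Concretely, token classification just swaps the pooled-sequence classifier for a per-token head. A minimal sketch, assuming a DocFormer-style backbone that returns one hidden state per text token (the class name, arguments, and encoder interface are placeholders, not the actual repo API):

```python
import torch
import torch.nn as nn

class DocFormerForTokenClassification(nn.Module):
    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        self.encoder = encoder  # placeholder for the DocFormer backbone
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, bbox, image, labels=None):
        # assumed to return per-token hidden states: (batch, seq_len, hidden_size)
        hidden = self.encoder(input_ids=input_ids, bbox=bbox, image=image)
        logits = self.classifier(self.dropout(hidden))  # (batch, seq_len, num_labels)
        loss = None
        if labels is not None:
            # -100 marks padding / sub-word positions to ignore, per HF convention
            loss = nn.CrossEntropyLoss(ignore_index=-100)(
                logits.view(-1, logits.size(-1)), labels.view(-1))
        return loss, logits
```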

vprecup commented 1 year ago

Hello there, @uakarsh. Has this initiative of integrating DocFormer into Transformers been discontinued in the meantime?

uakarsh commented 1 year ago

Hi @vprecup, thanks for your comment; it makes me really happy that you are interested in integrating DocFormer into Hugging Face. However, the problem is that, as a student, I don't have enough compute to pre-train the model. As mentioned in the paper, the authors used 5M documents (pg. 6, above section 4.2) and have not specified the data. I believe the current IDL dataset would be sufficient as a pre-training dataset, and we have a demo notebook for pre-training.

So, maybe if somebody can do that, I can help them.

By the way, one interesting thing: in the DocFormer paper, on pg. 7, Table 6, the authors get an F1 score of 4.18 on FUNSD without pre-training (100 epochs), while in our notebook we get 13.29 (a roughly 3x improvement at 100 epochs), and it overfits, so the implementation may be good to go for your use case.

Thanks, Akarsh

mbertani commented 1 year ago

Hi @uakarsh, if we could get you some compute power, would you like to give it a go?

It seems I can borrow a Z8 Fury workstation from HP, equipped with up to four of the latest NVIDIA RTX 6000 Ada generation GPUs, each boasting 48GB of VRAM. Additionally, it features Intel's most powerful CPU, potentially with up to 56 cores, and the option to be fully loaded with 2TB of RAM.

Creating the weights for the DocFormer should be a good use of this machine. What is your time availability?

uakarsh commented 1 year ago

Hi @mbertani, sorry for the late reply. If possible, I would surely like to give it a go. As for my experience with GPUs, I have worked on a DGX workstation, and I believe the configuration you mentioned would work fine.

By time availability, do you mean having a meeting to discuss the plan further?

In the meantime, I will work on arranging the code required for pre-training and on a plan for how to proceed. I do have some experience with pre-training (I pre-trained LayoutLMv3 and some related models for a use case), so I can plan things and test them.

mbertani commented 1 year ago

OK, good, then we can set up a meeting to discuss how we proceed. So as not to share emails on a public forum, I can share my LinkedIn profile with you and we can take it from there?

https://www.linkedin.com/in/marcobertaniokland/

uakarsh commented 1 year ago

Sure

ThorJonsson commented 11 months ago

Any update on this?