infiniflow / ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
https://ragflow.io
Apache License 2.0

[Question]: How to fine-tune deepdoc on additional data #569

Open q0oz opened 2 months ago

q0oz commented 2 months ago

Describe your problem

Hi there!

I've been trying to use deepdoc (mainly layout) functionality for predicting the structure of scientific PDFs. The quality of the recognition was not satisfactory, so I thought that additional training on our data might help. Is it possible to do that?

Additional questions: is it possible to see the training code for deepdoc and to know what data you used for training?

Thank you!

KevinHuSh commented 2 months ago

We used public datasets such as CDLA and PubTables to train our model. We will open-source our training code in the future.

nhha1602 commented 1 month ago

It's great to hear that your team will share the training code for deepdoc. Thank you.