impira / docquery

An easy way to extract information from documents
MIT License
1.72k stars 127 forks

DocQuery has difficulty pulling concepts from different parts of a document #24

Open Tylersuard opened 2 years ago

Tylersuard commented 2 years ago

https://www.animalearn.org/img/pdf/animalFacts.pdf

Question: Which animals are mentioned in this document? DocQuery's answer: Tiny animals! Correct answer: Cat, rat, pig, earthworm, crayfish.

ankrgyl commented 2 years ago

This is an excellent example. The current models that DocQuery comes with (both LayoutLM and Donut) are designed specifically for "short" questions whose answers are assumed to be a single consecutive run of text in the document. An answer made of several items scattered across different pages does not fit that assumption. I think to be good at a task like this, we'd need to train the models quite differently.
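To illustrate the "consecutive" assumption, here is a minimal sketch of how a span-style question-answering head typically picks its answer: it scores each token as a possible start and end, then returns the single best contiguous span. This is not DocQuery's code; `best_span`, the tokens, and the scores are all made up for illustration.

```python
from typing import List, Tuple


def best_span(start_scores: List[float],
              end_scores: List[float],
              max_len: int = 8) -> Tuple[int, int]:
    """Return (start, end) of the highest-scoring contiguous span.

    Hypothetical helper: scores each candidate span as
    start_scores[i] + end_scores[j] with i <= j < i + max_len.
    """
    best = (0, 0)
    best_score = float("-inf")
    for i, start in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = start + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best


# Made-up tokens and scores: three animals, each separated by other text.
tokens = ["Cats", "purr", ".", "Rats", "gnaw", ".", "Pigs", "root", "."]
start_scores = [2.0, 0.1, 0.0, 1.9, 0.1, 0.0, 1.8, 0.1, 0.0]
end_scores = [2.1, 0.2, 0.0, 1.5, 0.2, 0.0, 1.4, 0.2, 0.0]

i, j = best_span(start_scores, end_scores)
answer = " ".join(tokens[i:j + 1])
print(answer)
```

Even though all three animal tokens score highly, the decoder can only return one of them, or one long span that also contains the text in between. It has no way to return the list "Cats, Rats, Pigs".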

Could you share more details about the use case? I can either recommend some other models to look into (e.g. using NER to classify all of the animals mentioned in your documents) or we can keep this on the backburner as something to look at on the modeling side.
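As a rough sketch of the "collect every mention" idea, the snippet below scans each page of text for terms from a fixed vocabulary and gathers the distinct matches across the whole document. It is a plain dictionary lookup standing in for a trained NER model, not an NER implementation; `find_mentions`, the vocabulary, and the sample page text are hypothetical and assume the PDF text has already been extracted.

```python
import re
from typing import Iterable, List


def find_mentions(pages: Iterable[str], vocabulary: Iterable[str]) -> List[str]:
    """Return vocabulary terms found anywhere in the pages, in first-seen order.

    Hypothetical helper: matches whole words, case-insensitively,
    and tolerates a trailing plural "s".
    """
    vocab = {term.lower(): term for term in vocabulary}
    # Longest terms first so a longer term wins over a shorter prefix.
    alternatives = "|".join(
        re.escape(term) for term in sorted(vocab, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(" + alternatives + r")s?\b", re.IGNORECASE)

    found: List[str] = []
    for page in pages:
        for match in pattern.finditer(page):
            term = vocab[match.group(1).lower()]
            if term not in found:
                found.append(term)
    return found


# Made-up page text for illustration; not taken from the linked PDF.
pages = [
    "Cats and rats appear on page one.",
    "Page two covers the pig in a separate category.",
    "Earthworms and crayfish close the list.",
]
vocabulary = ["cat", "rat", "pig", "earthworm", "crayfish", "frog"]

animals = find_mentions(pages, vocabulary)
print(animals)
```

Unlike a single-span answer, this approach aggregates over every page, which is why a tagging or NER-style model is a better fit for "list everything" questions. A real NER model would replace the fixed vocabulary with learned entity recognition.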