kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Questions #1180

Open flckv opened 4 days ago

flckv commented 4 days ago

Do you have some demonstration of the cases where GROBID fails with CRF and where DeLFT is better, please?

You mention in the documentation: "current GROBID cheap approach" - were you referring to the DeLFT or the CRF method, or both? Is the DeLFT GROBID method still cheaper than e.g. VILA?

"For the moment, we are also not relying on transformer approaches incorporating layout information, like LayoutML (Xu et al., 2020), LayoutLMv2 (Xu et al., 2021), SelfDoc or VILA (Shen et al., 2021), which require considerable GPU capacities, long inference runtime, and do not show at this time convincing accuracy scores as compared to the current GROBID cheap approach (reported accuracy at token level are often lower than GROBID accuracy at field level, while using less labels)."https://grobid.readthedocs.io/en/latest/Principles/#layout-tokens-not-text

"see here (11.3M PDF were processed in 6 days by 2 servers without interruption)" (https://github.com/kermitt2/grobid). Do you mean by DeLFT or by CRF?

lfoppiano commented 4 days ago

Hi @flckv!

Do you have some demonstration of the cases where GROBID fails with CRF and where DeLFT is better, please?

In general this kind of assessment was done in the past. Nowadays we just use the best models for the evaluation.

If you want to do some archeology, I think you can find something for version 0.7.3 here:

Bear in mind they are from different months, so they might be biased.

Usually the difference is more marked in Citation and Header extraction. For fulltext, we always use CRF.
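
For reference, the CRF-or-DeLFT choice is made per model in grobid-home/config/grobid.yaml. Below is a minimal sketch, in Python, that lists which engine each model is configured to use; it assumes the key layout of recent releases (a top-level `grobid` key with a `models` list), so adjust the path and keys to your installation:

```python
import yaml  # pip install pyyaml

# Assumed path and layout of the GROBID configuration file; check the
# Deep-Learning-models documentation page for your version's exact keys.
with open("grobid-home/config/grobid.yaml") as f:
    config = yaml.safe_load(f)

# Each model entry names an engine: "wapiti" (CRF) or "delft" (Deep Learning),
# the latter with a DeLFT architecture such as BidLSTM_CRF_FEATURES.
for model in config["grobid"]["models"]:
    engine = model.get("engine", "wapiti")
    arch = model.get("delft", {}).get("architecture", "") if engine == "delft" else ""
    print(f"{model['name']:20s} {engine:8s} {arch}")
```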

kermitt2 commented 4 days ago

In this text:

"For the moment, we are also not relying on transformer approaches incorporating layout information, like LayoutML (Xu et al., 2020), LayoutLMv2 (Xu et al., 2021), SelfDoc or VILA (Shen et al., 2021), which require considerable GPU capacities, long inference runtime, and do not show at this time convincing accuracy scores as compared to the current GROBID cheap approach (reported accuracy at token level are often lower than GROBID accuracy at field level, while using less labels)."https://grobid.readthedocs.io/en/latest/Principles/#layout-tokens-not-text

... the cheap approach here is the DeLFT Deep Learning models (BidLSTM_CRF_FEATURES models), in contrast to LayoutLM and VILA models. These DeLFT DL models, which are still used by default in Grobid, run very well with 4GB of VRAM (a GTX 1050, for example), processing several tens of thousands of tokens per second. They are also quite okay on CPU only (the demo on Hugging Face uses DeLFT models on CPU only). In comparison, VILA hardly manages more than 200 tokens per second on an A100.
Grobid also produces much more fine-grained structures than LayoutLM and VILA for scholarly papers. There is currently no case in Grobid where fine-tuned BERT base models work better than BidLSTM_CRF_FEATURES (except maybe the header model, but even there it's more of a tie).
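
As a concrete illustration of how lightweight this is to try, here is a minimal sketch that sends one PDF to a running GROBID server's header endpoint. The local URL and the file name paper.pdf are assumptions for the example; whichever engine is configured for the header model (CRF or DeLFT) answers the request:

```python
import requests

# Hypothetical local GROBID server on the default port 8070; adjust to your deployment.
GROBID_URL = "http://localhost:8070/api/processHeaderDocument"

with open("paper.pdf", "rb") as pdf:
    # The service expects the PDF as a multipart form field named "input"
    # and returns a TEI/XML document with the structured header fields.
    response = requests.post(GROBID_URL, files={"input": pdf}, timeout=60)

response.raise_for_status()
print(response.text[:1000])  # title, authors, affiliations, abstract, ... as TEI
```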

"see https://github.com/kermitt2/grobid/issues/443#issuecomment-505208132 (11.3M PDF were processed in 6 days by 2 servers without interruption)" (quoted from https://github.com/kermitt2/grobid). Do you mean by DeLFT or by CRF?

In this exercise (from 2019), only CRF models were used, without GPU.
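
For runs at that scale, PDFs are typically pushed through the GROBID service by a thin client issuing many parallel requests rather than one call at a time. Below is a minimal sketch using the separate grobid-client-python package; the config.json file, the n concurrency parameter, and the paths follow that client's README and may differ by version:

```python
from grobid_client.grobid_client import GrobidClient

# config.json points the client at one or more running GROBID servers
# (host, port, batch settings); see the grobid-client-python README for its format.
client = GrobidClient(config_path="./config.json")

# Process every PDF under ./pdfs with 20 concurrent requests and write the
# resulting TEI/XML files to ./tei_out.
client.process("processFulltextDocument", "./pdfs", output="./tei_out", n=20)
```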

lfoppiano commented 4 days ago

Ah indeed 😄 I corrected my answer 😅