kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Fix affiliation missing when using DL affiliation-address model #1166

Closed lfoppiano closed 2 months ago

lfoppiano commented 2 months ago

This PR proposes a fix for affiliations that are lost when they are processed with a DL model.

The issue seems to be in the method getAffiliationBlocksFromSegments(), where new \n tokens are added. In general they should be added only when there is a misalignment, but currently one is always added at the beginning.

https://github.com/kermitt2/grobid/blob/a95d2533f1019e900b49ea5c39a5afe355dbb4a3/grobid-core/src/main/java/org/grobid/core/engines/AffiliationAddressParser.java#L81

I patched it quickly by checking that end is not zero. However, this \n does not work well with the DL models, in contrast to the CRF models, which ignore it.
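To make the guard above concrete, here is a minimal sketch of the idea. It is not the actual GROBID code: the `Token` class, the `GAP_THRESHOLD` value and the `toBlocks` method are hypothetical simplifications used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: Token, GAP_THRESHOLD and toBlocks are hypothetical
// names, not the actual GROBID classes or methods.
class AffiliationBlocksSketch {

    // Simplified stand-in for a layout token: its text and character offset.
    static final class Token {
        final String text;
        final int offset;
        Token(String text, int offset) { this.text = text; this.offset = offset; }
    }

    // Assumed threshold: an offset jump larger than this starts a new block.
    static final int GAP_THRESHOLD = 2;

    static List<String> toBlocks(List<List<Token>> segments) {
        List<String> blocks = new ArrayList<>();
        int end = 0; // offset just past the last token emitted; 0 means none yet
        for (List<Token> segment : segments) {
            if (segment == null || segment.isEmpty()) {
                continue;
            }
            int start = segment.get(0).offset;
            // Guard: insert a separator only after at least one token has been
            // emitted (end > 0), so the output never starts with "\n".
            if (end > 0 && start - end > GAP_THRESHOLD) {
                blocks.add("\n");
            }
            for (Token t : segment) {
                blocks.add(t.text);
                end = t.offset + t.text.length();
            }
        }
        return blocks;
    }
}
```

Without the `end > 0` condition, a first segment that starts at a non-zero offset would trigger the gap check against the initial `end = 0` and emit a leading "\n".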

I've left two tests that show the problem with both the CRF and DL models: https://github.com/kermitt2/grobid/blob/bd93a61f4542f218299e2c34a82c37b75bc727ef/grobid-core/src/test/java/org/grobid/core/engines/AffiliationAddressParserTest.java#L262

The DL test is still failing, as I'm not really sure where to fix the issue.

After this is fixed, we will need to rebuild the grobid-full image.

lfoppiano commented 2 months ago

After a few iterations on it, I think I understand the principle, which is to separate blocks of affiliations based on the offset difference between them. My fix just avoids adding \n at the beginning. The \n helps to separate the blocks and, with the DL models, to process the blocks in parallel, among other things.
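As a rough illustration of how a "\n" separator can delimit blocks for independent processing, the sketch below groups a flat token list on the separator. The `splitOnSeparator` helper is hypothetical and deliberately naive; it is not how GROBID or its DL backend actually consumes the blocks.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: splitOnSeparator is a hypothetical helper,
// not part of GROBID.
class BlockSplitSketch {

    // Groups tokens into blocks, starting a new block at every "\n" token.
    static List<List<String>> splitOnSeparator(List<String> tokens) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String tok : tokens) {
            if ("\n".equals(tok)) {
                // Naive behaviour: close the current group even if it is empty.
                groups.add(current);
                current = new ArrayList<>();
            } else {
                current.add(tok);
            }
        }
        groups.add(current);
        return groups;
    }
}
```

With this naive split, a leading "\n" produces an empty first group. Whether something similar is what disturbs the DL model is not established in this thread.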

lfoppiano commented 2 months ago

@kermitt2 I've tried to fix this a bit in a rush, at least to mitigate the issue in the docker image. Sorry, I might need a quick review from you.

I've pushed this fix to the branch 0.8.1-fixes (branched from the tag 0.8.1) and I've pushed an updated docker image, lfoppiano/grobid:0.8.1-full, which should at least mitigate this issue. It's deployed here.

kermitt2 commented 2 months ago

Hi @lfoppiano, the fix works fine, no problem. It is surprising that the starting "\n" has such an effect on the DL processing. There's nothing else to change; the segmentation then proceeds normally, including parallel processing. I changed this part last December and it seems I only tested with the CRF model :) Unfortunately, the end-to-end benchmarks do not cover affiliations. The docker image and the Hugging Face demo have also been updated for the grobid account.

lfoppiano commented 2 months ago

Thanks!