Closed lfoppiano closed 2 months ago
After a few iterations over it, I think I understand the principle, which is to separate blocks of affiliations that have different offset differences. My fix just avoids adding \n at the beginning. The \n helps to separate the blocks and, with the DL models, to process the blocks in parallel, among other things.
@kermitt2 I've tried to fix this a bit in a rush, at least to mitigate the issue on the docker image. I'm sorry, I might need a quick review on your side.
I've pushed this fix on the branch 0.8.1-fixes (which is a branch from the tag 0.8.1) and I've pushed an updated docker image lfoppiano/grobid:0.8.1-full, which should at least mitigate this issue. It's deployed here.
Hi @lfoppiano, the fix works fine, no problem. It is surprising that the starting "\n" has such an effect on the DL processing. There's nothing else to change; the segmentation then proceeds normally, including parallel processing. I changed this part last December and it seems I only tested with the CRF model :) Unfortunately the end-to-end benchmarks do not cover affiliations. The docker image and the huggingface demo are also updated for the grobid account.
Thanks!
This PR proposes a fix for affiliations that are lost when they are processed with a DL model.
The issue seems to be in the method getAffiliationBlocksFromSegments(), where new \n are added (in general they should be added only if there is a misalignment; however, one is always added at the beginning). https://github.com/kermitt2/grobid/blob/a95d2533f1019e900b49ea5c39a5afe355dbb4a3/grobid-core/src/main/java/org/grobid/core/engines/AffiliationAddressParser.java#L81
I patched this quickly by checking that end is not zero. However, this \n does not work well with the DL models, in contrast to the CRF models, which ignore it. I've left two tests which show the problem for both CRF and DL: https://github.com/kermitt2/grobid/blob/bd93a61f4542f218299e2c34a82c37b75bc727ef/grobid-core/src/test/java/org/grobid/core/engines/AffiliationAddressParserTest.java#L262
The DL test is still failing, as I'm not really sure where to fix the issue.
After this is fixed, we would need to rebuild the grobid-full image.
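To make the idea of the patch easier to review, here is a minimal, self-contained sketch of the guard described above. It is not the actual code of getAffiliationBlocksFromSegments(): the class AffiliationBlockSketch, the Token type, and the toBlocks() helper are hypothetical stand-ins, and the way misalignment is detected (a token offset that differs from the end of the previous token) is an assumption made for illustration. The only point it shows is that a \n separator is emitted between misaligned blocks but never before the first token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and offset logic are assumptions, not GROBID's real code.
class AffiliationBlockSketch {

    /** Hypothetical stand-in for a layout token: its text and its character offset. */
    public static final class Token {
        final String text;
        final int offset;

        public Token(String text, int offset) {
            this.text = text;
            this.offset = offset;
        }
    }

    /**
     * Flattens tokens into a list of strings, inserting "\n" wherever a token
     * does not start where the previous one ended (assumed misalignment rule).
     */
    public static List<String> toBlocks(List<Token> tokens) {
        List<String> blocks = new ArrayList<>();
        int end = 0; // offset just past the previous token; 0 means no token seen yet
        for (Token token : tokens) {
            boolean misaligned = token.offset != end;
            // The guard: skip the separator while end is still zero,
            // so the output never starts with "\n".
            if (misaligned && end != 0) {
                blocks.add("\n");
            }
            blocks.add(token.text);
            end = token.offset + token.text.length();
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
                new Token("MIT", 10),   // first token, not at offset 0
                new Token("USA", 13),   // contiguous with the previous token
                new Token("CNRS", 40)); // gap, so a new block starts here
        for (String block : toBlocks(tokens)) {
            // Print the separator in escaped form so it stays visible.
            System.out.println(block.equals("\n") ? "\\n" : block);
        }
    }
}
```

Without the end != 0 check, the first token in this example would be preceded by a \n because its offset is not zero, which is the leading separator that the DL models seem not to tolerate.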