Closed de-code closed 4 years ago
Thank you @de-code for the error cases, which are very useful!
The remaining line numbers not filtered by pdfalto don't appear to affect Grobid. When running these PDFs, all the titles are correct because the lone remaining line number is neutralized by the header model. One exception is the third example, in particular its abstract, where some lower line numbers still appear in the text (it's strange that the whole line number column is not filtered out by pdfalto given how it works now, but it might be related to other problems that impact the current mechanism).
This issue would be rather for pdfalto, now that line numbers are entirely tackled by pdfalto, not by Grobid.
If you find more, don't hesitate to share them; that will be very helpful to drive the next work iteration on pdfalto!
> This issue would be rather for pdfalto, now that line numbers are entirely tackled by pdfalto, not by Grobid.
I did indeed intend to create the issue against pdfalto but didn't pay enough attention. I moved it over to https://github.com/kermitt2/pdfalto/issues/101
Hi @kermitt2
I have now merged with upstream master and during evaluation I found some error cases where the line numbers are not filtered out.
I can confirm that the line numbers are removed for the example that @lfoppiano was using: https://doi.org/10.1101/2020.04.21.054221 (i.e. it looks like I am doing at least something right).
Here are some examples where it doesn't seem to work. It appears that the first line number (1) is not removed, but subsequent line numbers appear to be removed (I currently don't have a way to visualise the lxml to confirm that more easily). Thus the title is usually affected more.
Example 1
https://www.biorxiv.org/content/10.1101/210401v1?versioned=true
Example 2
https://doi.org/10.1101/440115
Example 3
https://doi.org/10.1101/434563