Open jameshowison opened 4 years ago
Yes it's for me :)
The problem comes from the sentence segmenter (pySBD): these long paragraphs are not segmented at all. It's very likely a robustness problem: the segmenter is rule-based and fails when the input is a bit "noisy".
Of the 4 cases, 3 are caused by the text starting with a ".":
PMC4176174.json: . 3D reconstruction of the EC. (A) 3D reconstruction...
PMC2963829.json: . Here, we show that acute exposure to 10...
PMC3328383.json: . Initiation events occur throughout the subtelomere...
If we remove the leading ".", the segmentation works fine.
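As a rough illustration of that workaround, here is a minimal sketch that drops a single stray leading period before the text is passed to the segmenter. The helper name `strip_leading_period` is hypothetical, and this is not necessarily how the fix was done in the actual pipeline.

```python
import re

# Matches optional whitespace, one period, and any whitespace after it,
# but only at the very start of the text.
_LEADING_PERIOD = re.compile(r"^\s*\.\s*")


def strip_leading_period(text: str) -> str:
    """Remove one stray leading '.' (hypothetical helper, for illustration)."""
    return _LEADING_PERIOD.sub("", text, count=1)


if __name__ == "__main__":
    print(strip_leading_period(". Here, we show that acute exposure to 10..."))
```

Note that this would also strip the period from a text that legitimately starts with something like ".5 mg", so a real implementation may want a stricter pattern.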
For the remaining one (PMC3140372.json), I don't know...
I focused on the average performance of the sentence segmenter when selecting one, not on the "scale" of the error when it fails. It's interesting to see that nltk, for instance, works fine for all these noisy cases.
I am testing nltk as a fallback segmenter for when pySBD is obviously failing, as in these cases. It should avoid all these long sentences of more than 2000 characters.
This is now implemented with an NLTK fallback for sentences >1500 characters (after looking at the sentences, 2000 was too high).
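For readers who want to see the shape of such a fallback, here is a minimal sketch under stated assumptions. The function `segment_with_fallback` and the two stand-in segmenters are hypothetical names for illustration and are not taken from the actual implementation. Only the 1500-character threshold comes from the comment above.

```python
import re
from typing import Callable, List

Segmenter = Callable[[str], List[str]]

MAX_SENTENCE_CHARS = 1500  # threshold mentioned in the thread


def segment_with_fallback(
    text: str,
    primary: Segmenter,
    fallback: Segmenter,
    max_chars: int = MAX_SENTENCE_CHARS,
) -> List[str]:
    """Segment with `primary`; re-segment any overlong sentence with `fallback`."""
    sentences: List[str] = []
    for sentence in primary(text):
        if len(sentence) > max_chars:
            # The primary segmenter probably failed on this span.
            sentences.extend(fallback(sentence))
        else:
            sentences.append(sentence)
    return sentences


def unsplit_primary(text: str) -> List[str]:
    """Stand-in that simulates the failure mode: returns the text unsplit."""
    return [text]


def naive_regex_fallback(text: str) -> List[str]:
    """Stand-in fallback: split on whitespace following ., ! or ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


if __name__ == "__main__":
    noisy = " ".join(["This is a sentence."] * 100)  # 1999 characters
    result = segment_with_fallback(noisy, unsplit_primary, naive_regex_fallback)
    print(len(result))
```

In real use, `primary` could be `pysbd.Segmenter(language="en", clean=False).segment` and `fallback` could be `nltk.tokenize.sent_tokenize`. Passing the segmenters in as callables keeps the threshold logic testable without either library installed.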
Great! I'll run the chunking code later today, bet things improve :)
On Sun, Aug 30, 2020 at 4:42 PM Patrice Lopez notifications@github.com wrote:
This is implemented with NLTK fallback for sentences >1500 characters (after looking at the sentences, 2000 was too much).
Probably one for @kermitt2 :)
I found some very long text elements in the sentence level json files (like > 3000 characters).
e.g.,
file:line "quote to search"
PMC4176174.json:1502 "3D reconstruction of the EC. (A) 3D reconstruction of the yeast MtRNAP EC generated ..."
PMC3140372.json:763 " was conducted in summer 2008 for the first time. It"
PMC2963829.json:1078 "which is the concentration we use for acute treatments, has a negligible eff ect"
PMC3328383.json:1218 " events occur throughout the human 11q segment and forks move across the segment"
There are a bunch between 2000 and 3000, which also seem long, probably the same issue. It's not crucial, but they result in some really large tasks for tagworks (which could probably be dropped).
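One way to list such overlong text elements is sketched below. It makes no assumption about the schema of the sentence-level JSON files: it walks any parsed JSON value and reports every string above a length threshold. The name `iter_long_strings` and the path notation are hypothetical choices for illustration.

```python
import json
from typing import Any, Iterator, Tuple


def iter_long_strings(
    node: Any, min_chars: int = 2000, path: str = "$"
) -> Iterator[Tuple[str, int]]:
    """Yield (path, length) for every string longer than `min_chars`."""
    if isinstance(node, str):
        if len(node) > min_chars:
            yield path, len(node)
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from iter_long_strings(value, min_chars, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_long_strings(value, min_chars, f"{path}[{index}]")


if __name__ == "__main__":
    # Small in-memory document standing in for one of the JSON files.
    doc = json.loads(json.dumps({"body": [{"text": "a" * 3001}, {"text": "short"}]}))
    for location, length in iter_long_strings(doc):
        print(location, length)
```

To scan a real file, load it with `json.load` and pass the result to `iter_long_strings`. The default of 2000 characters matches the range mentioned above and can be lowered to 1500.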