Open paregorios opened 9 years ago
Most descriptions are much cleaner now, but this particular problem still persists for some resources even after completion of #60. I think it's a function of the HTML structure of the source data, but I'm not confident I can distinguish those cases from the good ones easily and reliably. The repetition occurs (only) across groups of multiple lines/sentences. That is why the current code does not catch it: it first looks for and suppresses repetition in adjacent lines, and then in adjacent sentences.
Unclear how to address this. Putting it back on the backlog for now.
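For the record, one possible direction, as a rough sketch only: extend the adjacent-duplicate check from single lines/sentences to runs of several consecutive ones. The function name `collapse_repeated_groups` and its `max_group` parameter are hypothetical and not part of the existing code, and the sketch assumes the description has already been split into a list of lines or sentences.

```python
def collapse_repeated_groups(units, max_group=None):
    """Drop immediate repeats of runs of units (lines or sentences).

    Hypothetical helper, not part of the existing code base. `units` is
    assumed to be a list of already-split lines or sentences.
    """
    units = list(units)
    # Largest run that could repeat back-to-back is half the input.
    limit = max_group or len(units) // 2
    # Try long runs first; size 1 is the plain adjacent-duplicate case.
    for size in range(limit, 0, -1):
        i = 0
        while i + 2 * size <= len(units):
            if units[i:i + size] == units[i + size:i + 2 * size]:
                # Remove the second copy and re-check the same position,
                # so triple (or longer) repeats collapse as well.
                del units[i + size:i + 2 * size]
            else:
                i += 1
    return units


if __name__ == "__main__":
    sample = ["First sentence.", "Second sentence.",
              "First sentence.", "Second sentence.",
              "Third sentence."]
    print(collapse_repeated_groups(sample))
```

This would also collapse legitimately repeated passages, so it does not solve the problem of telling bad repetition from good content; it only covers the multi-line case that the adjacent-line and adjacent-sentence passes miss.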
FYI @sfsheath
The nosetest for Il capitale culturale has been left failing on purpose to reflect this issue.
The problem persists even after #54, e.g. Acta historica et archaeologica mediaevalia (hash: 9ce707b4d63a218fd3b6bd28cda5b85d47db4b3f)