isawnyu / isaw.awol

Awol blog python code

repetition still occurs in some descriptions #57

Open paregorios opened 9 years ago

paregorios commented 9 years ago

Repetition persists even after #54, e.g. Acta historica et archaeologica mediaevalia (hash: 9ce707b4d63a218fd3b6bd28cda5b85d47db4b3f).

paregorios commented 9 years ago

Most descriptions are much cleaner now, but this problem persists for some resources even after completion of #60. I think it's a function of the HTML structure of the source data, but I'm not confident I can easily and reliably distinguish those cases from well-formed ones. The repetition occurs only across groups of multiple lines or sentences, which is why the current code misses it: that code first looks for and suppresses repetition in adjacent lines and then in adjacent sentences.
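To make the failure mode concrete, here is a minimal, untested-on-real-data sketch of one way to catch repetition that spans groups of sentences rather than single adjacent ones. It is not the code in this repository: `split_sentences` and `collapse_repeated_runs` are hypothetical names, the sentence splitter is deliberately naive, and the comparison is exact string equality, so near-duplicates with differing whitespace or markup would slip through.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def collapse_repeated_runs(items):
    """Drop any run of items that immediately repeats the run before it.

    Tries run lengths 1, 2, 3, ... so that a block of several sentences
    repeated back-to-back is reduced to a single copy.
    """
    out = list(items)
    n = 1  # length of the run currently being checked
    while n <= len(out) // 2:
        i = 0
        changed = False
        while i + 2 * n <= len(out):
            if out[i:i + n] == out[i + n:i + 2 * n]:
                # The next n items duplicate the previous n: remove them.
                del out[i + n:i + 2 * n]
                changed = True
            else:
                i += 1
        # After any deletion, start again from the shortest run length.
        n = 1 if changed else n + 1
    return out


if __name__ == "__main__":
    # Made-up sample text, not taken from the AWOL data.
    sample = ("An open access journal. Published once a year. "
              "An open access journal. Published once a year. "
              "Back issues are online.")
    print(" ".join(collapse_repeated_runs(split_sentences(sample))))
```

The obvious risk is false positives: a description that legitimately repeats a block of sentences would also be collapsed, so something like this would need checking against the existing tests before being trusted.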

Unclear how to address this. Putting it back on the backlog for now.

FYI @sfsheath

paregorios commented 9 years ago

The nosetest for Il capitale culturale has been left failing to reflect this issue.