Closed by MRuecklCC 7 months ago
Jörg from @yovisto found the Python package trafilatura, which seems to do a pretty good job of solving this issue.
It also provides other metadata, all of which we could expose via metalookup.
Resolved with https://github.com/openeduhub/text-extraction
One big problem when scraping content is that the full text is often not trivially accessible.
Instead, we only have the whole HTML document (the DOM). One approach would be to use the HTML role attribute: https://wiki.selfhtml.org/wiki/HTML/Attribute/role
We could check whether an element with the "main" content role is available and remove everything else. Alternatively, one can start removing elements that have other roles (navigation bar, footer, etc.).
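A rough sketch of the role-based idea, using only the Python standard library. This is an illustration under assumptions, not what metalookup or text-extraction actually does: the names `RoleTextExtractor` and `extract_fulltext` are made up for this example, and the lists of skipped roles and tags are my own guess at what counts as "not main content".

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not be tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Assumption: these landmark roles and tags are treated as non-content.
SKIPPED_ROLES = {"navigation", "banner", "contentinfo", "complementary", "search"}
SKIPPED_TAGS = {"head", "nav", "header", "footer", "aside",
                "script", "style", "noscript"}


class RoleTextExtractor(HTMLParser):
    """Collects text inside a main-content element separately from the rest."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []         # (tag, is_main, is_skipped) per open element
        self.main_chunks = []   # text found inside <main> / role="main"
        self.other_chunks = []  # remaining text outside skipped elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        role = (dict(attrs).get("role") or "").strip().lower()
        is_main = tag == "main" or role == "main"
        is_skipped = tag in SKIPPED_TAGS or role in SKIPPED_ROLES
        self.stack.append((tag, is_main, is_skipped))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Pop back to the matching open tag; tolerates unclosed <p>, <li>, ...
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or any(skipped for _, _, skipped in self.stack):
            return
        if any(is_main for _, is_main, _ in self.stack):
            self.main_chunks.append(text)
        else:
            self.other_chunks.append(text)


def extract_fulltext(html):
    """Prefer the main-content element; otherwise drop known non-content parts."""
    parser = RoleTextExtractor()
    parser.feed(html)
    parser.close()
    chunks = parser.main_chunks or parser.other_chunks
    return " ".join(chunks)
```

If a page has a main-content element, only its text is returned; otherwise the function falls back to everything that is not inside a navigation bar, footer, or similar element. Pages that set no roles and use only generic div elements would still need a heuristic extractor.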