openeduhub / metalookup

Provide metadata about domains w.r.t. accessibility, licencing, ads, etc.
GNU General Public License v3.0

Fulltext generation from HTML Content #146

Closed MRuecklCC closed 7 months ago

MRuecklCC commented 2 years ago

One big problem when scraping content is that the fulltext is often not trivially accessible.

Instead, we only have the whole HTML document (the DOM). One way to narrow it down would be to use the HTML role attributes: https://wiki.selfhtml.org/wiki/HTML/Attribute/role

One approach would be to check whether an element with the "main" content role is available and remove everything else. Alternatively, one can remove elements that have other roles (navigation bar, footer, etc.).
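A minimal sketch of both ideas, using only the Python standard library. It is an illustration, not the metalookup implementation: the names `extract_fulltext`, `_TextCollector`, `DROP_TAGS` and `DROP_ROLES` are made up for this example, and the set of tags and roles treated as boilerplate is an assumption. It first keeps only text inside a `<main>` element or an element with `role="main"`; if no such text is found, it falls back to dropping navigation, banner, footer and similar elements. It assumes reasonably well-formed HTML, since unclosed tags such as a bare `<p>` would throw off the depth counters.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not affect the depth counters.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Assumed boilerplate containers (illustrative choice, not exhaustive).
DROP_TAGS = {"nav", "footer", "aside", "script", "style"}
DROP_ROLES = {"navigation", "banner", "contentinfo", "complementary"}


class _TextCollector(HTMLParser):
    """Collects text, optionally restricted to the main landmark."""

    def __init__(self, only_main):
        super().__init__()
        self.only_main = only_main
        self.main_depth = 0  # > 0 while inside <main> or role="main"
        self.drop_depth = 0  # > 0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        role = dict(attrs).get("role")
        if self.drop_depth:
            self.drop_depth += 1
        elif tag in DROP_TAGS or role in DROP_ROLES:
            self.drop_depth = 1
        if self.main_depth:
            self.main_depth += 1
        elif tag == "main" or role == "main":
            self.main_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.drop_depth:
            self.drop_depth -= 1
        if self.main_depth:
            self.main_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if not text or self.drop_depth:
            return
        if self.only_main and not self.main_depth:
            return
        self.chunks.append(text)


def extract_fulltext(html):
    """Prefer the main landmark; otherwise strip boilerplate elements."""
    for only_main in (True, False):
        parser = _TextCollector(only_main)
        parser.feed(html)
        parser.close()
        if parser.chunks:
            return "\n".join(parser.chunks)
    return ""
```

Pages that do not use landmark roles at all will still return most of their visible text through the fallback, so this heuristic is only a first filter.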

MRuecklCC commented 2 years ago

Jörg from @yovisto found the amazing Python package trafilatura, which seems to do a pretty good job of solving this issue.

It also provides other metadata, all of which we could expose via metalookup.
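For reference, a minimal sketch of how the package could be called on HTML we have already scraped. The wrapper name `extract_with_trafilatura` is hypothetical; the only library call used is `trafilatura.extract`, which takes an HTML string and returns the extracted text, or `None` if nothing could be extracted. The import is done lazily, on the assumption that trafilatura would be an optional third-party dependency.

```python
def extract_with_trafilatura(html):
    """Return the main text of an HTML document, or None if extraction fails."""
    # Lazy import: trafilatura is a third-party package (pip install trafilatura).
    import trafilatura

    return trafilatura.extract(html)
```

Since the function takes the raw HTML string as input, it would fit a setup where the scraper has already downloaded the page.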

lummerland commented 7 months ago

Resolved with https://github.com/openeduhub/text-extraction