hbz / oerindex

Moved to https://gitlab.com/oersi/oersi-etl/
Apache License 2.0

Process HTML DOM elements with metafacture-fix #2

Closed fsteeg closed 4 years ago

fsteeg commented 4 years ago

For a scenario as in https://github.com/programmieraffe/oerhoernchen20#technical-background, looking at a sitemap like https://www.hoou.de/sitemap.xml, finding OER materials like https://www.hoou.de/materials/tutorial-lernen-lernen, we want to process that resource with metafacture-fix to create JSON output that can be indexed with Elasticsearch. Fixes should be configurable in a UI like http://test.lobid.org/fix.
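The first step of that scenario, collecting material URLs from a sitemap, could look roughly like the sketch below. It is written in Python, not Fix or Flux, and it runs on an inline sample rather than fetching the real sitemap. The `/about` entry, the `material_urls` helper and the prefix filter are assumptions made for illustration; only the materials URL comes from this issue.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Inline stand-in for https://www.hoou.de/sitemap.xml.
# The /about entry is hypothetical and only there to show the filtering.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.hoou.de/materials/tutorial-lernen-lernen</loc></url>
  <url><loc>https://www.hoou.de/about</loc></url>
</urlset>"""


def material_urls(sitemap_xml, prefix="https://www.hoou.de/materials/"):
    """Return all <loc> values in the sitemap that start with the prefix."""
    root = ET.fromstring(sitemap_xml)
    locs = (
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    )
    return [url for url in locs if url.startswith(prefix)]


if __name__ == "__main__":
    for url in material_urls(SAMPLE_SITEMAP):
        print(url)
```

Each URL returned this way would then be the input for the HTML-to-JSON conversion.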

fsteeg commented 4 years ago

With HTML input support in https://github.com/metafacture/metafacture-core/issues/312 and URL input support in https://github.com/metafacture/metafacture-fix/issues/6, we can use metafacture-fix to convert the full DOM structure of something like https://www.hoou.de/materials/tutorial-lernen-lernen to JSON:

http://test.lobid.org/fix/xtext-service/run?flux="https://www.hoou.de/materials/tutorial-lernen-lernen"|open-http|decode-html|fix|encode-json(prettyPrinting="true")&fix=map(_else)&data=

To pick out just the title and the description, in http://test.lobid.org/fix, we can use a Fix like:

map(html.head.title.value, title)
map(html.body.div.div.div.div.div.div.div.p.value, description)

With the Flux from the link above:

"https://www.hoou.de/materials/tutorial-lernen-lernen"|open-http|decode-html|fix|encode-json(prettyPrinting="true")

We get some concise JSON back:

{
  "title" : "Tutorial: Lernen lernen - HOOU",
  "description" : "Das Bewusstsein und die Kenntnis über Ihren Lernstil kann Ihnen helfen, Ihren Lernansatz und damit auch den Lernerfolg zu optimieren. In diesem Modul reflektieren Sie Ihren Lernstil und dessen Implikationen und entwickeln individuelle Lernstrategien. Zudem hilft Ihnen das Wissen über unterschiedliche Lernstile beim Lernen in der Gruppe oder bei der Teamarbeit."
}
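To make the idea behind the two map rules concrete, here is a small Python sketch of dotted-path mapping over a nested record. It is not how metafacture-fix is implemented. The nested dict is a simplified, assumed stand-in for the decoded DOM (shorter path, abbreviated description), and `resolve` and `apply_mappings` are hypothetical helper names.

```python
def resolve(node, path):
    """Collect every value reachable by following a dotted path.

    Lists are treated as repeated child elements, so each item is followed.
    """
    current = [node]
    for key in path.split("."):
        following = []
        for item in current:
            if isinstance(item, dict) and key in item:
                child = item[key]
                following.extend(child if isinstance(child, list) else [child])
        current = following
    return current


def apply_mappings(record, mappings):
    """Build a flat output record from (source path, target field) pairs."""
    output = {}
    for source, target in mappings:
        values = resolve(record, source)
        if values:
            output[target] = values[0]  # keep the first match only
    return output


# Simplified stand-in for the decoded HTML structure (an assumption).
SAMPLE_RECORD = {
    "html": {
        "head": {"title": {"value": "Tutorial: Lernen lernen - HOOU"}},
        "body": {
            "div": [
                {"span": {"value": "navigation"}},
                {"div": {"p": {"value": "Das Bewusstsein und die Kenntnis über Ihren Lernstil kann Ihnen helfen, Ihren Lernansatz und damit auch den Lernerfolg zu optimieren."}}},
            ]
        },
    }
}

MAPPINGS = [
    ("html.head.title.value", "title"),
    ("html.body.div.div.p.value", "description"),
]

if __name__ == "__main__":
    print(apply_mappings(SAMPLE_RECORD, MAPPINGS))
```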

So this basically works. However, the path html.body.div.div.div.div.div.div.div.p.value is brittle: if the page's internal structure changes, the Fix has to change too. It would be better to have support for conditionals in the Fix, so that we can pick the content of the html.head.meta element whose name is description, see https://github.com/metafacture/metafacture-fix/issues/10.
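As an illustration of why selecting by attribute is more robust than a positional path, the following Python sketch reads the description from a meta element by its name. It uses the standard library `html.parser` and is not the Fix conditional syntax, which is what the linked issue is about. Both sample pages and their description text are made up, and `extract_description` is a hypothetical helper.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Remember the content of the first <meta name="description"> element."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attributes = dict(attrs)
        # The condition: only the meta element named "description" counts.
        if (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content")


def extract_description(html_text):
    """Return the meta description of an HTML page, or None if absent."""
    parser = MetaDescriptionParser()
    parser.feed(html_text)
    return parser.description


# Two made-up pages with the same head but differently nested bodies.
PAGE_A = """<html><head><title>Example</title>
<meta name="viewport" content="width=device-width">
<meta name="description" content="An example description.">
</head><body><div><div><p>Body text</p></div></div></body></html>"""

PAGE_B = PAGE_A.replace(
    "<div><div><p>Body text</p></div></div>",
    "<main><section><div><p>Body text</p></div></section></main>",
)

if __name__ == "__main__":
    # The body layout differs, the extracted description does not.
    print(extract_description(PAGE_A))
    print(extract_description(PAGE_B))
```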

fsteeg commented 4 years ago

Both this and https://github.com/hbz/oerindex/issues/3 basically work (we get a title and a description). Maybe it makes sense to continue with the bigger picture (collecting sources from the sitemap.xml, indexing the results) instead of improving the way we extract the description at this point?

acka47 commented 4 years ago

Maybe it makes sense to continue with the bigger picture (collecting sources from the sitemap.xml, indexing the results) instead of improving the way we extract the description at this point?

+1

acka47 commented 4 years ago

Moved to https://gitlab.com/oersi/oersi-etl/-/issues/2. Closing.