hbz / oerindex

Moved to https://gitlab.com/oersi/oersi-etl/
Apache License 2.0

Process embedded JSON in HTML with metafacture-fix #3

Closed fsteeg closed 4 years ago

fsteeg commented 4 years ago

For a scenario as in https://github.com/programmieraffe/oerhoernchen20#technical-background, looking at a resource with embedded JSON like https://www.oerbw.de/edu-sharing/components/render/4aed7529-dd02-44d0-b518-4640a8e8902f, we want to process that resource with metafacture-fix to create JSON output that can be indexed with Elasticsearch. Fixes should be configurable in a UI like http://test.lobid.org/fix.

fsteeg commented 4 years ago

With HTML input support in metafacture/metafacture-core#312 and URL input support in metafacture/metafacture-fix#6, we can access the script content with metafacture-fix:

http://test.lobid.org/fix/xtext-service/run?flux="https://www.oerbw.de/edu-sharing/components/render/4aed7529-dd02-44d0-b518-4640a8e8902f"|open-http|decode-html|fix|encode-json(prettyPrinting="true")&fix=map(html.head.script.value,json)&data=

My initial idea was to set up something like this:

"https://www.oerbw.de/..." | open-http | decode-html | fix("1.fix") | decode-json | fix("2.fix") | encode-json

That is, parse the HTML, pick out the JSON data in the first Fix (with something like the map(html.head.script.value,json) Fix in the link above), decode that as JSON, pass it to a second Fix to pick out the fields we need, and encode the final JSON for the index. However, since the data flowing out of the first Fix would have to be an entire record, not a field, this would not exactly fit into the workflow architecture. Instead, it might make more sense to support embedded JSON in the JsonDecoder, supporting workflows like:

"https://www.oerbw.de/..." | open-http | decode-json | fix | encode-json

That is, we decode an HTML document as JSON, by looking for embedded JSON in the HTML.

acka47 commented 4 years ago

it might make more sense to support embedded JSON in the JsonDecoder, supporting workflows like:

"https://www.oerbw.de/..." | open-http | decode-json | fix | encode-json

This looks fine. However, it should then first try to get JSON-only via accept header ($ curl -H "accept: application/json" https://www.oerbw.de/edu-sharing/components/render/4aed7529-dd02-44d0-b518-4640a8e8902f) as this would be the ideal way if servers provide it. I don't know whether it makes sense to build those two approaches into one decode-json command.

fsteeg commented 4 years ago

However, it should then first try to get JSON-only via accept header ($ curl -H "accept: application/json" https://www.oerbw.de/edu-sharing/components/render/4aed7529-dd02-44d0-b518-4640a8e8902f) as this would be the ideal way if servers provide it.

The accept header is actually a config option of the open-http step (see https://github.com/metafacture/metafacture-core/commit/9be4ec0d818a319316e7b434df0802f29d27cfeb), so if the service supported it, we could set up the Flux like this:

"https://www.oerbw.de/edu-sharing/components/render/4aed7529-dd02-44d0-b518-4640a8e8902f" | open-http(accept="application/json") | decode-json | fix | encode-json

You can test that in http://test.lobid.org/fix with a Flux like:

"http://lobid.org/gnd/5093230-5" | open-http(accept="application/json") | as-lines

"http://lobid.org/gnd/5093230-5" | open-http(accept="text/html") | as-lines
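The same content negotiation can also be tried outside of Flux. Below is a minimal Python sketch using only the standard library. The helper names `build_request` and `fetch` are made up for illustration and are not part of Metafacture, and whether a given server honors the header depends on the service.

```python
import urllib.request


def build_request(url: str, accept: str) -> urllib.request.Request:
    """Build a GET request that asks the server for a specific media type."""
    return urllib.request.Request(url, headers={"Accept": accept})


def fetch(url: str, accept: str, timeout: float = 10.0) -> str:
    """Send the request and return the response body as text (needs network access)."""
    with urllib.request.urlopen(build_request(url, accept), timeout=timeout) as response:
        # Fall back to UTF-8 if the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

Calling `fetch("http://lobid.org/gnd/5093230-5", "application/json")` and then the same URL with `"text/html"` would mirror the two Flux lines above.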

fsteeg commented 4 years ago

Instead, it might make more sense to support embedded JSON in the JsonDecoder, supporting workflows like:

"https://www.oerbw.de/..." | open-http | decode-json | fix | encode-json

That is, we decode an HTML document as JSON, by looking for embedded JSON in the HTML.

When we discussed this today, @dr0i objected that this is rather confusing, as we would open an HTML document with a JSON decoder. Additionally, it would pull the jsoup dependency into the metafacture-json project. Instead, we came up with a small module that only extracts the JSON from the HTML. This could be part of metafacture-html and would have no dependency on metafacture-json. It would be used like this:

"https://www.oerbw.de/..." | open-http | extract-json | decode-json | fix | encode-json

fsteeg commented 4 years ago

Deployed to http://test.lobid.org/fix:

Flux:

"https://www.oerbw.de/edu-sharing/components/render/4aed7529-dd02-44d0-b518-4640a8e8902f" | open-http | extract-script | decode-json | fix | encode-json(prettyPrinting="true")

Fix:

map(name, title)
map(description, description)

Output:

{
  "description" : "Das Lehrvideo ist Teil der Lehrveranstaltung \"Mathematik für Designer\". In diesem 3. Lehrvideo wird die projektive Geometrie der Ebene als Grundlage von homogenen Koordinaten erklärt. Im nächsten Video wird dann erklärt, wie affine Abbildungen mit homogenen Koordinaten dargestellt werden können.",
  "title" : "Projektive Geometrie und Homogene Koordinaten"
}
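To illustrate what the two map rules do to the decoded record, here is a rough Python sketch. It assumes, judging only from the output above, that each rule copies a source field to a target name and that unmapped fields are dropped. `apply_mappings` and the sample record are hypothetical and not part of metafacture-fix.

```python
def apply_mappings(record: dict, mappings: list) -> dict:
    """Copy each source field to its target name; fields without a rule are dropped."""
    return {target: record[source] for source, target in mappings if source in record}


# Hypothetical, shortened stand-in for the JSON decoded from the script element.
record = {
    "name": "Projektive Geometrie und Homogene Koordinaten",
    "description": "Das Lehrvideo ist Teil der Lehrveranstaltung ...",
    "someOtherField": "not mapped, so not emitted",
}

# Same (source, target) pairs as map(name, title) and map(description, description).
result = apply_mappings(record, [("name", "title"), ("description", "description")])
```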

I've used extract-script (instead of extract-json) to make it both more generic (it can get any script) and more HTML-specific (since it's part of metafacture-html). For now it always takes the first script. If we need other scripts in other examples, we could easily extend the component to support an index, e.g. to get the second script: extract-script("2").
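For readers who want to see the idea outside of Metafacture, here is a minimal Python sketch of an extract-script step with an optional index, followed by JSON decoding. It uses the standard library's `html.parser` rather than jsoup, so it is only an approximation of the actual component. `ScriptCollector`, `extract_script`, and the sample HTML are assumptions made up for illustration.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element in document order."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the current entry.
        if self._in_script:
            self.scripts[-1] += data


def extract_script(html: str, index: int = 1) -> str:
    """Return the content of the index-th script (1-based; defaults to the first)."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    if not 1 <= index <= len(collector.scripts):
        raise IndexError(f"document has {len(collector.scripts)} script(s), asked for {index}")
    return collector.scripts[index - 1]


# Hypothetical page with embedded JSON in the first script element.
sample = (
    '<html><head><script type="application/ld+json">{"name": "Example"}</script>'
    "<script>var x = 1;</script></head><body></body></html>"
)
record = json.loads(extract_script(sample))  # corresponds to extract-script | decode-json
```

Keeping the extractor free of any JSON handling reflects the design discussed above: it only hands a string to the next step, so the HTML side needs no dependency on the JSON side.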

acka47 commented 4 years ago

+1

acka47 commented 4 years ago

Moved to https://gitlab.com/oersi/oersi-etl/-/issues/3. Closing.