Closed fsteeg closed 4 years ago
With HTML input support in metafacture/metafacture-core#312 and URL input support in metafacture/metafacture-fix#6, we can access the script content with metafacture-fix:
My initial idea was to set up something like this:
"https://www.oerbw.de/..." | open-http | decode-html | fix("1.fix") | decode-json | fix("2.fix") | encode-json
That is, parse the HTML, pick out the JSON data in the first Fix (with something like the map(html.head.script.value,json) Fix in the link above), decode that as JSON, pass it to a second Fix to pick out the fields we need, and encode the final JSON for the index. However, since the data flowing out of the first Fix would have to be an entire record, not a field, this would not exactly fit into the workflow architecture. Instead, it might make more sense to support embedded JSON in the JsonDecoder, supporting workflows like:
"https://www.oerbw.de/..." | open-http | decode-json | fix | encode-json
That is, we decode an HTML document as JSON, by looking for embedded JSON in the HTML.
it might make more sense to support embedded JSON in the JsonDecoder, supporting workflows like:
"https://www.oerbw.de/..." | open-http | decode-json | fix | encode-json
This looks fine. However, it should then first try to get JSON-only via accept header ($ curl -H "accept: application/json" https://www.oerbw.de/edu-sharing/components/render/4aed7529-dd02-44d0-b518-4640a8e8902f) as this would be the ideal way if servers provide it. I don't know whether it makes sense to build those two approaches into one decode-json command.
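To make the content negotiation idea concrete, here is a minimal Python sketch (not Metafacture code). It builds a request with an Accept header and checks whether the response is actually JSON, which is what a caller would need in order to decide between plain JSON decoding and looking for embedded JSON. The helper names build_request, is_json_content_type and fetch are made up for illustration, and treating any "+json" media type as JSON as well as decoding the body as UTF-8 are assumptions.

```python
import urllib.request

RENDER_URL = "https://www.oerbw.de/edu-sharing/components/render/4aed7529-dd02-44d0-b518-4640a8e8902f"


def build_request(url, accept="application/json"):
    """Build (but do not send) a GET request carrying the given Accept header."""
    return urllib.request.Request(url, headers={"Accept": accept})


def is_json_content_type(content_type):
    """Return True if a Content-Type header value denotes JSON (assumption: '+json' counts)."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=UTF-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/json" or media_type.endswith("+json")


def fetch(url):
    """Ask for JSON and report whether the server actually returned JSON."""
    with urllib.request.urlopen(build_request(url)) as response:
        body = response.read().decode("utf-8")  # assumption: UTF-8 body
        return is_json_content_type(response.headers.get("Content-Type")), body
```

fetch(RENDER_URL) needs network access. If it reports False, the server ignored the Accept header and the body would have to be treated as HTML.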
However, it should then first try to get JSON-only via accept header ($ curl -H "accept: application/json" https://www.oerbw.de/edu-sharing/components/render/4aed7529-dd02-44d0-b518-4640a8e8902f) as this would be the ideal way if servers provide it.
The accept header is actually a config option of the open-http step (see https://github.com/metafacture/metafacture-core/commit/9be4ec0d818a319316e7b434df0802f29d27cfeb), so if the service supported it, we could set up the Flux like this:
"https://www.oerbw.de/edu-sharing/components/render/4aed7529-dd02-44d0-b518-4640a8e8902f" | open-http(accept="application/json") | decode-json | fix | encode-json
You can test that in http://test.lobid.org/fix with a Flux like:
"http://lobid.org/gnd/5093230-5" | open-http(accept="application/json") | as-lines
"http://lobid.org/gnd/5093230-5" | open-http(accept="text/html") | as-lines
Instead, it might make more sense to support embedded JSON in the JsonDecoder, supporting workflows like:
"https://www.oerbw.de/..." | open-http | decode-json | fix | encode-json
That is, we decode an HTML document as JSON, by looking for embedded JSON in the HTML.
When we discussed this today, @dr0i objected that this is rather confusing, as we would open an HTML document with a JSON decoder. Additionally, it would pull the jsoup dependency into the metafacture-json project. Instead, we came up with a small module that only extracts the JSON from the HTML. This could be part of metafacture-html and would have no dependency on metafacture-json. It would be used like this:
"https://www.oerbw.de/..." | open-http | extract-json | decode-json | fix | encode-json
Deployed to http://test.lobid.org/fix:
Flux:
"https://www.oerbw.de/edu-sharing/components/render/4aed7529-dd02-44d0-b518-4640a8e8902f" | open-http | extract-script | decode-json | fix | encode-json(prettyPrinting="true")
Fix:
map(name, title)
map(description, description)
Output:
{
  "description" : "Das Lehrvideo ist Teil der Lehrveranstaltung \"Mathematik für Designer\". In diesem 3. Lehrvideo wird die projektive Geometrie der Ebene als Grundlage von homogenen Koordinaten erklärt. Im nächsten Video wird dann erklärt, wie affine Abbildungen mit homogenen Koordinaten dargestellt werden können.",
  "title" : "Projektive Geometrie und Homogene Koordinaten"
}
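For readers not familiar with Fix, the effect of the two map lines can be approximated in Python roughly as follows. This is only a sketch of the behaviour visible in the output above: mapped fields are copied to their target names and everything else is dropped. It is not how metafacture-fix is implemented. The apply_maps helper and the sample record are made up for illustration.

```python
import json


def apply_maps(record, maps):
    """Copy each source field to its target name; unmapped fields are dropped."""
    return {target: record[source] for source, target in maps if source in record}


# Mirrors map(name, title) and map(description, description).
MAPS = [("name", "title"), ("description", "description")]

# Hypothetical input record, standing in for the decoded embedded JSON.
source_record = {
    "name": "Example title",
    "description": "Example description",
    "license": "not mapped, so not in the output",
}

result = apply_maps(source_record, MAPS)
print(json.dumps(result, indent=2, sort_keys=True, ensure_ascii=False))
```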
I've used extract-script (instead of extract-json) to have it both more generic (can get any script) and more HTML-specific (since it's part of metafacture-html). For now it always takes the first script. If we need other scripts in other examples we could easily extend the component to support an index, e.g. to get the second script: extract-script("2").
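As a rough illustration of the proposed index option, here is a Python sketch using the standard library HTML parser (the actual component is Java and based on jsoup, so this is not its implementation). The 1-based string index follows the extract-script("2") example. The names ScriptCollector and extract_script, the sample HTML, and raising an error for an out-of-range index are assumptions. Extraction returns plain text and JSON decoding stays a separate step, as in the Flux.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the text content of every <script> element, in document order."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.scripts[-1] += data


def extract_script(html, index="1"):
    """Return the content of the script at the given 1-based index (default: first)."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    position = int(index)
    if position < 1 or position > len(collector.scripts):
        raise IndexError(f"no script at index {index}")
    return collector.scripts[position - 1]


# Made-up sample page with two scripts; the second one holds JSON.
SAMPLE_HTML = (
    "<html><head><script>var a = 1;</script>"
    '<script>{"name": "Example title"}</script>'
    "</head><body><p>Rendered page</p></body></html>"
)

first = extract_script(SAMPLE_HTML)              # like extract-script
second = extract_script(SAMPLE_HTML, "2")        # like extract-script("2")
record = json.loads(second)                      # like the following decode-json step
```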
+1
Moved to https://gitlab.com/oersi/oersi-etl/-/issues/3. Closing.
For a scenario as in https://github.com/programmieraffe/oerhoernchen20#technical-background, looking at a resource with embedded JSON like https://www.oerbw.de/edu-sharing/components/render/4aed7529-dd02-44d0-b518-4640a8e8902f, we want to process that resource with metafacture-fix to create JSON output that can be indexed with Elasticsearch. Fixes should be configurable in a UI like http://test.lobid.org/fix.