hbz / oerindex

Moved to https://gitlab.com/oersi/oersi-etl/
Apache License 2.0

Prototype: sitemap to index #4

Closed (fsteeg closed 4 years ago)

fsteeg commented 4 years ago

Following up on https://github.com/hbz/oerindex/issues/2#issuecomment-583291640: Given a sitemap URL like https://www.hoou.de/sitemap.xml, a URL prefix like https://www.hoou.de/materials/, an Elasticsearch index location URL like http://localhost:9200, a Flux like:

open-http | decode-html | fix | encode-json

And a Fix like:

map(html.head.title.value, title)
map(html.body.div.div.div.div.div.div.div.p.value, description)

We want to:

The long-term idea is to do all of this in a single new UI. For this prototype, I suggest we provide the config parameters (sitemap, prefix, index, flux, fix) in some kind of config file and run the workflow from the command line. The Flux and the Fix can be tested at http://test.lobid.org/fix.
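To make the config-file idea concrete, here is a minimal Python sketch of what loading the five parameters could look like. The JSON layout, the `WorkflowConfig` class, and the `load_config` helper are all assumptions made for illustration; the issue does not settle on a file format.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class WorkflowConfig:
    """Hypothetical container for the five parameters named in the issue."""
    sitemap: str      # sitemap URL to read
    prefix: str       # only URLs starting with this prefix are processed
    index: str        # Elasticsearch location
    flux: str         # the Flux pipeline, as text
    fix: List[str]    # the Fix rules, one per entry (an assumed layout)


def load_config(text: str) -> WorkflowConfig:
    """Parse a JSON config; a missing or unknown key raises TypeError."""
    return WorkflowConfig(**json.loads(text))


# Example values taken from the issue text; the JSON structure is assumed.
EXAMPLE_CONFIG = """
{
  "sitemap": "https://www.hoou.de/sitemap.xml",
  "prefix": "https://www.hoou.de/materials/",
  "index": "http://localhost:9200",
  "flux": "open-http | decode-html | fix | encode-json",
  "fix": [
    "map(html.head.title.value, title)",
    "map(html.body.div.div.div.div.div.div.div.p.value, description)"
  ]
}
"""

config = load_config(EXAMPLE_CONFIG)
```

A command-line runner would read such a file from a path argument and hand the values to the workflow; that part is left out here because it depends on how the workflow itself ends up being invoked.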

fsteeg commented 4 years ago

Basically, using Flux as the config file for all the parameters would look something like this:

"https://www.hoou.de/sitemap.xml"
| list-sitemap(prefix="https://www.hoou.de/materials/")
| open-http
| decode-html
| fix("
  map(html.head.title.value, title)
  map(html.body.div.div.div.div.div.div.div.p.value, description)")
| encode-json
| index-elasticsearch("http://localhost:9200/")

This would require new list-sitemap and index-elasticsearch metafacture modules.
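As a rough illustration of what the two new modules would need to do, here is a Python sketch (not Metafacture code). `list_sitemap` and `bulk_body` are hypothetical names, and the index name `oer` in the usage note is an assumption. The sketch assumes a plain `urlset` sitemap in the standard sitemaps.org namespace and formats documents for the Elasticsearch bulk endpoint; it does not perform any HTTP requests.

```python
import json
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, List

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_sitemap(sitemap_xml: str, prefix: str) -> List[str]:
    """Return all <loc> URLs in the sitemap that start with the given prefix."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [url for url in urls if url.startswith(prefix)]


def bulk_body(docs: Iterable[Dict], index_name: str) -> str:
    """Build a newline-delimited JSON body for the Elasticsearch bulk API.

    Each document is preceded by an "index" action line, and the body
    ends with a newline, as the bulk API requires.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"
```

In use, the sitemap would be fetched from https://www.hoou.de/sitemap.xml and filtered with the prefix https://www.hoou.de/materials/, each resulting page would go through the decode and fix steps, and the output of something like `bulk_body(docs, "oer")` would be sent to the `_bulk` endpoint under http://localhost:9200/.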

fsteeg commented 4 years ago

We just discussed this as a fast way to a first index: