fsteeg closed this issue 4 years ago
Basically, using Flux as the config file for all the parameters would look something like this:
"https://www.hoou.de/sitemap.xml"
| list-sitemap(prefix="https://www.hoou.de/materials/")
| open-http
| decode-html
| fix("
map(html.head.title.value, title)
map(html.body.div.div.div.div.div.div.div.p.value, description)")
| encode-json
| index-elasticsearch("http://localhost:9200/")
This would require new list-sitemap and index-elasticsearch Metafacture modules.
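To make the list-sitemap idea more concrete, here is a minimal sketch of the filtering step in Python rather than as a Metafacture module. It assumes the sitemap uses the standard sitemaps.org format. The function name `list_sitemap` and the inline sample sitemap are made up for illustration; a real run would fetch the sitemap over HTTP.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sample data (assumption); the real input would be
# the content of https://www.hoou.de/sitemap.xml.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.hoou.de/materials/example-a</loc></url>
  <url><loc>https://www.hoou.de/about</loc></url>
  <url><loc>https://www.hoou.de/materials/example-b</loc></url>
</urlset>"""


def list_sitemap(sitemap_xml, prefix):
    """Return the <loc> values of all <url> entries that start with prefix."""
    root = ET.fromstring(sitemap_xml)
    locs = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        text = (loc.text or "").strip()
        if text.startswith(prefix):
            locs.append(text)
    return locs


if __name__ == "__main__":
    for url in list_sitemap(SAMPLE_SITEMAP, "https://www.hoou.de/materials/"):
        print(url)
```

Each returned URL would then be passed on to the open-http step of the Flux.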
We just discussed this as a fast way to a first index: extract url, title, description, and license, then index them in Elasticsearch.
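As a rough illustration of that first index, the sketch below builds one record per page and serializes records in the newline-delimited JSON format that the Elasticsearch `_bulk` endpoint accepts. Several things here are assumptions: the description is simply taken from the first `<p>` element, the license is left empty because the thread does not say where it comes from, and the index name `oer`, the helper names, and the sample HTML are hypothetical.

```python
import json
from html.parser import HTMLParser

# Made-up sample page (assumption), standing in for a fetched HTML document.
SAMPLE_HTML = (
    "<html><head><title>Example course</title></head>"
    "<body><div><p>A short description.</p><p>More text.</p></div></body></html>"
)


class TitleAndFirstParagraph(HTMLParser):
    """Collect the text of <title> and of the first <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._current = None  # name of the field currently being filled
        self._seen_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = "title"
        elif tag == "p" and not self._seen_p:
            self._current = "description"
            self._seen_p = True

    def handle_endtag(self, tag):
        if tag in ("title", "p"):
            self._current = None

    def handle_data(self, data):
        if self._current:
            setattr(self, self._current, getattr(self, self._current) + data)


def build_record(url, html, license_info=None):
    """Build a record with the four fields discussed above."""
    parser = TitleAndFirstParagraph()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": parser.title.strip(),
        "description": parser.description.strip(),
        "license": license_info,  # source of the license is not specified
    }


def bulk_lines(records, index_name="oer"):
    """Serialize records as NDJSON: one action line, then one document line."""
    lines = []
    for record in records:
        action = {"index": {"_index": index_name, "_id": record["url"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"
```

The resulting string could be sent as the body of a POST request to the `_bulk` endpoint of the Elasticsearch instance, with the content type `application/x-ndjson`.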
Following up on https://github.com/hbz/oerindex/issues/2#issuecomment-583291640:
Given a sitemap URL like https://www.hoou.de/sitemap.xml, a URL prefix like https://www.hoou.de/materials/, an Elasticsearch index location URL like http://localhost:9200, a Flux like:
open-http | decode-html | fix | encode-json
And a Fix like:
We want to process each url entry in the sitemap whose loc starts with the given prefix.
The long-term idea is to do all of this in a single new UI. For this prototype, I suggest we provide the config parameters (sitemap, prefix, index, flux, fix) in some kind of config file and run the workflow from the command line. The Flux and the Fix can be tested at http://test.lobid.org/fix.
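One possible shape for such a config file is sketched below, assuming JSON as the format (the issue only says "some kind of config file"). The loader name `load_config` is hypothetical, and the sample Fix line is borrowed from the discussion in this thread rather than being a confirmed final Fix.

```python
import json

# The five config parameters named in the issue.
REQUIRED_KEYS = ("sitemap", "prefix", "index", "flux", "fix")

# Sample config using the example values from this issue (JSON is an assumption).
SAMPLE_CONFIG = """
{
  "sitemap": "https://www.hoou.de/sitemap.xml",
  "prefix": "https://www.hoou.de/materials/",
  "index": "http://localhost:9200",
  "flux": "open-http | decode-html | fix | encode-json",
  "fix": "map(html.head.title.value, title)"
}
"""


def load_config(text):
    """Parse the config text and check that all five parameters are present."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError("missing config parameters: " + ", ".join(missing))
    return config


if __name__ == "__main__":
    config = load_config(SAMPLE_CONFIG)
    for key in REQUIRED_KEYS:
        print(key, "=", config[key])
```

A command-line runner could read the file path from its arguments, call `load_config` on the file content, and then pass the values on to the workflow.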