metafacture / metafacture-core

Core package of the Metafacture tool suite for metadata processing.
https://metafacture.org
Apache License 2.0

Provide incoming string for `url` or path from `open-http`/`open-file` as variable #533

Open TobiasNx opened 1 month ago

TobiasNx commented 1 month ago

At the moment, the incoming URL string is no longer available once it has been consumed by open-http.

One useful scenario: we scrape a website that does not provide its own URL as metadata, and we want to be able to identify the source quickly. Another: when catching errors in a later step, the error message could state the _id as the source of the error.

There could also be a more abstract approach, since this would be useful for open-file as well, providing the file name as _id.

e.g.: https://metafacture.org/playground/?flux=%22https%3A//phet-dev.colorado.edu/html/build-an-atom/0.0.0-3/simple-text-only-test-page.html%22%0A%7C+open-http%28accept%3D%22application/xml%22%29%0A%7C+decode-html%0A%7C+fix%28%22copy_field%28%27_id%27%2C%27_id%27%29%22%29%0A%7C+encode-json%28prettyPrinting%3D%22true%22%29%0A%7C+print%0A%3B

Not sure where the value of _id comes from.

blackwinter commented 1 month ago

_id is the internal record identifier which is set automatically by some decoder/handler modules and which can be set manually (based on some literal value) with the change-id Flux command.

It cannot be set by input modules, because they don't know anything about records at that point. OTOH, the source location (URL, path) is no longer available when the decoder receives the stream, and there is (currently) no way to transport it out-of-band. Setting the ID to the source location would also mean that (potentially) multiple records would get the same ID, which would violate the uniqueness guarantee.
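To illustrate the uniqueness point, here is a minimal plain-Java sketch. It does not use any Metafacture classes; `Rec`, `decodeWithSourceAsId` and `distinctIds` are hypothetical names made up for this example, and the URL is a placeholder. It assumes a decoder that splits one input stream into several records and naively uses the source location as each record's ID.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; none of these types are Metafacture classes.
class SourceAsIdSketch {

    // Hypothetical record: an ID plus some payload.
    static final class Rec {
        final String id;
        final String payload;

        Rec(String id, String payload) {
            this.id = id;
            this.payload = payload;
        }
    }

    // Mimics a decoder that splits one stream into several records and
    // (naively) uses the source location as the ID of every record.
    static List<Rec> decodeWithSourceAsId(String sourceUrl, String stream) {
        List<Rec> records = new ArrayList<>();
        for (String line : stream.split("\n")) {
            records.add(new Rec(sourceUrl, line));
        }
        return records;
    }

    // Counts distinct IDs; fewer than records.size() means ID collisions.
    static int distinctIds(List<Rec> records) {
        Set<String> ids = new HashSet<>();
        for (Rec r : records) {
            ids.add(r.id);
        }
        return ids.size();
    }

    public static void main(String[] args) {
        List<Rec> records = decodeWithSourceAsId(
                "https://example.org/data.xml", "first\nsecond\nthird");
        // Prints: 3 records, 1 distinct ID(s)
        System.out.println(records.size() + " records, "
                + distinctIds(records) + " distinct ID(s)");
    }
}
```

As soon as one source yields more than one record, all of them share the same ID.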

It might, however, be possible to save the URL in a variable which can then be used in the transformation. Maybe along the following lines:

default inputUrl = "https://phet-dev.colorado.edu/html/build-an-atom/0.0.0-3/simple-text-only-test-page.html";

inputUrl
| open-http(accept="application/xml")
| decode-html
| fix("set_field('_id', '$[inputUrl]')", *)
| change-id
| fix("copy_field('_id', '_id')")
| encode-json(prettyPrinting="true")
| print
;

TobiasNx commented 1 month ago

I would be fine with a variable that could be used in the FIX and the FLUX.

It would help in this scenario.

It would also be nice to use the variable in e.g. logging contexts, or in other scenarios as a variable in the FLUX, but this would be an additional feature.

blackwinter commented 1 month ago

> I would be fine with a variable that could be used in the FIX and the FLUX.

So your initial use case is solved?

> It would also be nice to use the variable in e.g. logging contexts, or in other scenarios as a variable in the FLUX, but this would be an additional feature.

I'm not sure I understand this part. Do you mean that all variables should be included whenever anything is logged? And what other contexts are you referring to?

TobiasNx commented 1 month ago

> > I would be fine with a variable that could be used in the FIX and the FLUX.
>
> So your initial use case is solved?

I think if I could use the variable in the fix, my use case would be solved, yes. :)

> > It would also be nice to use the variable in e.g. logging contexts, or in other scenarios as a variable in the FLUX, but this would be an additional feature.
>
> I'm not sure I understand this part. Do you mean that all variables should be included whenever anything is logged? And what other contexts are you referring to?

One scenario where the variable could be handy is if I could configure the logging message and add the variable to the output. Another is if the file name were passed on as a variable: I could then use it to write a file with that variable as its name. But these are additional features. What would be good in the first place is to have the variable available for FIX and for other FLUX commands.

blackwinter commented 1 month ago

> I think if I could use the variable in the fix, my use case would be solved, yes. :)

But you can. Doesn't the proposed solution work for you?

TobiasNx commented 1 month ago

Ahh, now I see the specific aspect of your approach. I thought you were suggesting that the opener module would create the variable, but you were not.

Something like this:

sitemap
| oersi.SitemapReader(wait=input_wait, limit=input_limit, urlPattern=".*/course/.*")
| open-http(input-to-variable="inputUrl")
| decode-html
| fix("set_field('_id', '$[inputUrl]')", *)
| change-id
| fix("copy_field('_id', '_id')")
| encode-json(prettyPrinting="true")
| print
;

Instead, you would define the variable beforehand.

This would not solve my use case, since you have to provide/configure the variable outside of the flux workflow itself. In our scenario, the use case would be to read a sitemap via the sitemap reader in oersi, then open the HTML and fetch data. I do not know the data beforehand.

Perhaps another, more general solution would be a flux module that sets the incoming string as a variable.

sitemap
| oersi.SitemapReader(wait=input_wait, limit=input_limit, urlPattern=".*/course/.*")
| string-to-variable("inputUrl")
| open-http(header=user_agent_header)
| decode-html
| fix("set_field('_id', '$[inputUrl]')", *)
| change-id
| fix("copy_field('_id', '_id')")
| encode-json(prettyPrinting="true")
| print
;
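
The behaviour I have in mind for `string-to-variable` could be modelled roughly as follows. This is only a plain-Java sketch of the idea, not Metafacture code: `StringToVariable`, `run` and the variable map are hypothetical, the downstream lambda merely stands in for `open-http | decode-html | fix(...)`, and the URLs are placeholders. It assumes that each incoming string is processed completely before the next one arrives.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Plain-Java model of the idea; these are NOT Metafacture classes or APIs.
class StringToVariableSketch {

    // Hypothetical pass-through stage: stores each incoming string under a
    // variable name, then forwards the string unchanged.
    static final class StringToVariable implements Consumer<String> {
        private final String name;
        private final Map<String, String> vars;
        private final Consumer<String> downstream;

        StringToVariable(String name, Map<String, String> vars,
                         Consumer<String> downstream) {
            this.name = name;
            this.vars = vars;
            this.downstream = downstream;
        }

        @Override
        public void accept(String value) {
            vars.put(name, value);     // overwritten for every incoming string
            downstream.accept(value);  // forward the string unchanged
        }
    }

    // Runs a toy pipeline over the given URLs and returns one line per URL,
    // tagged with the value the variable holds while that URL is processed.
    static List<String> run(List<String> urls) {
        Map<String, String> vars = new HashMap<>();
        List<String> out = new ArrayList<>();
        // Stand-in for the later stages: reads the variable, not its own input.
        Consumer<String> laterStages = url -> out.add("_id=" + vars.get("inputUrl"));
        Consumer<String> stage = new StringToVariable("inputUrl", vars, laterStages);
        urls.forEach(stage);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(
                "https://example.org/course/1",
                "https://example.org/course/2")));
    }
}
```

Since the variable is overwritten for every incoming string, this only gives the right value as long as the later stages read it before the next string comes in. Whether variables in the FLUX can be updated at run time like this is something I don't know.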