metafacture / metafacture-core

Core package of the Metafacture tool suite for metadata processing.
https://metafacture.org
Apache License 2.0
69 stars 34 forks source link

Add or enhance a function to extract JSON-Records from an JSON-API #382

Closed TobiasNx closed 3 years ago

TobiasNx commented 3 years ago

While we are able to extract JSON records which are arrayed at the top level in an JSON file we are not able to extract JSON records from an JSON API that has the records in an array in an (sub-)field. At the moment we can't extract or split the records. The JSON file received via the JSON-API is extracted as one record:

"https://imoox.at/mooc/local/moochubs/classes/webservice.php"
| open-http(accept="application/json")
| as-lines //as-records has the same result
| decode-json
...

Example for file: https://imoox.at/mooc/local/moochubs/classes/webservice.php

In the field "data" there are the JSON records as objects. These objects should each be retrieved as single records.

Functional Review: @TobiasNx Code Review: @dr0i

blackwinter commented 3 years ago

I guess something like a RecordPathFilter would be actually nice to have. More or less a generic version of org.metafacture.xml.XmlElementSplitter. Maybe based on XPath or JSONPath?

...
| decode-json
| filter-records-by-path("$.data")
...
fsteeg commented 3 years ago

Implemented with a JSON path as suggested by @blackwinter, but as an option in JsonDecoder, because we have the full JSON to query the JSON path against at that point: https://github.com/metafacture/metafacture-core/commit/db8f5a1cd7f44ebe00361b544433dd6914bf81e5

TobiasNx commented 3 years ago

Hei, it works fine for that +1 Cool that this worked out so fast.

In my opinion the function code needs some documentation about the options/attributes that can be selected. For a new user this is not very instructive.

fsteeg commented 3 years ago

Assigned and requested review from @dr0i in https://github.com/metafacture/metafacture-core/pull/384, unassigning myself here.

dr0i commented 3 years ago

In my opinion the function code needs some documentation about the options/attributes that can be selected. For a new user this is not very instructive.

Updated the flux-commands.md (pending PR https://github.com/metafacture/metafacture-documentation/pull/14/files). @TobiasNx: may also be a good idea to document that option, especially the "splitting" capability, at other places of documentation - maybe to the cookbook.

blackwinter commented 3 years ago

Aside from the unfortunate fact that the proposed implementation leads to parsing the JSON document twice, it also means loading all extracted records into memory simultaneously. This is a potentially serious limitation. I assume we'd have to implement the filtering mechanism ourselves (in terms of our incremental parsing) if we wanted to avoid those downsides. In which case JSON Pointer might be the simpler specification to implement while still satisfying the current use case.

Finally, I'd still prefer a generic stream filter rather than extending each individual format decoder ad hoc whenever the need arises...

fsteeg commented 3 years ago

the proposed implementation leads to parsing the JSON document twice

Hm, does it? If the recordPath is set (if it isn't set, it retains the old behavior), the full record is parsed once, to apply the JSON path, and then each field value treated as a record is parsed once. Oh, is that what you mean, that technically each subrecord is parsed twice (once as part of the full document, once as a record on its own)?

it also means loading all extracted records into memory simultaneously

Right, but if I'm not mistaken, all that content has already been loaded into memory as a string when passed to JsonDecoder. Not as a JSON object though, so it will consume more memory. But not in the order of all records in memory vs only one record in memory.

I think there is great benefit in providing (optional) full JSON path support in our JSON decoder. It provides a very flexible mechanism to query any JSON API for records. And the performance cost seems reasonable to me.

However, I also like the idea of a generic stream filter, since it would unify different current approaches (the mentioned XML splitting, maybe also extract-element from the metafacture-html module, and this use case here). What I don't understand though, is how we would use a JSON path (or pointer, which would work for our use case here as well) at that point, where we have events, not JSON? Am I missing something here? Are you thinking of basically implementing a JSON pointer syntax for our event stream, without actual JSON involved in the process? But wouldn't it make more sense to use our own flattened event name syntax (like data[].*) then?

Since we need this functionality in OERSI, I'll merge the approved PR https://github.com/metafacture/metafacture-core/pull/384 for now. We should reconsider if we want to stick with this for our next actual release or if we have a better solution for this use case by then. So feel free to reopen this or open a new issue at any time.

blackwinter commented 3 years ago

that technically each subrecord is parsed twice (once as part of the full document, once as a record on its own)?

Yes, that's what I meant.

all that content has already been loaded into memory as a string when passed to JsonDecoder.

Right, I didn't consider this. There are additional data structures/objects with your approach so both memory consumption and GC pressure increase, but not in the way I initially assumed.

Are you thinking of basically implementing a JSON pointer syntax for our event stream, without actual JSON involved in the process?

Exactly, implementing some path/filter syntax in terms of our stream events (similar, though more involved, to what XmlElementSplitter does).

But wouldn't it make more sense to use our own flattened event name syntax (like data[].*) then?

Indeed, I thought of that after posting my comment. We're already using (something like) this with idKey in JsonToElasticsearchBulk. It might make sense to reuse existing, even if less powerful, syntax instead of implementing a new one. (Would also be compatible with XmlElementSplitter, if I'm not mistaken.)

blackwinter commented 3 years ago

But wouldn't it make more sense to use our own flattened event name syntax (like data[].*) then?

FTR, that would be EntityPathTracker.getCurrentPath()/.getCurrentPathWith(), right?

fsteeg commented 3 years ago

FTR, that would be EntityPathTracker.getCurrentPath()/.getCurrentPathWith(), right?

Yes, that's what I was thinking of.

fsteeg commented 3 years ago

Created a new issue to follow up on the generic approach: https://github.com/metafacture/metafacture-core/issues/385.

blackwinter commented 3 years ago

We should reconsider if we want to stick with this for our next actual release or if we have a better solution for this use case by then.

Should we revert this now that #385 is resolved? (Assuming it actually satisfies the use case.)

JSONPath is more powerful, though, so it might still be preferable when decoding JSON.

If we decide to keep, I'd like to get rid of the List<String> in JsonDecoder.process().

fsteeg commented 3 years ago

Should we revert this now that #385 is resolved? (Assuming it actually satisfies the use case.) JSONPath is more powerful, though, so it might still be preferable when decoding JSON.

So @TobiasNx tried it for our OERSI use case and it seems like #385 works here as well. At the same time, I'm using the JsonPath support for an experimental workflow to process data coming from an API returning a JSON array (which plain JsonDecoder currently does not support). I added a test case for that in https://github.com/metafacture/metafacture-core/commit/7b47a1c4822f41edb4ac58c7a13d83bd0a730b36. While it might make sense to add array support to JsonDecoder, I think this shows how versatile JsonPath support is here. So I vote for keeping this.

If we decide to keep, I'd like to get rid of the List<String> in JsonDecoder.process().

I pushed https://github.com/metafacture/metafacture-core/commit/b4e056bcd1dd45392cbca96f50a35d211586b2b1 to avoid wrapping the JSON in a list when not using JsonPath support. Is that what you meant? I'll open a PR for both these changes, so we can discuss any details there.

dr0i commented 3 years ago

Another aspect is, as you pointed out in a discussion today, that the json-path introduces another dependency to core. We have the plugin concept in metafacture to avoid these bloating dependencies, but that concept seems to be rarely used. Also I don't know what that would mean here - duplicating the JsonDecoder to https://github.com/metafacture/metafacture-json-plugin? Or moving it there, introducing an API break? I don't know.

blackwinter commented 3 years ago

At the same time, I'm using the JsonPath support for an experimental workflow to process data coming from an API returning a JSON array (which plain JsonDecoder currently does not support).

StringMatcher could potentially be used for preprocessing. But I see your point.

I added a test case for that in 7b47a1c.

It might be worthwhile to include the same test without the recordPath to illustrate the default behaviour.

So I vote for keeping this.

OK.

I pushed b4e056b to avoid wrapping the JSON in a list when not using JsonPath support. Is that what you meant?

Almost ;) Why does matches() have to return a list? The stream would suffice, wouldn't it?

fsteeg commented 3 years ago

Another aspect is, as you pointed out in a discussion today, that the json-path introduces another dependency to core. [...]

I don't think adding a dependency to metafacture-core is a problem per se, in particular since it has been modularized, and we only add the dependency to metafacture-json. What I meant was that if we had no use case at all, adding a feature that also introduces a dependency would be no good. But with the two different use cases we saw for using a JsonPath here, I think it's a useful addition, and worth adding a dependency.

fsteeg commented 3 years ago

Addressed comments by @blackwinter in https://github.com/metafacture/metafacture-core/commit/88d941f81f36a4ab9ea083a3ca663b91d610b8e3 and https://github.com/metafacture/metafacture-core/commit/7b978b67ccbe9acc00d02c8827cabb2a6c1616c0.