Support for hiearchical (JSON, XML) data sources

mielvds commented 3 years ago

Is your feature request related to a problem? Please describe. Most datasets I deal with are XML or JSON based. In contrast to tabular data, "cells" can be nested, possibly with a key. Preprocessing these files into tables would be an option, but that would make validation cumbersome and unintuitive. Expectations that are specifically designed for such data (eg. "expect values in array to be in set", "expect object to have key", or "expect array length to be between 2 and 5") would be hard to implement. I can validate a cell against a JSON schema, which is extremely useful. But that would offload everything to a single, unextendable expectation, which is notation-specific. (BTW, it would be great if XSD/RelaxNG could be added, but I'll put that into another feature request)

Even if we keep the execution engines strictly tabular, the complexity of unpacking non-tabular data could be hidden in a Datasource implementation, which would already help a lot.

Describe the solution you'd like There are many ways to approach this, depending on what the ambition is. The main feature is being able to unwrap values from a JSON or XML string, preferably using the XPath and JSONPath languages that already exist. A concrete example would be:

{
  "firstName": "John",
  "lastName" : "doe",
  "age"      : 26,
  "address"  : {
    "streetAddress": "naist street",
    "city"         : "Nara",
    "postalCode"   : "630-0192"
  },
  "phoneNumbers": [
    {
      "type"  : "iPhone",
      "number": "0123-4567-8888"
    },
    {
      "type"  : "home",
      "number": "0123-4567-8910"
    }
  ]
}

# '$.phoneNumbers[:1].type' returns "iphone" in the example above

expect_column_values_to_be_in_set(column="$.phoneNumbers[:1].type", value_set=["iphone"])
expect_column_values_to_match_regex(column="$.phoneNumbers[*].number", regex="^(\([0-9]{3}\)|[0-9]{3}-)[0-9]{3}-[0-9]{4}$" )

Comments:

the name "column" would be inaccurate, but not problematic
the path expressions could also served to the Datasource constructor which can use them to for example mimick a PandasDatasource is that would make implementation simpler.

Another issue with hiearchical data formats is that they are not row-based and thus hard to process efficiently (unless we restrict ourselves to formats like https://jsonlines.org/ or http://ndjson.org/). To solve this, I can suggest adding an extra Path expression which defines an iterable subset of the data beforehand (eg. at Datasource creation time).

For example, if we are only interested in validating the phone numbers form the example above, we could create a batch kwargs from "$.phoneNumbers[*]" and the path expessions in the expectations would become relative

expect_column_values_to_match_regex(column="$.number", regex="^(\([0-9]{3}\)|[0-9]{3}-)[0-9]{3}-[0-9]{4}$" )

I'm not sure how this would fit in the current architecture and what the performance implications would be. But let's discuss and see

eugmandel commented 3 years ago

Thank you for filing this @mielvds! We are not planning an immediate action on this enhancement, but want to keep it as a place for the community to discuss this.

github-actions[bot] commented 3 years ago

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?\n\nThis issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

DCastile commented 2 years ago

Looks like I'm late to the party but I can say that this functionality would be very useful for my teams work as well. We have highly nested data and need to make validations from it. Questions like does this nested array field have at least 1 entry in it. To the original authors point there are ways around this, but each require transforming the data into a shape that allows asking the question that way which could quickly get out of hand

mielvds commented 2 years ago

Hi @DCastile , good to hear there are other who see value in combining GE with nested data. Unfortunately, this doesn't seem to attract much intrest in the rest of the community...which is understandable, GE is still quite young and is focused on ML datasets.

This is still essential for my organization though, so we've parked adopting GE until this gets some traction. Luckily, Pandas now natively supports XML, which is ok as a work-around, but like you say, not ideal or sustainable.

ricardogaspar2 commented 2 years ago

Hi @mielvds 👋 I've participated in the last Superconductive meeting about GE roadmap, and one of the questions that arose was nested data. It would be great, I'd even risk saying critical, for GE growth to support nested data.

As you probably are aware, there are many new tools within the Big Data ecosystem that have support for nested data (even that for some is rather limited). One of the things that sparked my interest in GE in the early days was to solve the issue of doing Data Quality validations on Big Data. Currently, there are multiple file formats that support nested data (Parquet, ORC, Delta, etc.) and some data warehouse solutions support it too. In my company, we do data processing using scala + Spark and we use a lot of nested data (due to the ecossystem) - following a OOP paradigm. I'm keen to see GE supporting nested data.

mielvds commented 2 years ago

Hi @ricardogaspar2 That's cool! I'll follow-up on GE's progress a bit more this year.

Indeed, there are many tools with support, but AFAIK, and please correct me if I'm wrong, this is limited to the input. After the data is read, it is eventually turned into some kind of table. This simplifies processing and is therefore good for perfomance. However, you have to be aware of the the resulting data model before you can do any meaningful processing or retrieve the right set of values. And I'm pretty sure that Parquet, ORC, Delta, etc. and the tools in the Big Data ecosystem have subtle differences when it comes to this tabular data model. For a framework like GE, who wants to keep refering to data as simple as column=..., this is a potential UX nightmare :)

So nested /hierachical data is a tricky thing; a good solution finds the right balance between

a generic and predictable way to select values from nested data (hence my referral to XPath and JSONPath) that you can use when defining expectations;
maintain performance/scalability, and leverage the underlying bigdata frameworks Spark, pandas, postgres,... in an optimal way

ricardogaspar2 commented 2 years ago

Thanks for the comment @mielvds. Yeah, tabular format doesn't mean flattened. Both formats (JSON, Parquet, ORC) are read as an input by an engine like Spark for example and represented as dataframes in a tabular format, but still with nested data structures, meaning that each column can be complex data type.

Recently Spark even added a new syntax to ease some manipulations with nested data. See: https://medium.com/@fqaiser94/manipulating-nested-data-just-got-easier-in-apache-spark-3-1-1-f88bc9003827

That said, in my view, it would be natural for GE to support nested data in the future. Other tools are headed in that way as well; they need to.

I don't see it as big issue for the UI. It's challenging but feasible. When showing the first level of the table, other tools (like Presto) represent the data in nested fields as JSONs, even spark does that.

DCastile commented 2 years ago

I concur with @ricardogaspar2

Support for nested structures is native in all of the backends with the exception of sqlite, so I imagine this comes down to getting the time to implement. The differences in the original file format don't matter to GE it only interacts with the backend.

Further I think the semantics could be simple using some extension of dot notation. For simple things we can use car.make to access the make key in the car column, for more complex things that might have multiplicity due to a nested array you could extend that convention to something like cars.#.make to specify that you want the make from each car in the array.

I see this issue is closed @ricardogaspar2 @eugmandel but it seems like it might be in discussion

mielvds commented 2 years ago

I didn't mean flattened, but the top level is basically a table (dataframe) and the cells can be nested in some specific way. But I think we're in agreement here.

I'm also not worried about the UI, I just wonder about what path syntax to use when declaring expectations. A dot syntax is ok for simple cases, but when you want to, for instance, address attributes in XML, it already gets tricky (granted, I dont't think XML should be a priority here). My point is that path expressions to hierachical data are easily underestimated and highly depend on how complex you want paths to be (I agree we should start simple). Also, there is a lot existing stuff already out there, like Postgres JSON operators, JSONPath or the approach @ricardogaspar2 just shared (most of which tailored to the format, because coming up with something expressive and generic is difficult). No need to reinvent the wheel or not to be inspired by it.

Anyway, reopening this issue would be a first start and let's see how the GE thinks about this.

great-expectations / great_expectations

Support for hiearchical (JSON, XML) data sources #2231