jf-tech / omniparser

omniparser: a native Golang ETL streaming parser and transform library for CSV, JSON, XML, EDI, text, etc.
MIT License
931 stars 68 forks source link

Flatten JSON file #161

Closed andrey-moor closed 2 years ago

andrey-moor commented 2 years ago

Hi @jf-tech ,

is it possible to create a schema for flattening JSON object using current version? E.g.:

    "publishers": [
        {
            "name": "Scholastic Press",
            "books": [
                {
                    "title": "Harry Potter and the Philosopher's Stone",
                    "price": 9.99,
                    "author": "J. K. Rowling",
                    "year": 1997
                },
                {
                    "title": "Harry Potter and the Chamber of Secrets",
                    "price": 10.99,
                    "author": "J. K. Rowling",
                    "year": 1998
                }
            ]
        },
        {
            "name": "Harper & Brothers",
            "books": [
                {
                    "title": "Goodnight Moon",
                    "price": 5.99,
                    "author": "Margaret Wise Brown",
                    "year": 1947
                },
            ]
        }
    ]
}

to

[
  {
    "publisher_name": "Scholastic Press",
    "title": "Harry Potter and the Philosopher's Stone",
    "price": 9.99,
    "author": "J. K. Rowling",
    "year": 1997
  },
  {
    "publisher_name": "Scholastic Press",
    "title": "Harry Potter and the Chamber of Secrets",
    "price": 10.99,
    "author": "J. K. Rowling",
    "year": 1998
  },
  {
    "publisher_name": "Harper & Brothers",
    "title": "Goodnight Moon",
    "price": 5.99,
    "author": "Margaret Wise Brown",
    "year": 1947
  }
]
jf-tech commented 2 years ago

@andrey-moor Yes.

If you use the sample json input at: https://github.com/jf-tech/omniparser/blob/master/extensions/omniv21/samples/json/2_multiple_objects.input.json, you can use the following schema to flatten it:

{
    "parser_settings": {
        "version": "omni.2.1",
        "file_format_type": "json"
    },
    "transform_declarations": {
        "FINAL_OUTPUT": { "xpath": "/publishers/*/books/*", "object": {
            "publisher_name": { "xpath": "../../name" },
            "author": { "xpath": "author" },
            "year": { "xpath": "year", "type": "int" },
            "price": { "xpath": "price", "type": "float" },
            "title": { "xpath": "title" }
        }}
    }
}

Now if you want to filter out the book (in this case, "Brown Bear, Brown Bear, What Do You See") that doesn't have author field, you can make a slight adjustment to the xpath on the FINAL_OUTPUT by adding a filter clause:

{
    "parser_settings": {
        "version": "omni.2.1",
        "file_format_type": "json"
    },
    "transform_declarations": {
        "FINAL_OUTPUT": { "xpath": "/publishers/*/books/*[author!='']", "object": {
            "publisher_name": { "xpath": "../../name" },
            "author": { "xpath": "author" },
            "year": { "xpath": "year", "type": "int" },
            "price": { "xpath": "price", "type": "float" },
            "title": { "xpath": "title" }
        }}
    }
}

You can try the schema out on https://omniparser.herokuapp.com/ (need to wait for a few seconds before the heroku instance boots up), by selecting "json/2_multiple_objects" and then cut/paste the two different versions of schema I included above.

Let me know if you have any additional questions.

andrey-moor commented 2 years ago

Great, exactly what I needed, thanks for the quick response! I have one more question, but I'll create another issue a bit later.