kindly / flatterer

Opinionated JSON to CSV/XLSX/SQLITE/PARQUET converter. Flattens JSON fast.
https://flatterer.opendata.coop
MIT License
180 stars 7 forks source link

Add an option, so lists of strings generate their own table? #49

Closed dslinger-a2 closed 1 year ago

dslinger-a2 commented 1 year ago

I found a potential issue when there is a list within a dictionary/object. I would think that list would get flattened as well but doesn't seem to be. Using flatterer v0.19.7 and comes up with both python v3.10.7 and v3.11.1 (on Windows 11 and in debian 11). I understand that this issue could be very complicated but wanted to bring up what I found. Here's two example json files to test with. Example where I would expect that summative_openEndedQuestions would become its own table:

[
  {
    "id": 1,
    "title": "A Game",
    "summative": {"openEndedQuestions": []}
  },
  {
    "id": 2,
    "title": "B Game",
    "summative": {"openEndedQuestions": [
      "abc", "def"
    ]}
  }
]

That results in a just a main table/csv/parquet.

And a further nested example where I would expect summative_openEndedQuestions_answers to become its own table:

[
  {
    "id": 1,
    "title": "A Game",
    "summative": {"openEndedQuestions": [
      {"answers": []}
    ]}
  },
  {
    "id": 2,
    "title": "B Game",
    "summative": {"openEndedQuestions": [
      {"answers": ["abc", "def"]}
    ]}
  }
]

That results in two tables/csv/parquet. main and summative_openEndedQuestions.

kindly commented 1 year ago

This is by design. Lists of strings do not generate their own table, only lists of objects. Lists of strings should generate comma-delimited lists of the strings of the items for that key.

A separate table was considered, but was produced by too many separate tables for the use cases I came across.

I would consider adding an option to change this behaviour if that is useful?

dslinger-a2 commented 1 year ago

Thank you for considering that! Adding that option would definitely help me! Even though that does produce a lot of tables, it's helpful that the results are consistent if one day I get that column with all empty lists (and thus not even having a column generated) and then another day that column does have some values.

kindly commented 1 year ago

@dslinger-a2

Have you thought about supplying a fields file? https://flatterer.opendata.coop/options.html#fields-file

You can just use a file that is produced by one of your outputs which has all the needed fields in.

If you provide a field in that file, it will automatically be created, regardless if it is empty in the data. The point of it is to help make sure that you get constant results when you run the command again, including field order.

dslinger-a2 commented 1 year ago

@kindly Oh! I didn't realize that the fields file could force a column to exist. I mostly thought it limited columns. That does sound like it would fix my issue! Also, just for clarification, it's anything that isn't an object that becomes a comma delimited field, right? Like a list of integers would also be treated the same.

kindly commented 1 year ago

@dslinger-a2

Not exactly, only arrays where all items are strings are treated that way.

For all other arrays, it falls back to just the JSON representation of the array in that field, so a list of integers would look like [1,2,3,4]. But it might make sense for it actually to be 1,2,3,4 as you said, I had not thought of that. The reason for the fallback to JSON is that it is basically impossible to cover all cases, as arrays do not have to have the same type. So for example an array could be ["hi", 1, null, [1], {"key": "value"}] and in that case who knows what to do! So I defaulted to just dumping the JSON itself as a fallback for all cases I do not cover.

dslinger-a2 commented 1 year ago

@kindly Totally understand that decision! In that case, I think it might be better on my end if there was an option to have all lists always become their own table. That seems both simplest and most consistent. I'm flattening to csvs at the moment and just copying everything into Redshift as text. So a mix of strings and ints would be treated just fine in their own table.

kindly commented 1 year ago

@dslinger-a2 just released a version (0.19.8) with a new option arrays as table:

https://flatterer.opendata.coop/options.html#arrays-as-table

Which I think covers what you need. It has to create a new column name which I just called value to hold the value of the items in the list.

dslinger-a2 commented 1 year ago

@kindly Wonderful! Thanks so much!!