kindly / flatterer

Opinionated JSON to CSV/XLSX/SQLITE/PARQUET converter. Flattens JSON fast.
https://flatterer.opendata.coop
MIT License

Feedbacks and Issues #55

Closed: nikhil-aggarwal closed this issue 6 months ago

nikhil-aggarwal commented 1 year ago

While browsing, I stumbled upon this tool. My initial idea was to create something similar, but I figured I'd explore first to see if anyone had already tackled this. To my astonishment, this tool seems tailor-made for Data Engineers like us, addressing the complexities of reformatting. I wanted to extend my gratitude for developing this elegant solution and making it universally applicable. There are a few bugs and pieces of feedback I would like to share, in case they can be addressed:

  1. Issue --> I tried this with AWS Lambda; if I want my output as a dataframe, it still expects an output directory. If one is not provided, it uses the /tmp folder. With Lambda, it is not permitted to use /tmp like this. Workaround: I passed the path of a different temp directory to overcome this for now.
  2. Issue --> Pushdown feature: it works only when the JSON is stored as a JSON array; otherwise pushdown doesn't work.
  3. Feature --> Limit depth of iteration
  4. Feature --> Pass limited objects to consider
  5. Feature --> Exclude a few objects within the JSON
  6. Feature --> Link the file name in each table, so that it can be used as a key when we are appending data into the same directory
  7. Feature --> Output directory with a partitioning option for each table, so that I can keep adding data according to the defined partition strategy
  8. Feature --> More performant. I guess that when dataframe=True you are also using CSV as an intermediary step, which could be optimised.
  9. Issue --> UI does not work for big JSON; it does not show any result.

These are a few of my observations so far from using this tool. I am still figuring out this tool in one of our pipelines and will keep sharing my observations. Once again, thank you for putting effort into creating this.
kindly commented 1 year ago

Thanks for the feedback.

1. Issue --> I tried this with AWS Lambda; if I want my output as a dataframe, it still expects an output directory. If one is not provided, it uses the /tmp folder. With Lambda, it is not permitted to use /tmp like this. Workaround: I passed the path of a different temp directory to overcome this for now.

As the main program is written in Rust, it is difficult to insert into dataframes directly, so some intermediate representation needs to be stored somewhere, like a CSV. If /tmp gives an error, I could possibly try a temp file in the current directory instead.
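In the meantime, a rough sketch of the workaround of passing your own scratch directory; the Python `flatten(input, output_dir, dataframe=True)` call and the shape of its return value shown here are assumptions worth checking against the docs:

```python
import tempfile

import flatterer

# Point flatterer at a scratch directory that is known to be writable in your
# environment, rather than letting it fall back to its own temp location.
with tempfile.TemporaryDirectory(dir="/var/scratch") as scratch:  # hypothetical writable path
    output = flatterer.flatten("input.json", scratch, dataframe=True)
    main_df = output["data"]["main"]  # assumed: dict of pandas DataFrames keyed by table name
```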

2. Issue --> Pushdown feature: it works only when the JSON is stored as a JSON array; otherwise pushdown doesn't work.

I am surprised this does not work for all inputs. Could you possibly give an example of where this does not work?

3. Feature --> Limit depth of iteration

This should be fairly easy to add.

4. Feature --> Pass limited objects to consider

I am not sure exactly what you mean by this.

There are a couple of options related to this, though.

5. Feature --> Exclude a few objects within the JSON

Supplying an iterator and manipulating the input JSON is one way to achieve this. Also, fields.csv and tables.csv can be supplied to limit the output tables.

6. Feature --> Link the file name in each table, so that it can be used as a key when we are appending data into the same directory

Good idea. I have done this by using the iterator flatten, but a feature to do this makes sense.
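For reference, a rough sketch of that iterator approach, tagging every object with the file it came from; `iterator_flatten` and the `source_file` field name are assumptions here, so check the Python docs for the exact call:

```python
import json

import flatterer

def objects_with_source(paths):
    # Yield every object from every input file, attaching the file name so the
    # flattened output carries a column that can act as a key when appending runs.
    for path in paths:
        with open(path) as f:
            for obj in json.load(f):
                obj["source_file"] = path  # hypothetical field name
                yield obj

# hypothetical input files appended into the same output directory
flatterer.iterator_flatten(objects_with_source(["2023-01.json", "2023-02.json"]), "output_dir")
```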

7. Feature --> Output directory with a partitioning option for each table, so that I can keep adding data according to the defined partition strategy

This makes sense, especially for Parquet files. I have been looking at various partitioning schemes for data lakes and was hoping to use an established one at some point, but the maturity of the Rust implementations was not great when I looked.

8. Feature --> More performant. I guess that when dataframe=True you are also using CSV as an intermediary step, which could be optimised.

As it is written in Rust, it is hard to insert straight into a dataframe; I considered polars for this. There is a memory option in the Rust library that puts everything into memory, but I have not looked at the cost of transferring large amounts of memory back to Python.

9. Issue --> UI does not work for big JSON; it does not show any result.
   These are a few of my observations so far from using this tool. I am still figuring out this tool in one of our pipelines and will keep sharing my observations.

The website mentions it is only for small data, as it is WebAssembly (WASM) based, so it runs entirely in the browser (and in memory). I have been looking for ways to store the data on disk to handle larger files, but the web APIs are fairly restrictive.

nikhil-aggarwal commented 1 year ago

Thanks for the feedback.

1. Issue --> I tried this with AWS Lambda; if I want my output as a dataframe, it still expects an output directory. If one is not provided, it uses the /tmp folder. With Lambda, it is not permitted to use /tmp like this. Workaround: I passed the path of a different temp directory to overcome this for now.

As the main program is written in Rust, it is difficult to insert into dataframes directly, so some intermediate representation needs to be stored somewhere, like a CSV. If /tmp gives an error, I could possibly try a temp file in the current directory instead.

2. Issue --> Pushdown feature: it works only when the JSON is stored as a JSON array; otherwise pushdown doesn't work.

I am surprised this does not work for all inputs. Could you possibly give an example of where this does not work?

Nikhil --> I need to pass the JSON as an array, as below, instead of a plain JSON object within {}:

[{ "id": 1, "title": "A Film", "type": "film" }]

instead of

{ "id": 1, "title": "A Film", "type": "film" }
3. Feature --> Limit depth of iteration

This should be fairly easy to add.

Nikhil --> Is it? That would open the door to more use cases.

4. Feature --> Pass limited objects to consider

I am not sure exactly what you mean by this.

There are a couple of options related to this, though.

Nikhil --> Just checked the JSONPath filter option and this looks interesting. Do we also have an option to use a delimiter and object name directly? Like type = 'film' or address.type = 'home'?

5. Feature --> Exclude a few objects within the JSON

Supplying an iterator and manipulating the input JSON is one way to achieve this. Also, fields.csv and tables.csv can be supplied to limit the output tables.

Nikhil --> Assume I want to ignore a single field in the JSON, which could be inside a nested object. Could there be a way to pass such fields as an array so that they get excluded from the final result?
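Until something built in exists, the iterator route mentioned above can approximate this; a rough sketch in which the field paths, helper names and the `iterator_flatten` call are all assumptions:

```python
import json

import flatterer

EXCLUDE = [("address", "type"), ("internal_id",)]  # hypothetical nested field paths

def drop_fields(obj, paths):
    # Remove each (possibly nested) field path from a dict before flattening.
    for path in paths:
        target = obj
        for key in path[:-1]:
            if not isinstance(target, dict):
                break
            target = target.get(key, {})
        if isinstance(target, dict):
            target.pop(path[-1], None)
    return obj

def cleaned(path):
    # Stream the input array, stripping the excluded fields from every object.
    with open(path) as f:
        for obj in json.load(f):
            yield drop_fields(obj, EXCLUDE)

flatterer.iterator_flatten(cleaned("input.json"), "output_dir")
```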

6. Feature --> Link the file name in each table, so that it can be used as a key when we are appending data into the same directory

Good idea. I have done this by using the iterator flatten, but a feature to do this makes sense.

7. Feature --> Output directory with a partitioning option for each table, so that I can keep adding data according to the defined partition strategy

This makes sense, especially for Parquet files. I have been looking at various partitioning schemes for data lakes and was hoping to use an established one at some point, but the maturity of the Rust implementations was not great when I looked.

8. Feature --> More performant. I guess that when dataframe=True you are also using CSV as an intermediary step, which could be optimised.

As it is written in Rust, it is hard to insert straight into a dataframe; I considered polars for this. There is a memory option in the Rust library that puts everything into memory, but I have not looked at the cost of transferring large amounts of memory back to Python.

9. Issue --> UI does not work for big JSON; it does not show any result.
   These are a few of my observations so far from using this tool. I am still figuring out this tool in one of our pipelines and will keep sharing my observations.

The website mentions it is only for small data, as it is WebAssembly (WASM) based, so it runs entirely in the browser (and in memory). I have been looking for ways to store the data on disk to handle larger files, but the web APIs are fairly restrictive.

I appreciate your quick responses to the suggestions. Are you using a tracker for all the new features/enhancements for this?

nikhil-aggarwal commented 1 year ago

Feature - Can we generate SQL/dataframe code for linking together all the tables that are generated using this framework?
This could add a lot of value and help simplify analysis.
I tried creating a UI which shows the linking between all the tables generated using this framework (in my case there were 260+), and it was interesting to demo, as people were able to visualise the complexity involved in the JSON we were working with.
The SQL was generated dynamically (not complete, for demo purposes only), which I think could be generalised and added to this tool too.
Attaching for reference. (Click 'Categorise', then select nodes; they will be marked in green; then click 'Generate SQL'.)
[UI.zip](https://github.com/kindly/flatterer/files/13248051/UI.zip)
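For anyone who wants to experiment with this idea, a small sketch against flatterer's SQLite output; it relies on the `_link` / `_link_<parent table>` key columns flatterer writes into each table, while the table names and database path are hypothetical:

```python
import sqlite3

# Join a child table back to its parent using the key columns flatterer adds
# to its output: "_link" on every row and a "_link_<parent table>" column on
# child rows. Table names and the database path below are hypothetical.
conn = sqlite3.connect("output_dir/sqlite.db")
query = """
    SELECT main._link, main.title, platforms.name
    FROM main
    JOIN platforms ON platforms._link_main = main._link
"""
for row in conn.execute(query):
    print(row)
```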
kindly commented 6 months ago

Going to close this issue. Happy for you to add any individual feature requests as new issues.