kindly / flatterer

Opinionated JSON to CSV/XLSX/SQLITE/PARQUET converter. Flattens JSON fast.
https://flatterer.opendata.coop
MIT License

Using on Databricks #51

Closed laukikpatil closed 10 months ago

laukikpatil commented 11 months ago

The library is able to parse JSON documents with no issues on Databricks. However, when I try to use the fields_csv and only_fields parameters, I get the error below.

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a1667dce-2743-49e2-8f1e-82ca5c50677f/lib/python3.10/site-packages/flatterer/__init__.py:153, in flatten(input, output_dir, csv, xlsx, sqlite, parquet, dataframe, path, main_table_name, emit_obj, ndjson, json_stream, force, fields_csv, only_fields, tables_csv, only_tables, inline_one_to_one, schema, id_prefix, table_prefix, path_separator, schema_titles, sqlite_path, preview, threads, files, log_error, postgres, postgres_schema, drop, pushdown, sql_scripts, evolve, no_link, stats, low_disk, gzip_input, json_path, arrays_new_table)
    150 if s3:
    151     raise AttributeError("s3 output not available when supplying an iterator")
--> 153 iterator_flatten_rs(bytes_generator(input), output_dir, csv, xlsx, sqlite, parquet,
    154                     main_table_name, tables_csv, only_tables, fields_csv, only_fields,
    155                     inline_one_to_one, path_separator, preview,
    156                     table_prefix, id_prefix, emit_obj, force,
    157                     schema, schema_titles, sqlite_path, threads, log_error,
    158                     postgres, postgres_schema, drop, pushdown, sql_scripts, evolve,
    159                     no_link, stats, low_disk, gzip_input, json_path, arrays_new_table)
    160 else:
    161     raise AttributeError("input needs to be a string or a generator of strings, dicts or bytes")

RuntimeError: sending on a disconnected channel
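For context, here is a minimal sketch of the kind of call that exercises the `fields_csv`/`only_fields` path with a generator input. The data, file names, and the fields.csv column headings are all made up for illustration (compare the headings against a fields.csv that flatterer itself generates on a plain run); the `flatten` parameter names come from the signature in the traceback above.

```python
import csv
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())

# Hypothetical fields.csv restricting output to two fields.
# The column headings here are an assumption -- check them against a
# fields.csv produced by a run without only_fields before relying on them.
fields_csv = tmp / "fields.csv"
with open(fields_csv, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["table_name", "field_name", "field_type", "field_title"])
    writer.writerow(["main", "id", "text", "id"])
    writer.writerow(["main", "name", "text", "name"])

# A generator input, which the traceback suggests the reporter used.
def rows():
    yield {"id": 1, "name": "a", "extra": {"x": 1}}
    yield {"id": 2, "name": "b", "extra": {"x": 2}}

try:
    import flatterer
    flatterer.flatten(
        rows(),
        str(tmp / "out"),
        fields_csv=str(fields_csv),
        only_fields=True,
    )
except Exception:
    # flatterer may not be installed, or this call may reproduce the
    # reported error; the call shape is the point of the sketch.
    pass
```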

kindly commented 11 months ago

@laukikpatil

I imagine this issue is not just on Databricks, as there could be bugs with fields_csv and only_fields for certain inputs.

Could you provide as much of the following as possible? (I understand some of the data is private, so this may not all be possible.)

It will be very difficult to diagnose without these.

There are definitely cases where removing some fields via fields_csv and using only_fields means the resulting table structure no longer makes sense (i.e. when removing intermediate tables in the schema). These cases are unavoidable, but I am more concerned that the error message does not explain that this is what happened.

The above error says that the receiver thread died without really reporting why, which is not good. It looks like you are supplying some kind of Python iterator, e.g. a list or a generator. If possible, you could also try supplying a file instead (even just as an experiment), as files are likely to produce better error messages.
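As a concrete version of that experiment, the same records can be written to a newline-delimited JSON file and the path passed instead of an iterator. The file names here are hypothetical; the `ndjson` parameter is taken from the parameter list in the traceback above.

```python
import json
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())

# Write the records to disk as newline-delimited JSON...
data_file = tmp / "input.ndjson"
with open(data_file, "w") as f:
    for row in [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]:
        f.write(json.dumps(row) + "\n")

try:
    import flatterer
    # ...and pass the path rather than a generator; errors on this code
    # path tend to be clearer than "sending on a disconnected channel".
    flatterer.flatten(str(data_file), str(tmp / "out"), ndjson=True)
except Exception:
    # flatterer may not be installed in this environment; the file-based
    # call shape is the point of the sketch.
    pass
```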

kindly commented 10 months ago

Closing, as this is hard to diagnose without any further information.