kindly / flatterer

Opinionated JSON to CSV/XLSX/SQLITE/PARQUET converter. Flattens JSON fast.
https://flatterer.opendata.coop
MIT License

Using on Databricks #51

Closed laukikpatil closed 10 months ago

laukikpatil commented 11 months ago

The library is able to parse JSON documents with no issues on Databricks. However, when I try to use the fields_csv and only_fields parameters, I get the error below.

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-a1667dce-2743-49e2-8f1e-82ca5c50677f/lib/python3.10/site-packages/flatterer/__init__.py:153, in flatten(input, output_dir, csv, xlsx, sqlite, parquet, dataframe, path, main_table_name, emit_obj, ndjson, json_stream, force, fields_csv, only_fields, tables_csv, only_tables, inline_one_to_one, schema, id_prefix, table_prefix, path_separator, schema_titles, sqlite_path, preview, threads, files, log_error, postgres, postgres_schema, drop, pushdown, sql_scripts, evolve, no_link, stats, low_disk, gzip_input, json_path, arrays_new_table)
    150 if s3:
    151     raise AttributeError("s3 output not available when supplying an iterator")
--> 153 iterator_flatten_rs(bytes_generator(input), output_dir, csv, xlsx, sqlite, parquet,
    154                     main_table_name, tables_csv, only_tables, fields_csv, only_fields,
    155                     inline_one_to_one, path_separator, preview,
    156                     table_prefix, id_prefix, emit_obj, force,
    157                     schema, schema_titles, sqlite_path, threads, log_error,
    158                     postgres, postgres_schema, drop, pushdown, sql_scripts, evolve,
    159                     no_link, stats, low_disk, gzip_input, json_path, arrays_new_table)
    160 else:
    161     raise AttributeError("input needs to be a string or a generator of strings, dicts or bytes")

RuntimeError: sending on a disconnected channel
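For context, here is a minimal sketch of the kind of call that exercises the `fields_csv`/`only_fields` path with a generator input. The data, file names, and the fields.csv column headings are all made up for illustration (compare the headings against a fields.csv that flatterer itself generates on a plain run); the `flatten` parameter names come from the signature in the traceback above.

```python
import csv
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())

# Hypothetical fields.csv restricting output to two fields.
# The column headings here are an assumption -- check them against a
# fields.csv produced by a run without only_fields before relying on them.
fields_csv = tmp / "fields.csv"
with open(fields_csv, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["table_name", "field_name", "field_type", "field_title"])
    writer.writerow(["main", "id", "text", "id"])
    writer.writerow(["main", "name", "text", "name"])

# A generator input, which the traceback suggests the reporter used.
def rows():
    yield {"id": 1, "name": "a", "extra": {"x": 1}}
    yield {"id": 2, "name": "b", "extra": {"x": 2}}

try:
    import flatterer
    flatterer.flatten(
        rows(),
        str(tmp / "out"),
        fields_csv=str(fields_csv),
        only_fields=True,
    )
except Exception:
    # flatterer may not be installed, or this call may reproduce the
    # reported error; the call shape is the point of the sketch.
    pass
```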

kindly commented 11 months ago

@laukikpatil

I imagine this issue is not just on Databricks, as there could be bugs with fields_csv and only_fields for certain inputs.

Could you provide as much of the following as possible? (I understand some of the data is private, so this may not all be possible.)

It will be very difficult to diagnose without these.

There are definitely cases where removing some fields via fields_csv and using only_fields means the resulting table structure no longer makes sense (i.e. when removing intermediate tables in the schema). These cases are unavoidable, but I am more concerned that the error message does not explain that this is what happened.

The above error says that the receiver thread died without really reporting why, which is not good. It looks like you are supplying some kind of Python iterator, e.g. a list or a generator. If possible, you could also try supplying a file instead (even just as an experiment), as files are likely to produce better error messages.
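As a concrete version of that experiment, the same records can be written to a newline-delimited JSON file and the path passed instead of an iterator. The file names here are hypothetical; the `ndjson` parameter is taken from the parameter list in the traceback above.

```python
import json
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())

# Write the records to disk as newline-delimited JSON...
data_file = tmp / "input.ndjson"
with open(data_file, "w") as f:
    for row in [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]:
        f.write(json.dumps(row) + "\n")

try:
    import flatterer
    # ...and pass the path rather than a generator; errors on this code
    # path tend to be clearer than "sending on a disconnected channel".
    flatterer.flatten(str(data_file), str(tmp / "out"), ndjson=True)
except Exception:
    # flatterer may not be installed in this environment; the file-based
    # call shape is the point of the sketch.
    pass
```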

kindly commented 10 months ago

Closing, as this is hard to diagnose without any further information.