rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Add a column with filenames index in cudf.read_json #15960

Open miguelusque opened 3 months ago

miguelusque commented 3 months ago

Hi!

cudf.read_json supports passing multiple files, which is much more performant than reading JSON files individually and then concatenating the results.

It would be very useful for certain workloads if cudf.read_json could add a column containing, for each row, the index of the input file it came from.

I would suggest adding a new input parameter, named something like input_file_indexes_series_name, with a default value of None. When set to a string, the indexes of the input files passed to cudf.read_json would be added to a column with that name.
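A sketch of the requested behavior, emulated with pandas (the proposed input_file_indexes_series_name parameter does not exist in cudf or pandas; "file_index" is a hypothetical column name):

```python
# Emulate the requested read_json behavior: read several JSON Lines
# sources and tag each row with the index of the source it came from.
import io
import pandas as pd

sources = [
    io.StringIO('{"a": 1}\n{"a": 2}\n'),  # stands in for input file 0
    io.StringIO('{"a": 3}\n'),            # stands in for input file 1
]

parts = []
for idx, src in enumerate(sources):
    part = pd.read_json(src, lines=True)
    part["file_index"] = idx  # the column this FR asks the reader to emit
    parts.append(part)

df = pd.concat(parts, ignore_index=True)
print(df["file_index"].tolist())  # [0, 0, 1]
```

The point of doing this inside the reader, rather than in a loop like the one above, is to keep the single multi-source read path and its performance.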

Already discussed with @vuule.

Thanks!

P.S.: NeMo Curator may benefit from this FR when performing Exact Deduplication, Fuzzy Deduplication, and Download and Extract corpus features.

P.S.: This FR was originally raised here.

brandon-b-miller commented 3 months ago

Hi @miguelusque, thanks for the feature request. This would be a superset of the pandas read_json API, correct? Other dataframe-like tools, such as Spark, have similar functionality (e.g. input_file_name()), but I think we want to consider our API carefully any time we are creating string columns in a limited-memory environment.

GregoryKimball commented 3 months ago

Thank you @miguelusque for opening up a follow-on issue about this topic. @shrshi, you've had a lot of success in the multi-source improvements you added to https://github.com/rapidsai/cudf/pull/15930. Would you please share your thoughts about the scope for optional source index tracking as an item for future work?

miguelusque commented 3 months ago

> Hi @miguelusque, thanks for the feature request. This would be a superset of the pandas read_json API, correct? Other dataframe-like tools, such as Spark, have similar functionality (e.g. input_file_name()), but I think we want to consider our API carefully any time we are creating string columns in a limited-memory environment.

Hi @brandon-b-miller ,

Indeed, that would be a new feature not present in pandas.

Please let me mention that when we discussed this feature internally, I think we agreed that the most efficient approach was to add only a column containing the indexes of the files passed to the read_json method, and to let the user map those indexes to file names if needed.

I am happy with a more elaborate API, where you can decide whether to add the file names or the file name indexes. The minimum request from our side is to have at least the file name indexes, so that we can generate the column with the names ourselves.
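Generating the names column from the index column is cheap on the user side; a sketch with pandas (the "file_index" column and the file list are hypothetical). A categorical (dictionary-encoded) column stores each path string only once, which also speaks to the memory concern about per-row string columns:

```python
import pandas as pd

# Hypothetical reader output: a data column plus the file-index column,
# and the list of paths that was passed to read_json.
files = ["part-000.jsonl", "part-001.jsonl"]
df = pd.DataFrame({"a": [1, 2, 3], "file_index": [0, 0, 1]})

# Map indexes to file names without materializing one string per row:
# the codes are the file indexes, the categories are the paths.
df["file_name"] = pd.Categorical.from_codes(df["file_index"], categories=files)
print(df["file_name"].tolist())
# ['part-000.jsonl', 'part-000.jsonl', 'part-001.jsonl']
```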

GregoryKimball commented 3 months ago

@miguelusque could it also work to expose a metadata item that includes the row count per data source?

miguelusque commented 3 months ago

Hi Gregory, it would depend on how much it would cost, in terms of performance, to reconstruct the dataframe in the desired format. If there is an efficient way to do it from the metadata, that is fine with me.

mhaseeb123 commented 3 months ago

We have a similar request for the Parquet reader at #15389. We are thinking of adding a vector to table_metadata reporting the number of rows read from each data source, unless AST row-selection filters are being used, in which case an empty vector is returned due to the added computational overhead. @karthikeyann @shrshi