Open miguelusque opened 3 months ago
Hi @miguelusque ,
Thanks for the feature request. This would be a superset of the pandas read_json API, correct? Other dataframe-like tools such as Spark have similar functionality (e.g. input_file_name()), but I think we want to consider our API carefully any time we are creating string columns in a limited-memory environment.
Thank you @miguelusque for opening up a follow-on issue about this topic. @shrshi, you've had a lot of success in the multi-source improvements you added to https://github.com/rapidsai/cudf/pull/15930. Would you please share your thoughts about the scope for optional source index tracking as an item for future work?
Hi @brandon-b-miller ,
Indeed. That would be a new feature not present in Pandas.
Please let me mention that when we discussed this feature internally, I think we agreed that the most efficient approach was to add only a column containing the indexes corresponding to the files passed to the read_json method, and let the user concatenate the file names if needed.
I am happy with a more elaborate API, where you can decide whether to add the file names or the file name indexes. The minimum request from our side is to have at least the file name indexes, so that we can generate the column with the names ourselves.
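The minimal variant described above (an index column only) keeps string data out of the reader. A quick sketch of the user-side lookup, using pandas for illustration (the same pattern applies to cudf); the column name `source_index` is an assumption, not an existing cudf API:

```python
import pandas as pd

# Suppose the reader returned an integer column of input-file indexes
# (hypothetical "source_index" column) alongside the data.
files = ["a.jsonl", "b.jsonl"]
df = pd.DataFrame({"x": [1, 2, 3], "source_index": [0, 0, 1]})

# Mapping indexes back to file names is a cheap dictionary lookup
# the user can do only when actually needed.
df["source_file"] = df["source_index"].map(dict(enumerate(files)))
```

This is why index tracking alone suffices: the name column can always be materialized later, and only by users who need it.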
@miguelusque could it also work to expose a metadata item that includes the row count per data source?
Hi Gregory, it would depend on how much it would cost, in terms of performance, to reconstruct the dataframe in the desired format. If there is an efficient way to do it from the metadata, that is fine with me.
We have a similar request for the Parquet reader at #15389. We are thinking of adding a vector to table_metadata
reporting the number of rows read from each data source, unless AST row selection filters are being used, in which case an empty vector is returned due to the added computational overhead. @karthikeyann @shrshi
Hi!
cudf.read_json supports passing multiple files, which is much more performant than reading JSON files individually and then merging them. It would be very useful for certain workloads to add a column containing the index of the input file each row in the dataset corresponds to, among the files passed to the cudf.read_json method.
I would suggest adding a new input parameter, named something similar to input_file_indexes_series_name, with a default value of None. When populated with a string, it would indicate that the indexes of the input files passed to cudf.read_json should be added to a column named as detailed in the input_file_indexes_series_name parameter.
Already discussed with @vuule.
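The behavior proposed above can be approximated today by reading each file separately and tagging the parts before concatenating, which is exactly the slow path this FR would avoid. A sketch using pandas with in-memory data (cudf exposes the same read_json/concat API); the `file_index` column name is just an illustration:

```python
import io
import pandas as pd

# Two small JSON-lines "files" held in memory for the example.
files = {"part-0.jsonl": '{"x": 1}\n{"x": 2}\n',
         "part-1.jsonl": '{"x": 3}\n'}

parts = []
for idx, (name, data) in enumerate(files.items()):
    # One read per file: this per-file loop is the overhead the
    # multi-source reader avoids, but it lets us tag each row.
    part = pd.read_json(io.StringIO(data), lines=True)
    part["file_index"] = idx
    parts.append(part)

df = pd.concat(parts, ignore_index=True)
```

With the proposed parameter, a single multi-source read_json call would produce the same tagged result without giving up the multi-source performance benefit.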
Thanks!
P.S.: NeMo Curator may benefit from this FR when performing Exact Deduplication, Fuzzy Deduplication, and Download and Extract corpus features.
P.S.: This FR was originally raised here.