Open priyankaexp opened 5 years ago
It seems the reader is looking for Petastorm metadata, which is not available in this case.
As another approach: when reading via make_batch_reader, is it possible to get all records of the same feature in a single batch? Since Petastorm reads one rowgroup per batch, I am not sure whether we can group the records, write each group into its own rowgroup in Parquet, and read them back later. What do you think is the best solution in this case?
Yep, ngrams would not work with make_batch_reader. You would need to write your own custom implementation that operates on the entire batch.
Grouping externally could work. Off the top of my head, you could partition your data by the criterion of interest. Each partition is stored in a separate directory (hence in separate files), which means every rowgroup will contain rows with a single partition-by value.
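A minimal sketch of the "custom implementation on top of the entire batch" idea, in plain Python with no Petastorm calls. It assumes the batch has already been converted to a dict mapping column names to equal-length sequences (a make_batch_reader batch is a namedtuple, so batch._asdict() should give that shape). The function name ngrams_per_group and the column names feature_id, ts and value are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def ngrams_per_group(batch, key, order_by, n):
    """Build sliding windows of n consecutive rows for each value of `key`.

    batch    -- dict: column name -> sequence of values (all the same length)
    key      -- column whose value identifies a group (hypothetical: 'feature_id')
    order_by -- column used to order rows inside a group (hypothetical: 'ts')
    n        -- window length
    """
    # Collect the row indices that belong to each group value.
    groups = defaultdict(list)
    for i in range(len(batch[key])):
        groups[batch[key][i]].append(i)

    result = {}
    for group_value, indices in groups.items():
        # Order rows within the group, then materialize them as dicts.
        indices.sort(key=lambda i: batch[order_by][i])
        rows = [{col: values[i] for col, values in batch.items()} for i in indices]
        # Groups with fewer than n rows yield an empty list of windows.
        result[group_value] = [rows[s:s + n] for s in range(len(rows) - n + 1)]
    return result


# Example batch with two groups mixed together (made-up values).
example_batch = {
    'feature_id': ['a', 'b', 'a', 'a'],
    'ts': [3, 1, 1, 2],
    'value': [30, 10, 10, 20],
}
windows = ngrams_per_group(example_batch, key='feature_id', order_by='ts', n=2)
```

If the dataset is partitioned as described above, each batch should contain a single key value, and the grouping step simply becomes a no-op. Windows never span two batches in this sketch, so a group split across several rowgroups would lose the windows at the boundaries.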
NGram requires Unischema fields as input. How can it be made to work with an external Parquet dataset?