uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.

Can ngram be used with external dataset along with make_batch_reader? #430

Open priyankaexp opened 5 years ago

priyankaexp commented 5 years ago

NGram requires Unischema fields as input. How can it be used with an external Parquet dataset?

priyankaexp commented 5 years ago

It seems the reader is looking for petastorm metadata, which is not available in this case.

As another approach, when reading via make_batch_reader, is it possible to get all records of the same feature in a single batch? Since petastorm reads one rowgroup per batch, I am not sure whether we can group records, write each group into the same rowgroup in Parquet, and read them back later. What do you think is the best solution in this case?

selitvin commented 5 years ago

Yep, ngrams would not work with make_batch_reader. You would need to write your own custom implementation that operates on each entire batch.
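To illustrate what such a custom implementation might look like, here is a minimal sketch in plain Python. It assumes each batch is a namedtuple of equal-length columns, which is the shape make_batch_reader yields. The helper name `batch_ngrams` and its parameters are hypothetical, not part of petastorm; `delta_threshold` and `timestamp_field` only borrow their names from NGram's arguments.

```python
from collections import namedtuple


def batch_ngrams(batch, length, timestamp_field, delta_threshold):
    """Yield windows of `length` rows taken from one columnar batch.

    Hypothetical helper, not a petastorm API. Rows are ordered by
    `timestamp_field`; a window is kept only when every pair of
    neighbouring rows is at most `delta_threshold` apart.
    """
    columns = batch._asdict()
    timestamps = columns[timestamp_field]
    # Row indices sorted by timestamp, so windows follow time order.
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    for start in range(len(order) - length + 1):
        window = order[start:start + length]
        gaps_ok = all(
            timestamps[b] - timestamps[a] <= delta_threshold
            for a, b in zip(window, window[1:])
        )
        if gaps_ok:
            # Convert the columnar batch into one dict per row of the window.
            yield [
                {name: values[i] for name, values in columns.items()}
                for i in window
            ]


if __name__ == "__main__":
    # Stand-in for a batch; a real one would come from make_batch_reader.
    Batch = namedtuple("Batch", ["ts", "value"])
    batch = Batch(ts=[3, 1, 2, 10], value=["c", "a", "b", "z"])
    for ngram in batch_ngrams(batch, length=2, timestamp_field="ts", delta_threshold=1):
        print(ngram)
```

Windows in this sketch never cross a batch boundary, so rows that belong together need to land in the same rowgroup.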

Grouping externally could work. Off the top of my head, you could partition your data by the criterion of interest. Each partition is stored in a separate directory (hence in separate files), which means each rowgroup will only contain rows with the same partition-by value.