uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

weighted_sampling_reader #770

Open weidezhang opened 2 years ago

weidezhang commented 2 years ago

One question I want to ask: why does the weighted sampling reader require all readers to have exactly the same schema? Can we modify it so that each reader can have a different schema?

weidezhang commented 2 years ago

Say I want reader A to have schema (a, b, c) while reader B has schema (a, b, c, d). Can they both be sampled together? The extra field in B will be used by an extra loss function defined in the network.

selitvin commented 2 years ago

It was designed that way to support the old TF reading style: the schema had to be known in advance to hook the reader into the TF graph. This might not be necessary when not reading from TF. If you'd like to propose a PR, I would be happy to take a look.
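
For readers following the thread, here is a minimal sketch of what schema-agnostic weighted sampling could look like outside of a TF graph. It is not petastorm's implementation and not the PR discussed here: `mixed_schema_sampler`, `RowA`, and `RowB` are hypothetical names, and plain Python lists of namedtuples stand in for real readers. The only point is that sampling by weight does not itself need the sources to share a schema.

```python
import random
from collections import namedtuple


def mixed_schema_sampler(readers, probabilities, seed=None):
    """Yield rows from several iterables, choosing the source of each row at random.

    Hypothetical helper (an assumption for illustration, not a petastorm API):
    it never compares the schemas of its sources.
    """
    if len(readers) != len(probabilities):
        raise ValueError('readers and probabilities must have the same length')
    rng = random.Random(seed)
    iterators = [iter(reader) for reader in readers]
    indices = list(range(len(iterators)))
    while True:
        # Pick a source index according to the given weights.
        i = rng.choices(indices, weights=probabilities, k=1)[0]
        try:
            yield next(iterators[i])
        except StopIteration:
            return  # stop as soon as the chosen source is exhausted


# Stand-ins for reader A (a, b, c) and reader B (a, b, c, d) from the question.
RowA = namedtuple('RowA', ['a', 'b', 'c'])
RowB = namedtuple('RowB', ['a', 'b', 'c', 'd'])
reader_a = [RowA(i, i, i) for i in range(100)]
reader_b = [RowB(i, i, i, i) for i in range(100)]

rows = list(mixed_schema_sampler([reader_a, reader_b], [0.5, 0.5], seed=0))

# Only rows that carry the extra field would feed the extra loss term.
rows_with_d = [row for row in rows if hasattr(row, 'd')]
```

With mixed schemas, the consumer has to handle the missing field itself, for example by checking for `d` per row as above. Batching rows of different shapes is a separate problem that this sketch does not address.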

weidezhang commented 2 years ago

Thanks for the confirmation. We will propose a PR later.