uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Raise an explicit error when TransformSpec is given a shape with a variable dimension #634

Closed selitvin closed 3 years ago

selitvin commented 3 years ago

Records returned by make_batch_reader are returned as a dictionary of coalesced numpy arrays. To batch multiple rows together, every row of a given field must have the same shape. If a user passes a shape that has variable dimensions, multiple rows won't coalesce properly.

This PR raises an explicit error with a clear message when a TransformSpec is given a shape with a variable dimension.

See also: #633
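
To illustrate the failure mode, here is a minimal sketch in plain numpy. It does not use petastorm's actual code: `validate_fixed_shape` and `coalesce_rows` are hypothetical helpers, and the sketch assumes a variable dimension is written as `None` in the declared shape.

```python
import numpy as np


def validate_fixed_shape(field_name, shape):
    """Hypothetical check: reject a declared shape with a variable dimension.

    Assumes a variable dimension is represented by None.
    """
    if any(dim is None for dim in shape):
        raise ValueError(
            "Field '{}' has shape {} with a variable dimension; rows of "
            "different shapes cannot be coalesced into one batch array."
            .format(field_name, shape))
    return tuple(shape)


def coalesce_rows(rows):
    """Stack per-row arrays into a single batch array.

    np.stack raises ValueError unless all rows have identical shapes.
    """
    return np.stack(rows)


# Rows with a fixed shape stack into one array with a leading batch axis.
batch = coalesce_rows([np.zeros((2, 3)), np.ones((2, 3))])
print(batch.shape)

# A declared shape with a variable dimension is rejected up front.
try:
    validate_fixed_shape('image', (None, 3))
except ValueError as error:
    print(error)
```

Checking the declared shape up front surfaces the problem where the shape is defined, rather than later as a less obvious stacking failure while a batch is being assembled.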

codecov[bot] commented 3 years ago

Codecov Report

Merging #634 (4ba05a6) into master (6c64a37) will increase coverage by 0.00%. The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #634   +/-   ##
=======================================
  Coverage   85.35%   85.36%           
=======================================
  Files          85       85           
  Lines        4978     4980    +2     
  Branches      790      791    +1     
=======================================
+ Hits         4249     4251    +2     
  Misses        589      589           
  Partials      140      140           
Impacted Files                      Coverage Δ
petastorm/arrow_reader_worker.py    90.84% <100.00%> (+0.12%) ↑


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 6c64a37...4ba05a6.