Closed · WeichenXu123 closed this 4 years ago
Nice feature!
Merging #504 into master will increase coverage by 0.23%. The diff coverage is 90.47%.
```diff
@@            Coverage Diff             @@
##           master     #504      +/-   ##
==========================================
+ Coverage   85.77%   86.00%   +0.23%
==========================================
  Files          79       81       +2
  Lines        4190     4331     +141
  Branches      665      683      +18
==========================================
+ Hits         3594     3725     +131
- Misses        494      500       +6
- Partials      102      106       +4
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| petastorm/reader.py | 90.82% <ø> (ø) | |
| petastorm/arrow_reader_worker.py | 91.72% <90.47%> (-0.28%) | :arrow_down: |
| petastorm/spark/\_\_init\_\_.py | 100.00% <0.00%> (ø) | |
| petastorm/spark/spark_dataset_converter.py | 93.27% <0.00%> (ø) | |
Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update a61fe13...e918bad.
@selitvin Ready. :)
Make `make_batch_reader` TransformSpec support outputting multi-dimensional array types.

Why do we need this feature?

For the project "Simplify data conversion from Spark to TensorFlow": the Spark converter's basic implementation is built on the `make_batch_reader` API, and users may want to do some preprocessing and return a tensor (multi-dimensional array) from the preprocess function. Currently, the `make_batch_reader` TransformSpec function only allows returning one-dimensional arrays because of a pyarrow format limitation.

How does the PR address it?
This PR addresses the issue. The approach is: flatten the multi-dimensional arrays returned by the TransformSpec function, and in the `ArrowReaderWorkerResultsQueueReader` data-loading code, reshape them back to the specified shape. Note that reshape on a contiguous buffer returns a view without copying the data, so it won't affect performance.

Manual test code