Open WeichenXu123 opened 4 years ago
Merging #512 into master will decrease coverage by 0.16%. The diff coverage is 72.91%.
@@ Coverage Diff @@
## master #512 +/- ##
==========================================
- Coverage 86.02% 85.86% -0.17%
==========================================
Files 81 81
Lines 4402 4442 +40
Branches 704 713 +9
==========================================
+ Hits 3787 3814 +27
- Misses 504 511 +7
- Partials 111 117 +6
Impacted Files | Coverage Δ | |
---|---|---|
petastorm/tf_utils.py | 80.91% <ø> (ø) | :arrow_up: |
petastorm/spark/spark_dataset_converter.py | 87.5% <25%> (-3.13%) | :arrow_down: |
petastorm/reader.py | 90.32% <77.77%> (-0.68%) | :arrow_down: |
petastorm/arrow_reader_worker.py | 90.34% <83.87%> (-1.66%) | :arrow_down: |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0b70510...529cb83.
I created a simple PR to address issue 1: https://github.com/uber/petastorm/pull/517. We can merge that one first. This PR can be longer-term work.
What issues does the PR address?
There are 2 issues in `make_batch_reader`: one is critical, and the other is less critical but a pain point.
1. (Critical) Schema inference in `make_batch_reader` cannot infer the fields' shape information. Because there is no shape information, errors may occur when we build a TensorFlow dataset from the reader and then apply dataset operations such as unroll, batch, or reshaping a field. TensorFlow graph operators depend heavily on field shape information.
2. (Pain point) `TransformSpec` requires the edited/removed fields to be specified manually. We would like users to provide only a transform function and have petastorm automatically infer the result schema from the pandas DataFrame that the transform function returns.
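Both issues come down to learning field types and shapes from actual data instead of from metadata alone. The following is only a rough sketch of that idea, not the code in this PR: it uses plain Python dicts and nested lists in place of Arrow tables and pandas DataFrames, and the helper names (`sketch_infer_schema`, `diff_schemas`, `transform`) are hypothetical.

```python
def infer_shape(value):
    """Shape of a scalar, a nested list/tuple, or any object exposing `.shape`."""
    shape = getattr(value, "shape", None)  # e.g. numpy arrays
    if shape is not None:
        return tuple(shape)
    if isinstance(value, (list, tuple)):
        if not value:
            return (0,)
        # Assumes the nesting is rectangular, so the first element is representative.
        return (len(value),) + infer_shape(value[0])
    return ()  # scalar


def leaf_type(value):
    """Name of the element type: `.dtype` if present, else the innermost Python type."""
    dtype = getattr(value, "dtype", None)
    if dtype is not None:
        return str(dtype)
    while isinstance(value, (list, tuple)) and value:
        value = value[0]
    return type(value).__name__


def sketch_infer_schema(row):
    """Hypothetical helper: map each field name to (element type, shape) using one row."""
    return {name: (leaf_type(v), infer_shape(v)) for name, v in row.items()}


def diff_schemas(before, after):
    """Hypothetical helper: fields whose type/shape changed or appeared, and fields that vanished."""
    edited = {name: spec for name, spec in after.items() if before.get(name) != spec}
    removed = [name for name in before if name not in after]
    return edited, removed


# A single "first row" standing in for the first record read from the dataset.
row = {"id": 7, "image": [[0, 1, 2], [3, 4, 5]], "label": "cat"}


def transform(r):
    """Example user transform: flatten `image` and drop `label`."""
    out = dict(r)
    out["image"] = [px for line in r["image"] for px in line]
    del out["label"]
    return out


before = sketch_infer_schema(row)
after = sketch_infer_schema(transform(row))
edited, removed = diff_schemas(before, after)
```

Here `edited` and `removed` play the role of the edited/removed fields that the user otherwise has to list by hand, and the shapes in `before` are the kind of information that is missing when the schema is inferred without reading any data. A shape taken from one row is only correct if every row has the same shape, which this sketch does not check.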
The approach in the PR
- Add a method `ArrowReaderWorker.infer_schema_from_first_row`, which reads one row first and infers the schema from it, so that we get accurate shape information.
- Add a param `infer_schema_from_first_row` to `make_batch_reader` (default off, so it won't break API behavior).
Limitations:
Test
Unit tests are still to be added, but the PR is ready for a first review.
Example code