uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0
1.8k stars 284 forks source link

Added tests for test_parquet_reader.py #445

Closed gregw18 closed 3 years ago

gregw18 commented 5 years ago

Added tests for selecting specific columns and requesting invalid columns to test_parquet_reader. Modified specific column test to request specific column names, rather than regex patterns, so could select even columns rather than odd, so would always find at least one. Added comment explaining why regex patterns were a problem.

selitvin commented 5 years ago

this is a good test. We can definitely add it.

I was considering to introduce #446. It may be a better behavior for the regex matching. What do you think?

gregw18 commented 5 years ago

That would definitely simplify my test, and better match how I expected that parameter to work, but my experience in this area is limited so my opinion isn't particularly informed! I hope that there aren't many people relying on the current behavior.

selitvin commented 5 years ago

Yes. This is an unpleasant change short term and may break some people code. That being said, that's the right behavior to have long term.

gregw18 commented 5 years ago

It definitely follows the principle of least surprise, at least to a long time dos and windows programmer. I'm also new to contributing to open source projects. Should I wait for #446 to be merged to master and then update my test?

selitvin commented 4 years ago

Sorry for a delayed reply. Was travelling. I'll try to land 446 soon and have a release within couple of days.

selitvin commented 4 years ago

Since #446 has landed on master, is ti ok to close this one?

gregw18 commented 4 years ago

I apologize for my confusing commit messages, but the purpose of these tests was to address #257 - I just happened to run into and get surprised by the schema_fields behavior. I hope that these tests are still useful.

selitvin commented 4 years ago

Should we continue working on this PR? If yes, can you please rebase?

gregw18 commented 4 years ago

Updating this and #444 have been on my list, but I needed the reminder...

codecov[bot] commented 4 years ago

Codecov Report

Merging #445 into master will decrease coverage by 0.21%. The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #445      +/-   ##
==========================================
- Coverage   85.98%   85.77%   -0.22%     
==========================================
  Files          81       79       -2     
  Lines        4311     4190     -121     
  Branches      674      665       -9     
==========================================
- Hits         3707     3594     -113     
+ Misses        499      494       -5     
+ Partials      105      102       -3
Impacted Files Coverage Δ
petastorm/unischema.py 94.58% <0%> (ø) :arrow_up:
petastorm/spark/spark_dataset_converter.py
petastorm/spark/__init__.py

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update e3acecf...a31035d. Read the comment docs.

selitvin commented 4 years ago

Can you please rebase? Have a conflict with master for some reason...

gregw18 commented 4 years ago

Please let me know if this still generates a conflict, or is just too ugly (5000 commits for 2 functions!) and I will cut and paste the changes into a new PR. (I've learned a lot about git since starting this and hopefully will be cleaner going forward.)

selitvin commented 4 years ago

Hmmm. Github still see a conflict. You don't see the same message? image

gregw18 commented 4 years ago

Thanks for the screenshot - I had assumed that I wasn't looking in the right spot for the error. Github is telling me that there are no conflicts, with the caveat that I don't have write access and so can't merge pull requests. image I hope that it's ok, but I've created a clean PR, #486, to replace this one. Hopefully it will let you land this and not spend any more time on it!

selitvin commented 4 years ago

Awesome! Landed the new PR. Thanks for the tests!

On Fri, Feb 7, 2020 at 11:49 AM gregw18 notifications@github.com wrote:

Thanks for the screenshot - I had assumed that I wasn't looking in the right spot for the error. Github is telling me that there are no conflicts, with the caveat that I don't have write access and so can't merge pull requests. [image: image] https://urldefense.proofpoint.com/v2/url?u=https-3A__user-2Dimages.githubusercontent.com_50498452_74059415-2Dffef4300-2D49b5-2D11ea-2D8c76-2D45847e861836.png&d=DwMCaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=9ZOlsbdIJ9b5MqHnADbKZA&m=lsLbG13zRSj0GGt1J6zIoKJnRlJOEfgceRLgboGI5QQ&s=ckiopI5g-8cBZe1zf725TTyKG5fjzg13iWUE_idVJEM&e= I hope that it's ok, but I've created a clean PR, #486 https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_uber_petastorm_pull_486&d=DwMCaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=9ZOlsbdIJ9b5MqHnADbKZA&m=lsLbG13zRSj0GGt1J6zIoKJnRlJOEfgceRLgboGI5QQ&s=TEbzBRiOIkGJ1TwGCtVVXu6Br49cUQlayyjwjk5Dhjo&e=, to replace this one. Hopefully it will let you land this and not spend any more time on it!

— You are receiving this because your review was requested. Reply to this email directly, view it on GitHub https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_uber_petastorm_pull_445-3Femail-5Fsource-3Dnotifications-26email-5Ftoken-3DAAZIEDH3DKVYIID5JVCK3D3RBW3LXA5CNFSM4JKZT5J2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOELELRMA-23issuecomment-2D583579824&d=DwMCaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=9ZOlsbdIJ9b5MqHnADbKZA&m=lsLbG13zRSj0GGt1J6zIoKJnRlJOEfgceRLgboGI5QQ&s=c2WtmZj4WPJfz_AOOu9idPdVCuA2iyaPbUCeNhKQwa8&e=, or unsubscribe https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AAZIEDEX7M7AVDF65NQOZ5TRBW3LXANCNFSM4JKZT5JQ&d=DwMCaQ&c=r2dcLCtU9q6n0vrtnDw9vg&r=9ZOlsbdIJ9b5MqHnADbKZA&m=lsLbG13zRSj0GGt1J6zIoKJnRlJOEfgceRLgboGI5QQ&s=kFWHBUzUG9AOTQA6oODrH9lHh8b1KSS-jzY203X42n4&e= .