uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0
1.78k stars 285 forks source link

fix get_dataset_path() in fs_utils.py #663

Open dongpohezui opened 3 years ago

dongpohezui commented 3 years ago

version: Windows 10 Python 3.7.0 petastorm 0.9.8 pyarrow 3.0.0

from petastorm import make_reader

reader = make_reader('file:///D:/test/test.parquet') 
print(reader)

The following error occurred when I ran the above code under Windows.

picture

The image above shows that there is an extra '/' at the beginning of the file path. So I deleted the extra '/' .

CLAassistant commented 3 years ago

CLA assistant check
All committers have signed the CLA.

codecov[bot] commented 3 years ago

Codecov Report

Patch coverage: 33.33% and project coverage change: -0.04% :warning:

Comparison is base (10e0fc8) 85.22% compared to head (271da38) 85.19%. Report is 70 commits behind head on master.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #663 +/- ## ========================================== - Coverage 85.22% 85.19% -0.04% ========================================== Files 85 85 Lines 4981 4984 +3 Branches 791 792 +1 ========================================== + Hits 4245 4246 +1 - Misses 596 597 +1 - Partials 140 141 +1 ``` | [Files Changed](https://app.codecov.io/gh/uber/petastorm/pull/663?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber) | Coverage Δ | | |---|---|---| | [petastorm/fs\_utils.py](https://app.codecov.io/gh/uber/petastorm/pull/663?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber#diff-cGV0YXN0b3JtL2ZzX3V0aWxzLnB5) | `90.00% <33.33%> (-1.76%)` | :arrow_down: |

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

selitvin commented 3 years ago

Did you manage to track down what is the origin of this extra '/'? Which part of code added it? Asking this to make sure the same issue wouldn't pop up in some other scenarios with different code paths.

Also, can you please add a unittest covering this case?

dongpohezui commented 3 years ago

The URL comes from user input. Under Linux, it works just fine. Under Windows, it makes mistakes.

In get_filesystem_and_path_or_paths(), URLparse () is used to handle URLs.

Under Windows, the file cannot be accessed via the file path "/D:/test/test.parquet ". Instead, it should be "D:/test/test.parquet".

In my opinion, you can only test this code under Windows.