uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

[WIP] #317 - Allow passing of a username parameter to FileSystemResolver #363

Closed: jsgoller1 closed this 5 years ago

jsgoller1 commented 5 years ago

Problem

The FilesystemResolver class supports connecting to HDFS. However, there is no way to pass an HDFS username to the class. As a result, it is easy to run into permissions issues on created directories: it is not clear whether a user who creates a directory will be able to access it later.

Solution

Allow passing a username parameter to FilesystemResolver to correct this issue.
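
As a rough illustration of the idea (not the actual change in this PR), a resolver could accept an optional username and forward it to whatever function opens the HDFS connection. The class name `UserAwareResolver`, the `connector` argument, the fallback port 8020, and the `fake_connect` helper are all hypothetical names chosen for this sketch. The connection is stubbed out so the snippet runs without a cluster.

```python
from urllib.parse import urlparse


class UserAwareResolver:
    """Hypothetical stand-in for a filesystem resolver; not petastorm's actual class."""

    def __init__(self, dataset_url, connector, user=None):
        parsed = urlparse(dataset_url)
        if parsed.scheme != 'hdfs':
            raise ValueError('This sketch only handles hdfs:// URLs')
        self.path = parsed.path
        # Forward the username so that directories are created, and later
        # accessed, under the same HDFS identity. The default port 8020 is
        # an assumption made for this sketch.
        self.filesystem = connector(parsed.hostname, parsed.port or 8020, user=user)


def fake_connect(host, port, user=None):
    """Test double that records its arguments instead of opening a connection."""
    return {'host': host, 'port': port, 'user': user}


resolver = UserAwareResolver('hdfs://namenode:8020/data/train', fake_connect, user='alice')
print(resolver.filesystem['user'], resolver.path)
```

In a real setup the `connector` would be an actual HDFS client factory that accepts a user argument; injecting it as a parameter is only a design choice of this sketch, made so the username handling can be exercised in a unit test without a live cluster.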

selitvin commented 5 years ago

Looks like a good feature. Please see if we can also cover some of it in a unit test.

jsgoller1 commented 5 years ago

Locally, this PR is failing these tests:

test_torch_tensorable_types[int8]
TestSparkUtils.test_reading_subset_of_columns
TestSparkUtils.test_simple_read_rdd

I have confirmed that these tests are currently failing on master as well.

selitvin commented 5 years ago

Do you intend to abandon this PR?

Ivan-Dimitrov commented 5 years ago

A new PR adds this functionality: #384