Adding the user parameter when pyarrow.hdfs.connect and using spark user when possible

uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

Apache License 2.0


Adding the user parameter when pyarrow.hdfs.connect and using spark user when possible #386

Status: Closed (Ivan-Dimitrov closed this 5 years ago)

Ivan-Dimitrov commented 5 years ago

A user is added as an optional parameter to the filesystem resolver. The various entry points that use the filesystem resolver (materialize dataset, row group indexer, etc.) now pass the Spark user name to it. The Spark user name is usually taken from the HADOOP_USER_NAME environment variable.
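
For illustration only, the idea can be sketched roughly as follows. This is not the code from this PR: `resolve_hdfs_user` and `connect_hdfs` are hypothetical helper names, and the `connector` argument exists only so the sketch can be exercised without a cluster. It assumes the legacy `pyarrow.hdfs.connect(host, port, user=...)` API, which newer pyarrow releases no longer ship.

```python
import os


def resolve_hdfs_user(explicit_user=None, environ=None):
    """Pick the HDFS user: an explicit value wins, then HADOOP_USER_NAME.

    Returns None when neither is set, so the connection keeps its default user.
    (Hypothetical helper, not part of petastorm.)
    """
    env = os.environ if environ is None else environ
    if explicit_user:
        return explicit_user
    return env.get("HADOOP_USER_NAME") or None


def connect_hdfs(host, port, user=None, connector=None):
    """Open an HDFS connection, forwarding the resolved user name.

    `connector` is injectable for testing; by default the legacy
    pyarrow.hdfs.connect is used (assumption: an older pyarrow is installed).
    """
    if connector is None:
        import pyarrow.hdfs  # legacy API, absent from recent pyarrow releases
        connector = pyarrow.hdfs.connect
    return connector(host, port, user=resolve_hdfs_user(user))
```

A caller that already has a Spark context could pass its user name explicitly, for example `connect_hdfs(host, port, user=spark_context.sparkUser())`, and fall back to the environment variable otherwise.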