radanalyticsio / openshift-spark


update hadoop to more recent version #109

Open krisiasty opened 4 years ago

krisiasty commented 4 years ago

The currently used Hadoop version (2.7.3) is way too old (released back in June 2017). One consequence is the missing support for the "fs.s3a.path.style.access" property of the s3a filesystem layer, which means an S3-compatible object store must be configured with virtual-hosted bucket addressing. That in turn is not supported on OpenShift Container Storage 4.4 (or at least it is not enabled by default, and how to configure the feature is not properly documented).

That means the Spark + Object Store example in the basic tutorial won't work on the latest OpenShift Container Platform (4.4) with OpenShift Container Storage.

Despite having the following settings:

s3_endpoint_url = 'https://s3.openshift-storage.svc:443'
s3_bucket = 'odh-jupyterhub-9654ef69-1f36-48f1-b50f-4d2dbef1357d'

hadoopConf.set("fs.s3a.path.style.access", "true")

the code from the tutorial raises an exception trying to connect to the bucket via a virtual-hosted URL (http://bucket.s3endpoint/ instead of https://s3endpoint/bucket/):

Py4JJavaError: An error occurred while calling o96.csv.
: com.amazonaws.AmazonClientException: Unable to execute HTTP request: odh-jupyterhub-9654ef69-1f36-48f1-b50f-4d2dbef1357d.s3.openshift-storage.svc: Name or service not known
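For illustration, the difference between the two addressing styles can be sketched in plain Python (the URL helpers below are hypothetical, not part of hadoop-aws; the endpoint and bucket values are taken from the example above):

```python
from urllib.parse import urlparse

def virtual_hosted_url(endpoint: str, bucket: str, key: str) -> str:
    # Virtual-hosted style: the bucket becomes part of the hostname, so the
    # client must be able to resolve "<bucket>.<endpoint-host>" in DNS --
    # exactly the lookup that fails in the traceback above.
    p = urlparse(endpoint)
    return f"{p.scheme}://{bucket}.{p.netloc}/{key}"

def path_style_url(endpoint: str, bucket: str, key: str) -> str:
    # Path-style: the bucket is just the first path segment, so only the
    # endpoint hostname itself needs to resolve. This is what
    # fs.s3a.path.style.access=true selects in Hadoop versions that support it.
    p = urlparse(endpoint)
    return f"{p.scheme}://{p.netloc}/{bucket}/{key}"

endpoint = "https://s3.openshift-storage.svc:443"
bucket = "odh-jupyterhub-9654ef69-1f36-48f1-b50f-4d2dbef1357d"

print(virtual_hosted_url(endpoint, bucket, "data.csv"))
print(path_style_url(endpoint, bucket, "data.csv"))
```

Because Hadoop 2.7.3 ignores the path-style property, the client always builds the first form, and the in-cluster DNS has no record for the bucket-prefixed hostname.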
elmiko commented 4 years ago

thanks for the report @krisiasty , unfortunately most of the artifacts in this repo are in a maintenance mode. i've labelled this as a bug, but i'm not sure how quickly someone would be able to address it.

rimolive commented 4 years ago

PR #110 should handle this as we're adding Spark 3.0 with Hadoop 3.2 jars.
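With s3a jars from Hadoop 2.8 or later on the classpath, path-style access can be passed through Spark's hadoop configuration. A hypothetical spark-submit sketch (endpoint, credentials, and application name are placeholders, not values from this repo):

```shell
# Sketch only: requires hadoop-aws >= 2.8, which honors fs.s3a.path.style.access.
# spark.hadoop.* properties are forwarded to the Hadoop configuration.
spark-submit \
  --conf spark.hadoop.fs.s3a.endpoint=https://s3.openshift-storage.svc:443 \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
  --conf spark.hadoop.fs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" \
  app.py
```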