Closed: skhalid7 closed this issue 2 years ago
Hey @skhalid7, what's the error you get with Dataproc? And how are you installing Glow?
Any more details you can provide will be helpful
Opened https://github.com/projectglow/glow/pull/467 to track this. It may be that this change will cause issues with other libraries that Glow depends on, such as Hadoop-BAM. We will test for the next release of Glow (Spark 3.2). The next release will take some time, as we are working with the Spark core team to figure out breaking changes in Spark 3.2.
For now, can you use older versions of Glow + Dataproc that depend on older versions of Hadoop/Spark?
Just confirming from the CircleCI checks that changing the Hadoop version does break the Scala tests. It will not be possible to resolve this in the short term.
Does Dataproc support Docker containers? If so, we can work with you to adapt the Glow Docker container to work on Dataproc.
Hi, thanks for the confirmation. The error message I get when I try any I/O operation is:
`ERROR org.apache.spark.util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[readingParquetFooters-ForkJoinPool-1-worker-1,5,main] java.lang.IllegalAccessError: class org.apache.hadoop.hdfs.web.HftpFileSystem cannot access its superinterface org.apache.hadoop.hdfs.web.TokenAspect$TokenManagementDelegator`
I install Glow using the Maven artifact glow-spark3_2.12:1.1.1 and pip install. Unfortunately, all Spark 3 Dataproc clusters come with Hadoop 3.2 pre-installed, so I can't use Glow 1.0+ versions on them. I've been using Glow 0.6 successfully on Dataproc.
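For reference, a minimal sketch of how the two halves of that install fit together (the version pin and the `spark-submit` flag are assumptions based on the versions mentioned in this thread; `my_pipeline.py` is a hypothetical job script):

```shell
# Keep the Python wrapper and the Scala artifact pinned to the same Glow
# version (1.1.1, the version mentioned in this thread).
GLOW_VERSION="1.1.1"
SCALA_COORD="io.projectglow:glow-spark3_2.12:${GLOW_VERSION}"

# On the cluster, run these for real (commented out here, as they need
# network access):
#   pip install "glow.py==${GLOW_VERSION}"
#   spark-submit --packages "${SCALA_COORD}" my_pipeline.py

echo "${SCALA_COORD}"
```

Keeping the pip and Maven versions in lockstep avoids mixing a new Python wrapper with an older Scala jar.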
Docker is supported and sounds like a good option.
Thank you!
The Glow Docker container is built in layers off the Databricks Runtime version of Spark. The relevant genomics layers can be adapted for Dataproc. I expect you can then override the Hadoop version, but I do not know how. Do you have a cloud engineer at Google who can help work on this? Please message your GCP account team to get them in the loop so we can chart a path forward.
Thanks, I dropped you a message on the Glow Slack. Alternatively, is there an email address I should reach out to?
Hey @skhalid7, after consulting internally, it may take significant development effort to get this working on Dataproc (a few weeks of engineering time), and we do not have funding approved for this.
I am sure we can find a way, but it will take a while. Are you able to use Databricks on GCP for this work? Or does it have to be with Dataproc?
Hey @skhalid7, we now have a container that should work on Google Cloud with GCS and includes Hadoop 3 compatibility:
https://github.com/projectglow/glow/pull/503
This was contributed by @edg1983, with some modifications: https://github.com/projectglow/glow/issues/494
The container is now on the projectglow Docker Hub page here: https://hub.docker.com/r/projectglow/open-source-glow
`projectglow/open-source-glow:1.1.2`
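As a sketch, pulling the image and checking that the Python package is inside might look like this (the `docker run` check is an assumption, not a documented entry point for this image):

```shell
# The image tag mentioned above.
IMAGE="projectglow/open-source-glow:1.1.2"

# Run these for real where Docker is available (commented out here, as they
# need network access):
#   docker pull "${IMAGE}"
#   docker run --rm "${IMAGE}" python -c "import glow; print(glow.__version__)"

echo "${IMAGE}"
```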
Sorry it took so long; hopefully this solution will work for you on Google Cloud.
Thank you very much!
Hi, I'm running Glow on GCP using a Dataproc cluster, where Spark 3 is set up with Hadoop 3.2. Going over Glow's pom.xml, it seems that it uses Hadoop 2.7, which causes dependency conflicts. Unfortunately, changing the Dataproc configuration isn't straightforward. Is there any way I can change Glow's Hadoop dependency to Hadoop 3.2?
Thank you