projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

Hadoop 3 compatibility #465

Closed skhalid7 closed 2 years ago

skhalid7 commented 2 years ago

Hi, I'm running Glow on GCP using Dataproc. Spark 3 there is set up with Hadoop 3.2. Going over Glow's pom.xml, it seems it's using Hadoop 2.7, which causes dependency conflicts. Unfortunately, changing the Dataproc configuration isn't straightforward; is there any way I can change Glow's Hadoop dependency to Hadoop 3.2?
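For reference, one standard consumer-side workaround for this kind of conflict is to exclude the library's transitive Hadoop dependencies and pin the cluster's Hadoop version explicitly in your own pom.xml. This is only a sketch: the Glow coordinate is the one mentioned later in this thread, the Hadoop version is illustrative, and whether Glow actually works against Hadoop 3 this way is exactly what this issue is about.

```xml
<!-- Sketch only: exclude Glow's transitive Hadoop 2.7 artifacts and pin
     the Hadoop version your cluster ships with. Versions are illustrative. -->
<dependency>
  <groupId>io.projectglow</groupId>
  <artifactId>glow-spark3_2.12</artifactId>
  <version>1.1.1</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>3.2.2</version>
  <scope>provided</scope>
</dependency>
```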

Thank you

williambrandler commented 2 years ago

Hey @skhalid7, what's the error you get with Dataproc? And how are you installing Glow?

Any more details you can provide will be helpful

williambrandler commented 2 years ago

Opened https://github.com/projectglow/glow/pull/467 to track this. The change may cause issues with other libraries that Glow depends on, such as hadoop-bam. We will test it for the next release of Glow (Spark 3.2). The next release will take some time, as we are working with the spark-core team to figure out breaking changes in Spark 3.2.

For now, can you use older versions of Glow and Dataproc that depend on older versions of Hadoop/Spark?
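Pinning an older Glow on an older (Spark 2 / Hadoop 2) Dataproc image might look like the sketch below. The 0.6.x artifact coordinate is an assumption and should be verified on Maven Central before use.

```shell
# Sketch: submit a job with an older Glow pinned via --packages.
# The glow_2.11:0.6.0 coordinate is illustrative; check Maven Central.
spark-submit \
  --packages io.projectglow:glow_2.11:0.6.0 \
  my_pipeline.py
```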

williambrandler commented 2 years ago

Just confirming from the CircleCI checks that changing the Hadoop version does break the Scala tests. It will not be possible to resolve this in the short term.

Does Dataproc support Docker containers? If so, we can work with you to adapt the Glow Docker container to work on Dataproc.

skhalid7 commented 2 years ago

Hi, thanks for the confirmation. The error message I get when I try any I/O operation is:

'ERROR org.apache.spark.util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[readingParquetFooters-ForkJoinPool-1-worker-1,5,main] java.lang.IllegalAccessError: class org.apache.hadoop.hdfs.web.HftpFileSystem cannot access its superinterface org.apache.hadoop.hdfs.web.TokenAspect$TokenManagementDelegator'

I install Glow using the Maven artifact glow-spark3_2.12:1.1.1 and pip install. Unfortunately, all Spark 3 Dataproc images come with Hadoop 3.2 pre-installed, so I can't use Glow 1.0+ on them. I've been using Glow 0.6 successfully on Dataproc.
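For completeness, the install pattern described above is roughly the following (the coordinates are the ones quoted in this comment; `glow.py` is the name of the Python package on PyPI):

```shell
# Install the Python bindings, then launch PySpark with the matching
# Scala artifact on the classpath.
pip install glow.py
pyspark --packages io.projectglow:glow-spark3_2.12:1.1.1
```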

Docker is supported and sounds like a good option.

Thank you!

williambrandler commented 2 years ago

The Glow Docker container is built in layers on top of the Databricks Runtime version of Spark. The relevant genomics layers can be adapted for Dataproc. I expect you could then override the Hadoop version, but I do not know how. Do you have a cloud engineer at Google who could help work on this? Please message your GCP account team to get them in the loop so we can chart a path forward.

skhalid7 commented 2 years ago

Thanks, I dropped you a message on the Glow Slack. Alternatively, is there an email I can reach you at?

williambrandler commented 2 years ago

Hey @skhalid7, after consulting internally, it may take significant development effort to get this working on Dataproc (a few weeks of engineering time), and we do not have funding approved for this.

I am sure we can find a way, but it will take a while. Are you able to use Databricks on GCP for this work? Or does it have to be Dataproc?

williambrandler commented 2 years ago

Hey @skhalid7, we now have a container that should work on Google Cloud with GCS and includes Hadoop 3 compatibility:

https://github.com/projectglow/glow/pull/503

This was contributed by @edg1983 (https://github.com/projectglow/glow/issues/494), with some modifications.

The container is now on the projectglow dockerhub page here: https://hub.docker.com/r/projectglow/open-source-glow

projectglow/open-source-glow:1.1.2
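To try the image locally, something like the following should work (assuming the image includes bash; the entrypoint is not documented in this thread):

```shell
# Pull the published image and open a shell in it to inspect the
# Spark and Hadoop versions it ships with.
docker pull projectglow/open-source-glow:1.1.2
docker run --rm -it projectglow/open-source-glow:1.1.2 bash
```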

Sorry it took so long; hopefully this solution will work for you on Google Cloud.

skhalid7 commented 2 years ago

Thank you very much!