hammerlab / guacamole

Spark-based variant calling, with experimental support for multi-sample somatic calling (including RNA) and local assembly
Apache License 2.0

Can't use guacamole in spark-shell due to guava-shading nuances #588

Closed ryan-williams closed 7 years ago

ryan-williams commented 7 years ago

Interval exposes com.google.common.collect.Range in its public API, and for some reason that reference doesn't seem to get relocated by the shade plugin.

I don't fully understand why. When I package an assembly (or guac) JAR with -Puber (resp. -Pguac) and extract the class files from it, I see an org/hammerlab/guacamole/reference/Interval.class that contains, in ASCII, toJavaRange.()Lorg/hammerlab/guavarelocated/collect/Range and no instances of google. I'd have thought that meant the shading was proceeding as intended.
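The manual check described above (extract the class files, look for a package name in the raw bytes) can be sketched as a small Scala helper. This is only an illustration: FindClassRefs and classesMentioning are hypothetical names, not part of guacamole, and the search is a plain byte-level substring match over each .class entry.

```scala
import java.io.{BufferedInputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets.ISO_8859_1
import java.util.jar.JarFile
import scala.collection.mutable.ListBuffer

// Hypothetical helper, not part of guacamole.
object FindClassRefs {
  /** Names of .class entries in `jarPath` whose raw bytes contain `needle`. */
  def classesMentioning(jarPath: String, needle: String): List[String] = {
    val jar = new JarFile(jarPath)
    try {
      val hits = ListBuffer.empty[String]
      val entries = jar.entries()
      while (entries.hasMoreElements) {
        val entry = entries.nextElement()
        if (!entry.isDirectory && entry.getName.endsWith(".class")) {
          val in = new BufferedInputStream(jar.getInputStream(entry))
          try {
            // Copy the whole entry into memory.
            val buffer = new ByteArrayOutputStream()
            var b = in.read()
            while (b != -1) { buffer.write(b); b = in.read() }
            // ISO-8859-1 maps every byte to exactly one char, so no bytes are lost.
            val text = new String(buffer.toByteArray, ISO_8859_1)
            if (text.contains(needle)) hits += entry.getName
          } finally in.close()
        }
      }
      hits.toList
    } finally jar.close()
  }
}
```

For example, FindClassRefs.classesMentioning("target/guacamole-with-deps-0.0.1-SNAPSHOT.jar", "com/google/common") would list any class entries that still mention the un-relocated package in plain text. One caveat, offered as an assumption rather than a diagnosis of this issue: scalac records Scala-level signatures in an encoded ScalaSignature annotation, so a plain-text search like this may not see type names stored there.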

Yet, when I run:

spark-shell --jars target/guacamole-with-deps-0.0.1-SNAPSHOT.jar

then in the shell I get this error:

scala> org.hammerlab.guacamole.reference.Interval(10, 20)
error: bad symbolic reference. A signature in Interval.class refers to term collect
in package com.google.common which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling Interval.class.

I think this has to do with the "classes exposed in public API can't be shaded" issue that I've heard tell of around Spark. I've tried addressing it by not shading Range, or by shading but not relocating it, in both cases doing the same to various groups of classes around it. In the end, though, our use of Range.closedOpen (which only showed up in Guava 14.0.1, while Hadoop 2.* pins us to an un-shaded Guava 11.0.2) seems to make it impossible to get this working.

One possible escape hatch is the "user classpath first" Spark configs (spark.driver.userClassPathFirst and spark.executor.userClassPathFirst), but even in the latest Spark release they are marked experimental, and IME they've accordingly introduced other problems when I've tried to use them in similar situations in the past, so I'm not eager to hitch ourselves to them.

My only other idea for proceeding is to remove that Range reference from Interval's public API, which would also help characterize what kinds of things we can and can't shade.
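As a rough sketch of what "no Range in the public API" could look like, here is a half-open interval written against the Scala standard library only. The real Interval class isn't shown in this issue, so the field names and methods below are assumptions for illustration; the only thing taken from the report is that Interval(10, 20) is a valid constructor call.

```scala
/**
 * Hypothetical sketch, not guacamole's actual Interval.
 * Represents the half-open interval [start, end), the same bounds that
 * Guava's Range.closedOpen(start, end) describes, without exposing any
 * Guava type in a public signature.
 */
case class Interval(start: Long, end: Long) {
  require(start <= end, s"start ($start) must not exceed end ($end)")

  /** Number of positions covered; zero for an empty interval. */
  def length: Long = end - start

  /** True if `position` lies in [start, end). */
  def contains(position: Long): Boolean =
    start <= position && position < end

  /** True if the two intervals share at least one position. */
  def overlaps(other: Interval): Boolean =
    math.max(start, other.start) < math.min(end, other.end)
}
```

With nothing from com.google.common in the class's signatures, there would be no Guava symbol for the shell to resolve when it loads Interval. Any code that still wants a Guava Range could build one privately from start and end.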

I'd love to get to the bottom of how the shell is even seeing a reference to com.google.common.collect, allegedly in a .class file that doesn't seem to contain any such references, but that's further in the weeds so not my first plan of attack.