scijava / scijava-grab

Plugins for SciJava dependency grabbing
BSD 2-Clause "Simplified" License

Redundant package storage #1

Open kephale opened 7 years ago

kephale commented 7 years ago

Many of the files that end up getting stored in .scijava are already in people's .m2 repositories (as well as Fiji.app/jars in some cases). It would be great if scijava-grab could work from the .m2 local Maven storage.

ctrueden commented 7 years ago

@kephale I agree that the redundancy is unfortunate. Would you be satisfied with hard links? This is what I do in jrun and it seems to work very well, even on Windows. That way, if someone wipes or messes with .m2, it does not affect artifacts grabbed by this mechanism, and vice versa, but we still reap the storage benefits.
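To illustrate the hard-link idea, here is a minimal Java sketch. It is not the jrun implementation; the class name `LinkOrCopy` and the copy fallback are assumptions made for the example. It relies only on the standard `java.nio.file.Files.createLink` call.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper for illustration; not part of jrun or scijava-grab. */
final class LinkOrCopy {

    /**
     * Makes target hold the same content as source, preferring a hard link
     * and falling back to a plain copy when linking is not possible.
     */
    static Path linkOrCopy(Path source, Path target) throws IOException {
        Files.createDirectories(target.toAbsolutePath().getParent());
        Files.deleteIfExists(target);
        try {
            // Both names now refer to the same data on disk.
            return Files.createLink(target, source);
        } catch (UnsupportedOperationException | IOException e) {
            // Assumed fallback: no hard-link support, or a different volume.
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("link-demo");
        Path inM2 = Files.write(dir.resolve("in-m2.jar"), new byte[] {1, 2, 3});
        Path inCache = linkOrCopy(inM2, dir.resolve("cache").resolve("in-cache.jar"));

        // Deleting one name leaves the other readable.
        Files.delete(inM2);
        System.out.println(Files.size(inCache)); // prints 3
    }
}
```

The `main` method shows the property described above: removing the file under one name does not affect the other, while the data is stored only once as long as the link succeeded.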

However, while jrun leans on mvn for the heavy lifting, scijava-grab does its own downloading. The downside is that if something is grabbed by scijava-grab first and later pulled into .m2/repository by a Maven build, no hard link would be created, since Maven does not create them.

Therefore, we could opt to go a different direction and make scijava-grab load directly from .m2/repository if present. This might be saner and more reliable. Perhaps even better would be to lean on Maven's Java API to pull in the remote artifact to .m2/repository, and then always load it from there, dispensing with the .scijava cache completely. But that is probably more work.
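As a rough sketch of what "load directly from .m2/repository if present" could look like, the Java below maps Maven coordinates to the standard local repository layout. The class and method names are hypothetical. It assumes the default `~/.m2/repository` location and ignores classifiers, snapshot handling, and any `localRepository` override in `settings.xml`.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/** Hypothetical lookup helper; not an existing scijava-grab API. */
final class LocalMavenRepo {

    /** Default local repository location (assumes no settings.xml override). */
    static Path defaultRepoRoot() {
        return Paths.get(System.getProperty("user.home"), ".m2", "repository");
    }

    /** Standard layout: group/as/dirs/artifactId/version/artifactId-version.jar */
    static Path jarPath(Path repoRoot, String groupId, String artifactId, String version) {
        return repoRoot
            .resolve(groupId.replace('.', '/'))
            .resolve(artifactId)
            .resolve(version)
            .resolve(artifactId + "-" + version + ".jar");
    }

    /** Returns the JAR if it is already present locally, otherwise empty. */
    static Optional<Path> findJar(Path repoRoot, String groupId, String artifactId, String version) {
        Path jar = jarPath(repoRoot, groupId, artifactId, version);
        return Files.isRegularFile(jar) ? Optional.of(jar) : Optional.empty();
    }

    public static void main(String[] args) {
        // Coordinates taken from the example later in this thread.
        System.out.println(findJar(defaultRepoRoot(), "fun.imagej", "fun.imagej", "0.2.4"));
    }
}
```

An empty result would be the cue to fall back to remote resolution.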

kephale commented 7 years ago

Hard links are fine and a definite improvement.

What led me to scijava-grab was using it from scijava-jupyter-kernel to fetch a Clojure dependency manager (https://github.com/cemerick/pomegranate), which populates dependencies into .m2/repository using Sonatype Aether under the hood. The only reason for this circuitous route is that I set up my older Beaker notebooks to work that way.

Practically speaking, I suspect the standard workflow will be (at least for me): prototype in Jupyter, then migrate the code to a proper Maven project. For that workflow I suspect it would be better if scijava-grab populated .m2/repository, since the notebooks will likely fetch the newer dependencies before Maven is used to fetch them.

Then of course one could bring up the ability to use .m2/repository artifacts with imagej-updater somehow which would take this discussion down a rabbit hole, but it should probably at least be mentioned in this context.

Any improvement would be great, but since everything is functioning now there isn't a need to rush.

ctrueden commented 7 years ago

Cool, I didn't know about pomegranate. Would it be possible to consume the pomegranate API from Java? Then we could use it instead of Groovy Grape under the hood.

If not, we can look into porting the relevant portion of pomegranate, i.e., directly calling Aether (though it is unclear to me whether it is still called Aether since it left the Eclipse umbrella...).

> Then of course one could bring up the ability to use .m2/repository artifacts with imagej-updater somehow which would take this discussion down a rabbit hole, but it should probably at least be mentioned in this context.

That would be awesome. PRs very welcome, ha ha.

> If effort is being invested into using Maven for fetching, then I wonder if it is possible to do it in a way that the imagej-updater could take advantage of such Maven support in the future.

Agreed. Certainly, I think we need to reach a place where the SciJava Jupyter Kernel ships as few artifacts as possible. We want all notebooks to begin with #@grab etc. for total version reproducibility. Otherwise, the version of SJJK will impact the behavior of the notebooks in negative ways.

As for ImageJ itself via the Updater: the first step would be updating the db.xml.gz data model to support the idea that binaries can be fetched from the ImageJ Maven repository public group, and not just from the update site itself. And then of course update the Java code to perform the right case logic based on the XML contents. The checksum used in this case would be the md5 in the Maven repo, so the updater has a quick way of validating that the Maven artifact is indeed the artifact in question.
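The validation step could look roughly like the Java sketch below, which hashes a local file and compares it with the content of a `.md5` sidecar file as published next to artifacts in Maven repositories. The class and method names are hypothetical, and nothing here reflects the actual imagej-updater code or the db.xml.gz format.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical checksum helper; not part of imagej-updater. */
final class Md5Check {

    /** Computes the MD5 of a file as a lowercase hex string. */
    static String md5Hex(Path file) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /**
     * Compares a file against sidecar content. Only the first whitespace-separated
     * token is used, since some sidecar files append a file name after the hash.
     */
    static boolean matches(Path file, String sidecarContent) throws IOException {
        String expected = sidecarContent.trim().split("\\s+")[0];
        return md5Hex(file).equalsIgnoreCase(expected);
    }
}
```

MD5 is used here only because it is the checksum named above; it serves as an identity check, not as protection against deliberate tampering.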

Please feel free to file an issue in imagej-updater about this. But it is unlikely anyone at LOCI will have time to work on it this year. This may sound weird, but on the totem pole of priorities, imagej-updater is quite low, since it only impacts the ImageJ application itself, and not all the other use cases like Jupyter, KNIME, OMERO, etc.

kephale commented 7 years ago

pomegranate would require minimal adaptation to be consumable from Java; however, it would not be very comfortable to use because of the number of native Clojure types its functions take as arguments. pomegranate basically has two namespaces: one is the primary public-facing one, which can be easily ported, but the other does a fair amount of legwork and, from a quick skim, would be unpleasant to port.

The path of least resistance would probably be to write a thin Clojure library that exposes pomegranate using native Java types instead of Clojure ones. I may regret this, but I could take care of that.

I completely understand putting imagej-updater Maven support at the back of the queue. I just wanted to get it on the radar, with the hope of getting more backend overlap with imagej-updater and other scijava dependency management.

One other thought: if a good solution for integrating Maven into scijava-grab can be achieved, then perhaps an updater plugin could be put together from scratch that makes no attempt at backwards compatibility and only focuses on providing Maven-based package management within ImageJ. I know that there are a bunch of perks of the way imagej-updater works currently, so what I'm suggesting is not necessarily a replacement, but a thought-experiment that could be useful to some folks (I think @fjug was interested in Maven support in imagej-updater as well).

ctrueden commented 7 years ago

I thought about it, and came up with a simple way to move forward that does not require too much development time:

  1. When a grab is requested, we first look in ~/.m2/repository (or $M2_REPO or wherever is configured), and hard link any existing artifact from there to .scijava/grapes.

  2. When no such artifact exists, we then lean on Grape as we are doing now to resolve the artifact remotely into .scijava/grapes. Then, we hard link the artifact from there into the local Maven repository cache, so that those artifacts are also available to Maven.

There are some subtleties, such as ensuring the Maven metadata files are updated correctly, but all in all I think it is doable, and does not require us to rewrite things to use Aether or pomegranate.
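The two steps above can be sketched in Java as follows. This is only an outline under stated assumptions: `GrabPlan` and `RemoteFetcher` are hypothetical names, the fetcher stands in for the existing Grape-based resolution, and the Maven metadata subtleties are not handled at all.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical outline of the two-step plan; not actual scijava-grab code. */
final class GrabPlan {

    /** Stand-in for the current Grape-based remote resolution (an assumption). */
    interface RemoteFetcher {
        void fetchTo(Path destination) throws IOException;
    }

    /**
     * @param m2Jar    where the artifact lives (or would live) in the local Maven repository
     * @param cacheJar where the artifact lives (or would live) under .scijava
     */
    static Path grab(Path m2Jar, Path cacheJar, RemoteFetcher remote) throws IOException {
        if (Files.isRegularFile(m2Jar)) {
            // Step 1: reuse the local Maven copy via a hard link.
            link(m2Jar, cacheJar);
        } else {
            // Step 2: resolve remotely into the cache, then expose it to Maven.
            Files.createDirectories(cacheJar.toAbsolutePath().getParent());
            remote.fetchTo(cacheJar);
            link(cacheJar, m2Jar);
        }
        return cacheJar;
    }

    /** Creates a hard link at 'link' pointing to the data of 'existing'. */
    private static void link(Path existing, Path link) throws IOException {
        Files.createDirectories(link.toAbsolutePath().getParent());
        if (Files.exists(link)) {
            if (Files.isSameFile(existing, link)) {
                return; // already linked
            }
            Files.delete(link);
        }
        Files.createLink(link, existing);
    }
}
```

A real implementation would also need a fallback for file systems without hard-link support and would have to write whatever metadata Maven expects next to the linked JAR.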

As an aside, we may want to change the default SciJava location to .scijava/grab instead of grapes, so that it does not become a misnomer if we change the grabbing backend internally.

> One other thought: if a good solution for integrating Maven into scijava-grab can be achieved, then perhaps an updater plugin could be put together from scratch that makes no attempt at backwards compatibility and only focuses on providing Maven-based package management within ImageJ.

I think you are right that a clean break here is the easiest route forward. There are actually several reasons this would be more realistic. :+1:

kephale commented 7 years ago

If scijava-grab avoided packages that are already on the classpath when fetching dependencies, then I believe the issues that I have been running into about multiple copies of libraries on the classpath would not arise.
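For illustration, one crude way to detect "already on the classpath" is to scan the `java.class.path` entries for a JAR whose file name starts with the artifact ID followed by a version. The Java sketch below does that; the class name is hypothetical, and the file-name heuristic is an assumption that would miss renamed JARs and classes loaded by other class loaders.

```java
import java.io.File;
import java.util.Arrays;
import java.util.regex.Pattern;

/** Hypothetical heuristic; not how scijava-grab currently behaves. */
final class ClasspathCheck {

    /**
     * Returns true if some classpath entry looks like artifactId-VERSION.jar,
     * where VERSION is assumed to start with a digit.
     */
    static boolean onClasspath(String classpath, String artifactId) {
        String regex = Pattern.quote(artifactId) + "-\\d.*\\.jar";
        return Arrays.stream(classpath.split(File.pathSeparator))
            .map(entry -> new File(entry).getName())
            .anyMatch(name -> name.matches(regex));
    }

    public static void main(String[] args) {
        String classpath = System.getProperty("java.class.path");
        // A grab of imagej-ops could be skipped if this prints true.
        System.out.println(onClasspath(classpath, "imagej-ops"));
    }
}
```

Requiring a digit after the artifact ID keeps `imagej` from matching `imagej-ops-….jar`, at the cost of missing versions that do not start with a digit.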

ctrueden commented 7 years ago

> If scijava-grab avoided packages that are already on the classpath when fetching dependencies, then I believe the issues that I have been running into about multiple copies of libraries on the classpath would not arise.

Sure, we could do that. But along with that, I'd like to reduce the system classpath to be as small as possible, to make as many things as possible be grabbable. I am concerned about the "bootstrapping problem" of scijava-common here—it would be ideal if scijava-common itself could also be grabbed, for better reproducibility of notebooks. I really want to avoid needing to pump out new versions of SJJK whenever scijava-common gains new features needed by ImageJ.

Perhaps scijava-grab should be standalone and not require scijava-common at all. And SJJK should use some scheme so that scijava-common can be grabbed and then utilized to execute cells. Maybe just provided scope in Maven with judicious catching of ClassNotFoundException would be good enough, once scijava-grab does not need SJC to work anymore.
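A minimal Java sketch of the "provided scope plus ClassNotFoundException" idea: code compiled against an optional dependency probes for it at runtime before touching it. The helper name is hypothetical; `org.scijava.Context` is used only as an example of a scijava-common class to probe for.

```java
/** Hypothetical guard for dependencies that are compiled against but not shipped. */
final class OptionalDependency {

    /** Returns true if the named class can be loaded by the given class loader. */
    static boolean isAvailable(String className, ClassLoader loader) {
        try {
            // 'false' avoids running static initializers during the probe.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError from partially present dependencies.
            return false;
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (isAvailable("org.scijava.Context", loader)) {
            System.out.println("scijava-common is present; cells can be executed");
        } else {
            System.out.println("scijava-common is missing; it would need to be grabbed first");
        }
    }
}
```

The probe takes an explicit class loader because grabbed JARs would typically be added to a loader other than the system one.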

kephale commented 6 years ago

The issue I posted as https://github.com/scijava/scijava-jupyter-kernel/issues/83 should probably just be a bump on this thread.

At the moment I find myself just wanting to be able to at least exclude some packages from a #@dependency() fetch.

Specifically, I wish something like this could work (even if I have to enumerate everything that needs to be excluded, if it comes down to that):

#@repository("http://maven.imagej.net/content/repositories/snapshots/")
#@repository("http://maven.imagej.net/content/repositories/releases/")
#@GrabExclude('net.imagej:imagej-ops')
#@dependency(group="fun.imagej", module="fun.imagej", version="0.2.4")

ctrueden commented 6 years ago

I think the hardcoded dependencies are a different issue than minimizing the storage requirement of cached JARs using hard (or soft) links.

Regarding dependency excludes: while they "would be nice" for sure, I think it's paramount that we minimize the SJJK classpath, as I described above. Resolving this is one of my main hackathon priorities, so that SJJK Jupyter notebooks are fully reproducible. My first approach will be the provided scope described above, so that we don't ship scijava-common, but do depend on it for script execution. If it works, it will resolve scijava/scijava-jupyter-kernel#83 (but not this issue here!). Ideally, we'll also find a way to gain this benefit for Python 3 notebooks leveraging ImgLyb et al.

kephale commented 6 years ago

Okey doke, got it! Thank you!! Looking forward to this one in particular, it is the last thing holding back fun.imagej notebooks.

ctrueden commented 6 years ago

@kephale Have you tried BeakerX lately? It has a Clojure notebook kernel.

I ask because we are migrating all the existing SJJK-based notebooks to the BeakerX Groovy kernel instead. I asked on the beakerx Gitter channel whether the BeakerX %classpath magic has an exclude feature. As you point out, with Groovy you can @GrabExclude. If the %classpath magic (which works with all BeakerX kernels) does not support excludes yet, I'm guessing a PR adding such a feature would be doable.