vigna / fastutil

fastutil extends the Java™ Collections Framework by providing type-specific maps, sets, lists and queues.
Apache License 2.0
1.78k stars 196 forks source link

Publish an image only using the basic primitives #200

Closed incaseoftrouble closed 3 years ago

incaseoftrouble commented 3 years ago

If I recall correctly, I already added support for building a reduced jar ("fastutil-min"). It might make sense to actually publish this.

Reasoning: Asides from memory usage due to array packing there is, to my knowledge, no reason to use any key/value types except int, long and double, as essentially all operations are only defined for these types and operations on the other "primitives" even incur a performance penalty due to conversion. One could discuss whether boolean types could be included (XY2BooleanMap and BooleanList have some merit I guess). Advantage are smaller image (not that important), less clutter (autocompletion isn't spammed with 100 different types) and faster to work with (indexing fastutil does take a while).

Maybe the "full" fastutil image could even depend on the "min" image, however I guess this is a bit tricky to work with.

EDIT: floats are also natively supported by the JVM, however they come with many other pitfalls, so I still think that they should not be included in the core image

vigna commented 3 years ago

That's a very good idea. Let's work it out. fastutil-core and fastutil-extra out something like that.

vigna commented 3 years ago

I guess choosing int/long/double for fastutil-min makes a lot of sense because these are the types that get special treatment in the JDK.

vigna commented 3 years ago

OK, at this point NO_SMALL_TYPES generates a smaller fastutil (1/3 of the classes). To avoid conditional compilation (i.e., to keep only conditional inclusion) some classes, such as Consumers or Functions, must be included for all types. But we gain a lot of space because of all missing map implementation.

If you wanna have look that'd be great—it's possible that still some class can be removed from the current set. The plan would be to release a "fastutil-core" and a "fastutil-extra" (depending on the first).

incaseoftrouble commented 3 years ago

Is that on master?

vigna commented 3 years ago

Yes

incaseoftrouble commented 3 years ago

Ok, I'll have a look tomorrow / Wed

incaseoftrouble commented 3 years ago

So two small things:

Bigger question: Do we even want the IO package in the "core" image? It is general purpose, but I guess that it mostly is used for (de)serializing, which is also handled by libraries like protobuf / kryo. Other purposes (e.g., reading raw data), seem very specialized, I guess that only a small fraction of users are using that feature. Removing I/O would cut down the image a lot. I could take care of adapting the makefile etc. appropriately. A possible compromise would be to change the IO packages to only use the "basic" types, this doesn't make too much sense though.

This has another implication though which might cause a headache: "Aggregating" classes like BigArraysCommon etc. then need to be conditional on whether one builds the "full" or "core" distribution. But then, "full" cannot depend on "core" anymore, since both distributions need to include that class (with different content), leading to a name clash. This could be solved but requires a lot of fiddling.

Splitting them up completely doesn't sound bad at first (actually, it makes generating the distributions much easier), but leads to complications if, say, library A depends on "core" and library B depends on "full" and library A, leading to a name clash. Users can solve this by overriding the dependency of library A on core with "full" in the respective dependency management, but this needs to be done manually.

vigna commented 3 years ago

Maybe we can simply to three layers: fastutil-core, fastutil-io, fastutil-extra. That would move a lot of interfaces away from fastutil-core.

But, no, the round of modifications I made are exactly to avoid conditional compilation. Conditional insertion is fine, but the classes must be the same in all distributions.

incaseoftrouble commented 3 years ago

What do you mean by conditional insertion?

Maybe we can simply to three layers: fastutil-core, fastutil-io, fastutil-extra. That would move a lot of interfaces away from fastutil-core.

Problem is that BigArraysCommon etc. also use all the types ...

vigna commented 3 years ago

I mean that it's OK if we create jars with separate .class files, but there must not be different .class files (e.g., a BigArrays with only basic types and a BigArrays with the remaining types). The burden does not seem too much IMHO. BigArrays brings in probably only comparators.

incaseoftrouble commented 3 years ago

Hm okay, got it to work. The only quadratic type that was needed are functions (functions are needed by maps, and functions have andThenXY for all types XY, hence all functions). The src/io package also works (FastXYStream etc), 889 classes in total (compared to ~2.2k complete fastutil). I also removed the -D SMALL_TYPES switch, since the java code generation must not depend on whether small types are generated or not anyway.

I was thinking about a) Reference types (this would remove a lot) and b) BigXY (BigList, BigArrays etc.), these seem to be quite specific, too. This possibly removes

techsy730 commented 3 years ago

JDeps may be a good tool to automatically make such splits. Using jdeps -v -R -e 'it\.unimi\.dsi\.fastutil($|.*)' --classpath <path_to_jar_file> {wildcards of classes we want in each distribution}

-v means verbose, giving us full class -> class level -R means recursive, so also look at transitive class deps -e means filter packages looked at to this regex (there is also -p for a literal package, not a regex)

we can produce a dependency graph of what each class "really" requires.

Unfortunately it also spits out a bunch of extra data we don't want (like at the end it goes ahead and shows the dependencies of all of the classes in the jar, even the ones we don't ask for in the command line), so we will need to write a wrapper to parse the output.

Also, jdeps is a tool introduced in Java 9, so that would bump our build time requirements (or at least build time if you split) to JDK 9. But Java 9 also introduces the -release parameter to javac, specifying which Java language level to conform to and even whose Java API to load (which the old -source and -target didn't do, it also implicitly sets these two for you).

vigna commented 3 years ago

Compilation already needs Java 9 as we're already using --release.

jdeps seems a good idea indeed.

vigna commented 3 years ago

I'm not entirely sure -e is what we need. It filters the dependencies, not starting point. Right?

techsy730 commented 3 years ago

Right, but without it shows dependencies onto java. classes, which are things we aren't interested in for this purpose.

vigna commented 3 years ago

I have just committed a split.sh script that tries to do this.

Presently, it generates three jars: core, with ints/longs/doubles/objects, I/O and it.unimi.dsi.fastutil; then references, bytes and chars; and then rest of the classes.

This is the first splitting strategy that came to my mind, but I'm entirely open to suggestions. The number of classes are as follows:

   2938    2938  195772 fastutil-core.txt
   4310    4310  295777 fastutil-refbytechar.txt
   5529    5529  382059 fastutil-boolshortfloat.txt

and this is the jar size:


-rw-rw-r-- 1 vigna prof  5687046 Mar 13 09:58 fastutil-core.jar
-rw-rw-r-- 1 vigna prof  7875255 Mar 13 09:58 fastutil-refbytechar.jar
-rw-rw-r-- 1 vigna prof  9940402 Mar 13 09:58 fastutil-booshortfloat.jar```
vigna commented 3 years ago

I moved reference type to the second block, following a suggestion in this thread.

However, here comes the issue: how should one manage modules in this case? Presently we use bnd to get an OSGi-compliant jar, but I guess we will need to have multiple modules.

vigna commented 3 years ago

I just discovered that in OSGi you cannot split a package across bundles. This makes it impossible in practice to split the jar and being OSGi compliant. Maybe we can have two distributions, a fastutil.jar and a fastutil-core.jar, but that could be problematic, too.

techsy730 commented 3 years ago

Yeah keeping the jars disjoint would be ideal. Would avoid nasty duplicate definition possibilities in many build systems.

I think Java 9 modules support splitting across package boundaries, but I am not positive on that. Kind of pointless thought exercise too as we don't require Java 9 yet (for building yes, but not for distribution format)

vigna commented 3 years ago

I'm reading the OSGi documentation (ugh) and apparently using Require-Bundle it is possible to have a package split between bundles (and for Java 9 we should be able to do it with modules).

Its use is in general discouraged, but it seems to fit our use case. I'd see a situation with:

I'm uncertain whether to call the middle jar fastutil-extra.jar or fastutil-refbytechar.jar. The second name is more descriptive, but really ugly. Thoughts?

vigna commented 3 years ago

OK, "make stage" at this point should do the right thing. The thing I haven't been able to do is to have a POM-only Maven thing called just "fastutil" (no jar associated) that depends on fastutil-boolshortfloats.jar (and thus brings in all jars). Any clue, anybody? The Maven sign-and-deploy plugin needs a file. I might create an empty jar but it looks ugly.

The alternative is to name fastutil-boolshortfloat.jar fastutil.jar. I'm a bit not at ease with this solution because if you look at the artifact in isolation it looks like the old fastutil.jar, and that's not true...

techsy730 commented 3 years ago

I think switching the type of artifact in the pom file from "jar" to "pom" allows you to have a file less maven artifact cleanly. I think that is recommended practice for "dependency only" artifacts

vigna commented 3 years ago

This is what I did, but the plugin requires a file. I ended up uploading the pom itself.

But I'm not entirely convinced. Without a jar we cannot bind the module name it.unimi.dsi.fastutil. So we would be forced to bind the module name to the jar fastutil-boolshortfloat.jar, which is weird, too.

vigna commented 3 years ago

@incaseoftrouble : any comments? You opened this issue so I'd really like to get your input...

incaseoftrouble commented 3 years ago

Sorry, I've been busy moving to a new workplace.

I have no idea about modules, OSGi or maven. Having some kind of "meta" package should be possible (i.e. something that pulls in dependencies without any files itself) but I'm not sure how to facilitate that.

Maybe you have a different naming, "fastutil" for everything, "fastutil-core" for the minimal stuff, "fastutil-ext" (not sure about that one) for core + refbytechar? Then there is no need for a meta package. But I think fastutil-extra or fastutil-extended is good for refbytechar.

But in general, this is going exactly the direction I imagined. Having the classes in disjoint packages that depend on each other is the right way to go.

vigna commented 3 years ago

OK, after a few days on stackoverflow it turned out the only comments you can get there are completely useless—apparently, I am the problem. I think I'm going towards fastutil-core (int/long/double/objects), fastutil-extra (+ref/byte/char) and fastutil (everything else). From the viewpoint of the standard user nothing changes except more jars, but resource-savvy users can choose smaller jars.

vigna commented 3 years ago

I released 8.5.3 with the split jars. Let's hope that everything works as expected. 🤞🏻

vigna commented 3 years ago

So, pretty nothing worked as expected.

vigna commented 3 years ago

My split.sh script misses a few dependencies, and people are having problems with cross-package dependencies.

A possible solution is fastutil.jar, as usual, and fastutil-core.jar for people needing the basic classes. If anybody gets both in the dependency set of a project, fastutil-core.jar will have to be excluded manually.

incaseoftrouble commented 3 years ago

Hm, I guess that should work.

Doing this in gradle should be as easy as

configurations {
    all*.exclude group: 'it.unimi.dsi', module: 'fastutil-core'
}

or something like that...

Gradle also has what is called "feature variants" but they aren't fully compatible with maven.

gunnar-ifp commented 3 years ago

Hi. Is there a specific reason to put floats away into the "rest of it" jar? float and int are 32 bit. double and long are 64 bit. (object are objects.) These primitive types are fully supported by the JVM instruction set with floats being quite fast actually and that's what I would consider the core. Bytes, Booleans, Shorts and Characters may take less memory in arrays, but otherwise are always expanded to int when working with them, and I can imagine using an int collection for them unless I really want the specific one. These types could go in a extra jar (with references).

incaseoftrouble commented 3 years ago

This of course is up for debate but, in my opinion, floats are pretty niche.

So in summary, doubles are recommended over floats by default, switching to them doesn't even necessarily give you a performance boost / might even decrease (especially if they are only a smaller part of your computation), and switching can be dangerous if you don't know much about numerics. Especially, most machines are 64 bit anyway so if the operations aren't vectorized, there is little to be gained AFAIK.

That's my opinion, feel free to argue.

PS: Indeed, the initial post is wrong in the sense that float is fully supported, I'll edit appropriately.

vigna commented 3 years ago

These are the typer for which the JDK has primitive operators. That's the criterion.

gunnar-ifp commented 3 years ago

@incaseoftrouble Yes floats my be niche from your side, but I am looking at fastutil as an "outsider", as a very nice and fast primitive collection framework. And I just couldn't understand how they didn't make the core even though they are one of the 4 primtive types fully supported at byte code levels. I can understand your reasoning and I have no problems with using the full jar. It just felt a bit arbitrary, that's all there is to it. And IMHO two jars, one with the 4 types + objects and one with the rest feels more natural for a "wanna use a primitive collection" outsider like me.

As for floats, for me they are not so niche. They are the types used in PDFBox (which could really benefit from primitive collections, something I plan to work on). I also use them for PDF Layout detection (where I want to use Fastutil with float maps). As for their performance (that has nothing to do with Fastutil), I use them for image scaling (where floating point makes the code nice, clean, and they are on par with integer arithmetics if you don't need divisions). Floats are a bit faster there than doubles, they seem to divide (for alpha) a bit slower and of course I need to touch double the memory when working with them, which is a factor when you calculate 100 million pixels with 4 components each. I do know the limits of float/double, it's a lot of fun to manipulate the raw bits now and then ;). And I did fear to run out of the 24 bits of precision with floats, but even with 16 bit SRGB component values these rounding errors seem to be so tiny, they don't show up.

incaseoftrouble commented 3 years ago

It just felt a bit arbitrary, that's all there is to it.

Well, its not arbitrary in the sense that int, long, and double are the standard things you will use (typically). floats simply should not be used except you really know what you are doing (which is rare!). So, in a sense, this nudges people with little experience into "best practice" :-)

As for floats, for me they are not so niche.

Well, your application is quite niche I'd say. Having millions of data points where each has an independent precision requirement is rare I guess. (I might be wrong!) So, yes, I see the point why you might need floats, but I still do think its sensible to exclude them from the "core" version since they so rarely provide real benefit.

@vigna gunnar indeed has a point that floats also have native support in the JVM (I did not know that until now). But I still think they should be excluded due to mentioned reasons.