Open kleinj opened 7 years ago
Two comments on this:
new BitSet()
by NatBitSets.set()
). Bonus points for using boundedSet(n)
:)int
, double
and long
plus the implementations. This is more or less what is currently needed and I'd bet there won't be much more. The jar will have roughly 2-3MB, if I recall correctly.Re 1: I'd suggest keeping that that as a separate issue at the moment. If I understand correctly, replacing some data storage container with the fastutil counterpart is quite local, as the external interface (List, Set, Map) remains the same -- you might want to iterate differently, of course. For the BitSets, you probably have to touch more places, as Java does unfortunately not provide an interface for BitSet-like classes, right?
Re 2: That might be an option. If we can identify a reasonable subset that's more or less stable, that would remove some of the pain. How often do you expect we'd update the JAR due to releases? In the last time, there seem to have been releases ~monthly. For the ~3MB jar, would that include the source files for the included classes?
Do you have experience how stable fastutil is in general? I.e., bugs, API changes?
1) In theory, yes. To make significant practical use of fastutil, you have to change the collection types to the primitive versions I think. Otherwise only the generic methods are available and then there will be even more boxing and unboxing. But it definitely is less invasive. Migration to naturals-util can be done incrementally, though, as there are some utility methods to wrap BitSet
into IntSet
. Yes, java doesn't provide BitSet
abstraction. I added an interface doing this to naturals-util (NatBitSet
). My plan is to also add support for roaring bitmaps (which supposedly are even faster than SparseBitSet
).
2) The frequency of releases is partly due to me pushing a lot of stuff, updating fastutil to Java 8. There are some more lingering changes (some more default methods) and potential implementation improvements (specific implementation of the new Java 8 collection default methods), but apart from this I don't expect too many releases. If something is released, it most likely only will be an addition (NavigableSets hopefully ...). The maintainer cares a lot about API stability and backwards compatibility, so I guess we seldom will need to forcefully update.
Ok, thanks for the info. If you don't mind, could you perhaps prepare a pull request with the JAR approach you preferred, i.e., with a selection of fastutil classes that seems reasonable? Preferably including the sources in the JAR. That way we could play around a bit before commiting on a particular approach. It might also make sense to include one or two conversions of existing PRISM classes to using fastutil, to get a feel what's required?
I won't be able to take a closer look for the next 2 weeks, but would look afterwards.
Can you add me as contributor so that I can directly create branches etc? Keeping forks in sync is tedious.
I compiled a short list of different implementations and their use cases:
DebugJDD
. Optional, costs ~300 KiB (!).I think that restricting ourselves to the Object and Int things is fine. Only those are really useful in hot code I guess. Pre- and post-processing can easily do with boxing.
Without the Long stuff, the minimized jar (no sources) is at 1.1 MiB. Removing the Double and 2Boolean should push it down to ~600KiB.
The main question is how restricting we want to be. If we are willing to pay a few megs, we can easily be "future-proof" by including everything dealing with the primary data types int
, long
and double
. This jar is ~5MiB but will most likely only change for new fastutil releases.
Both approaches already are at least semi-automated. The "whitelist" approach needs a vanilla fastutil jar and the whitelist to create the minimized jar, the second can be generated by a make-option I added (make NO_SMALL_TYPES=1 sources
). Moreover, this might be added as a standard release (i.e. fastutil being shipped as fastutil-full and fastutil-min).
Hi @incaseoftrouble. Happy to add you as contributor, but our plan for now was to avoid creating branches on the main repo and use forks for most development. I don't think it's really much more trouble to keep a local copy in sync is it?
@davexparker Sure, that should work, too. I'm just not used to it too much :)
Hey folks, small update: sorry for the delay I was / am busy with travels. Hope that I can get something reasonable pushed this weekend
Hey! I added some primitive adaption in https://github.com/incaseoftrouble/prism/tree/fastutil-adaption
I removed / replaced obvious things in the common
package - these changes had virtually no impact on the other code (in terms of changes). To integrate fastutil a bit further, IntSet
should be migrated. For this, naturals-util
should be used, to get the BitIntSet
classes. These classes provide all the methods of the current IntSet
and more - except the reverse iterator, which I can add quite quickly if needed. This would make IterableBitSet
and IterableStateSet
more or less obsolete and the common
package can be reduced further.
except the reverse iterator, which I can add quite quickly if needed
Did that! See https://github.com/incaseoftrouble/naturals-util/
bump
I'll be able to work a bit more on PRISM again.
bump again
I merged some stuff, code can now be found here and a diff (without whitespace changes) here.
After this is merged there are some more simple, under-the-hood optimizations that can be implemented but I didn't want to pack the change too much. E.g. I implemented a variant of Distribution which has significantly faster access times (O(log n)
via a sorted int[]
-key-array) while using less space.
fastutil (github: https://github.com/vigna/fastutil) is a Java library that provides variants of Java's generic data structures, specialized for the primitive types and thus avoiding boxing ints, doubles, etc into Integers, Doubles, ...
This often improves performance and memory requirements, so it has been used by several groups in their PRISM extensions. We'd thus like to make it available in the main PRISM.
fastutil uses a templating approach to generate the Java source files for various instantiations (
Long2BooleanMap
, ...), which results in a large number of classes, leading to the question of how to package it for PRISM in a smart way - without unnecessarily increasing compile times, download size/time and git repository size. @incaseoftrouble and I had some discussions about the options:lib
. Currently, that would be a 17MB binary blob. It would also change every time we want to include a new fastutil release. To get convenient JavaDocs for the classes in the IDEs, we'd have to include the source files (15MB compressed, 77MB uncompressed) in the JAR.lib
. There are tools that can automatically cull a JAR file to only include those classes that are actually used. That would probably reduce the JAR size quite a bit, because there are a ton of fastutil classes that will probably never be used by PRISM (e.g., all 'short', 'byte' and 'float' specializations). There is then the complication that every time some PRISM developer wants to use some previously unused part of fastutil, we'd have to rebuild the JAR and drop a new binary blob. In particular, that would complicate merging of pull requests.jdeps
tool to find the dependencies and write the corresponding.java
filenames in a text file. This would allow some automatic way to copy/keep exactly the required files. Benefits are that merging PRs, updating to a new fastutil version, importing additional fastutil classes would be regular git operations on the Java sources.If the actual number of files that are needed is reasonable, my personal preference would be for option 3. In that case, I think the following approach makes sense: You'd have one text file that lists the fastutil dependencies that result from the PRISM source, i.e., those classes that are directly imported into PRISM source code. In a seperate step, one could then generate the list of fastutil dependencies for those classes, i.e., the dependencies inside fastutil, and include all these
.java
source files. In addition, it should be possibly to manually add fastutil classes to the first list, e.g., because one knows that those classes are generally useful or used in some branch. It would then be sensible to have some script support to automate the usual tasks (update to a new fastutil version, updating the java files according to the dependencies, ...).