typelead / eta

The Eta Programming Language, a dialect of Haskell on the JVM
https://eta-lang.org
BSD 3-Clause "New" or "Revised" License

Design an immutable/mutable interface for Java Arrays #940

Open · rahulmutt opened 5 years ago

rahulmutt commented 5 years ago

Java arrays are currently mutable by default, which makes them awkward to work with. As suggested by @GordonBGood, it makes sense to adopt an interface similar to the array package: http://hackage.haskell.org/package/array-0.5.3.0/docs/Data-Array-MArray.html
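
For reference, the interface style in question looks like this against the standard array package (a minimal sketch; this is ordinary GHC code, not Eta-specific):

import Data.Array.ST (STUArray, newArray, readArray, writeArray, runSTUArray)
import Data.Array.Unboxed (UArray)

-- Build a table by mutating an STUArray in place, then freeze it into an
-- immutable UArray; this is the MArray/IArray split the issue proposes.
buildTable :: Int -> UArray Int Bool
buildTable n = runSTUArray $ do
  arr <- newArray (0, n) True                 -- allocate, all elements True
  writeArray arr 0 False                      -- mutate in place
  mapM_ (\i -> readArray arr i >>= writeArray arr i . not) [1 .. n]
  return arr                                  -- runSTUArray freezes the result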

GordonBGood commented 5 years ago

@rahulmutt, yes, I'm glad you see this as the way forward.

Would you like to work on this yourself or would you like to "farm it out"?

If you would like someone else to work on it, where do you see the changes being made? One way would be to make further patches to the array package we get from Hackage, redirecting all the primitive calls to the equivalent Java calls, just as is currently done inside the Java.Array module. That may be the best way, since the array package could then be used on demand just as it optionally can be now, and using Java arrays as the primitives should make it faster.

If we do that, do you think there will be any performance difference between using the standard ST monad and the Java monad? From the code I've managed to find so far, the primitives appear to do the same thing.

If this works out speed-wise, I really don't see a need for the Java.Array module at all other than backwards compatibility, though FFI is still necessary and one could still create a JArray of objects just as shown there. However, as one can only use pre-evaluated lists (and primitives) to transfer information out of such arrays, I don't see them as very useful (as discussed previously in issue #934).

rahulmutt commented 5 years ago

I have my hands full with direct interop, so I personally won't be able to work on this anytime soon, but I would be happy to assist you in implementing it.

I think patching the array package is the way to go. If anything, the ST monad should be a tad faster: it takes and returns a state value that compiles down to nothing, while the Java monad shuffles around a target object value, which we have yet to optimize at the bytecode level.
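
Roughly, the difference being described (the ST shape is GHC's actual newtype; the Java shape is paraphrased from the description above rather than copied from Eta's source):

newtype ST s a   = ST   (State# s -> (# State# s, a #))  -- state token erases to nothing in bytecode
newtype Java c a = Java (c -> (# c, a #))                -- must thread the real target object c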

Java.Array will still have its use. Direct interop will assign a canonical type signature for a Java method and we need to be able to assign an Eta type to a Java array. Perhaps we can add conversion functions that can translate the types from Java.Array to those types in the array package.

GordonBGood commented 5 years ago

@rahulmutt, okay, we are agreed on what looks to be the best way forward and I would like to contribute if I can.

I'll take your word for it that Java.Array still has its uses (other than backward compatibility). As to conversion functions, direct conversion might be somewhat difficult, as it is hard to produce values out from under the Java monad without either some "under-the-covers" magic or some monad machinery such as transformers, interpreters, etc., and the latter aren't good for performance. The fall-back conversion method is just to use what is already there: converting a JArray to an Eta list (not ideal, since the list is pre-evaluated as it is produced from the last element), and from there we can make an IArray (and in turn an MArray) using the list-based constructors.

It depends what use you see for the "old style" JArray as to what might be worth doing. If you see these facilities being used often, it might be better to do some primitive manipulation, copying or cloning the underlying Java array to convert a JArray to/from the underlying Java array in the array package. One question is where to do that, as neither the Haskell array package nor the Java.Array module is loaded unless necessary - which makes it sensible to use an always-available default structure such as a list as the conversion vehicle.

So, to get started, I modify the array patch? Is there an easy way to make changes to the patch, test them, and iterate using just etlas?

rahulmutt commented 5 years ago

If you do make the change to use Java arrays inside of array, direct conversions can be made pretty efficient by using unsafeCoerce correctly, because the underlying representation will be the same regardless of whether the array is immutable or mutable. I can take care of that part.
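
For instance, if the patched UArray representation ends up being the very same Java int[] that an Eta wrapper type wraps, the conversion could be a no-op at runtime, along these lines (the names are hypothetical, just to show the unsafeCoerce idea):

import Unsafe.Coerce (unsafeCoerce)

-- Assumes both types erase to the same underlying Java int[]
-- (JIntArray as in Eta's Java.Array; UArray as patched):
jIntArrayToUArray :: JIntArray -> UArray Int Int
jIntArrayToUArray = unsafeCoerce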

Instructions on how to submit a patch are here:

https://github.com/typelead/eta-hackage#patching-a-library

GordonBGood commented 5 years ago

In order to get started on this, I will need a little help from you, I'm afraid.

Although I've done quite a few changes and PRs on GitHub by forking a repo, cloning it locally, creating a branch, committing the desired changes to that branch, pushing the commits to my fork, and then opening a PR, it seems that the patch process for eta-hackage may be different.

If I understand correctly, the steps to proceed are as follows:

  1. I need to clone the appropriate GHC array version tag locally (it looks like 0.5.2.0 is as high as we go, as 0.5.3.0 explicitly says support for GHC < 8.0 has been removed) in order to get a copy of the array package source code.
  2. After making a backup of the cloned local copy of the array package source, make changes to one of my local copies to add the capabilities we require.
  3. Create a "patch" file based on the "diff" between the original array package and my modified one in yet another directory; I have found tools that do this.
  4. Then it looks like I can either use etlas with the "--patches-dir=DIR" option to point it at that new directory and build using those patches, or back up and change the patch file(s) in my local clone of the eta-hackage repo at "%USER_HOME%\etlas\patches\patches".
  5. Once I'm happy with how it works as per testing, I submit a PR for the changed patch file(s) to the eta-hackage repo.

Do I have it right?

A few questions:

  1. How do I clean old versions of the patched array package from my local caches when I want to try something else? I see quite a few things related to it in my "%USER_HOME%" directory, including at "%USER_HOME%\etlas\eta-0.8.6.4\array-0.5.2.0-Cz0W86mUCERJBKUYgo5Y3t", and there are also ".tar.gz" compressed package files for each version of eta on my machine, such as at "%USER_HOME%\etlas\binaries\cdnverify.eta-lang.org\eta-binaries\eta-0.8.6.4\packages"; these contain the ".hi" compiled interface files for the packages. I'm afraid to just delete these, as there may be some database reference to them on which etlas depends, and deleting them might break the database.
  2. Is there a way to use an etlas command to clean these in order to load anew?
  3. BTW, how does one get etlas to remove old versions of eta from the local installation? I assume it may be something similar?

If you can give me just a few pointers to get started, the actual work including testing isn't likely more than a few days, as the whole array package isn't that large and we will be changing only a small part of it.

Of course, the very easiest thing might be to work at an even lower level and just change the GHC.Prim hook-ups related to primitive arrays, but I understand that no real file exists for GHC.Prim; these are hook-ups that the compiler just "does", and changing them for, say, the primitive byte array ops would require bigger changes? If it were possible, it might mean very few patches would be required at all. Then again, maybe it wouldn't work so well, as the Haskell arrays deal only with byte arrays and handle the different primitive sizes "primitively" by unsafe casting. I'll start with the patch approach.

rahulmutt commented 5 years ago

Here are the steps:

etlas get array

will fetch the latest version of array supported by Eta. Etlas will initialize a Git repository committed at the state at which the package was downloaded from Hackage, then apply any existing patch and stage the changes for you automatically (just do cd array-0.5.2.0 && git status to observe it).

Now all you have to do is work on your changes; once you're satisfied, just commit them and use the git format-patch command, as stated in the steps on eta-hackage, to create a patch file for your changes.
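
The edit-and-patch loop is roughly this (the commit message is illustrative; the patch file name follows the package-version naming used in the eta-hackage patches directory):

cd array-0.5.2.0
git add -A
git commit -m "Implement arrays in terms of Java arrays"
git format-patch --stdout HEAD~1 > array-0.5.2.0.patch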

Now, since array doesn't have a test suite, you'll probably want to run your own programs to verify your patch works. You can do so by creating a new Etlas project, adding a dependency on array in your .cabal file, and creating a cabal.project file like so:

packages: . $PATH_TO_ARRAY

where you replace $PATH_TO_ARRAY with either an absolute or relative path to the folder where your modified array lives. When you do etlas build, it will automatically pick up the local version of the package and use that instead.
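
For example, if the patched array checkout sits in a sibling directory (the paths here are illustrative), the cabal.project would read:

packages: .
          ../array-0.5.2.0

with the test project's .cabal simply declaring build-depends: base, array as usual.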

rahulmutt commented 5 years ago

Any time you make changes to your local array and run etlas build in your test project, it will automatically recompile the changes and use the up-to-date version.

Also, we did consider using byte[] directly when implementing the primitives in GHC.Prim, and it didn't work out: the way the primitive APIs are designed does not allow us to. In the very early days we used ByteBuffers as the underlying representation of Ptr, and it led to bad memory leaks, which is why we built a sophisticated emulation layer instead.

GordonBGood commented 5 years ago

@rahulmutt, thanks for the steps; with the help of that and the instructions in the Readme file I am on my way.

Your etlas build system is actually quite amazing for a project in its early stages, and Eta's ability to both interface with Java libraries and almost directly use Haskell ones opens up a very large wealth of resources; I can very much see why you chose to go the Haskell route. This is a huge conglomeration of resources!

I quite see how ByteBuffers wouldn't work, after considering the details of how the Haskell code would see them and how they would have to support Haskell's requirements. I can also see that, while no doubt better, this second evolution of using copy operations to marshal data between Java buffer areas and Haskell-expected data isn't going to perform all that well, just as experienced. I really think the solution proposed here should be the very best one, in that the Java bytecode will be dealing with arrays just as it is most tuned to do, while the higher-order Eta/Haskell code still sees arrays just as expected. The only thing that still won't work very well is low-level Ptr operations, but those were never going to be fast, and if we do support them in some form it won't be any worse than the current implementation.

Now to get to work and just get it done.

GordonBGood commented 5 years ago

@rahulmutt, I'm into it, and see what you've done, but need a brief discussion, although I don't want to waste your time.

I perused the current state of the code and made a preliminary plan, then slept on it as that produces the best results.

That produced the following conclusions: we can wrap conventional arrays inside IArray's and MArray's, with one major, likely deal-breaking limit - one can't implement castSTUArray without a copy, and a copy would destroy the reason for using this unsafe routine: treating primitive data as some other kind of primitive data. One could perhaps overcome that, at a cost in performance, with some sort of copy management that would make unsafeFreeze no different from a regular freeze, either in all cases or just when the array being frozen has been cast to a type different from the original. That method would still be an option, and it would bypass all of your current primitive operations altogether. However, the copy operations (when required) wouldn't be very fast, as we would likely have to deconstruct the primitives manually and rebuild them for each possible conversion.
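
For context, this is the operation in question, from the standard array API (the usage shown is a sketch):

import Control.Monad.ST (ST)
import Data.Array.ST (STUArray, castSTUArray)
import Data.Word (Word8, Word64)

-- Reinterpret the same underlying bytes as 64-bit words, with no copy;
-- both handles alias one buffer, which is exactly what a copying
-- emulation would break. (Note castSTUArray keeps the index bounds.)
asWords :: STUArray s Int Word8 -> ST s (STUArray s Int Word64)
asWords = castSTUArray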

Let's hold that as a possibility and look into the alternatives:

  1. You have already pursued direct use of byte[] and found it not to be satisfactory, and I can see that wouldn't work any better and likely worse than the above plan. In fact, since we are completely bypassing primitives here, it could be made to work here where it didn't before, but there is no advantage over the new plan.
  2. You said you tried using ByteBuffer but had problems with memory leaks, without giving many details. One criticism I might make of your memory-managed ByteArray emulation class is that the direct memory it wraps is pinned, where I see no need for that. Although the Haskell primitives do allow for a pinned ByteArray#, it is usually only used for the rare case of low-level C-like pointer operations, which we would never use in Java, so I don't see the need for the allocated memory to be pinned. Why did/do you think it needs to be? If you did it to fulfill the requirements of the Haskell pinned ByteArray primitive, then your ByteArray emulation could be used only for that (which will never be used from Java), and something else used where a pinned ByteArray is not needed, as the Haskell primitives make an exact distinction between the two.
  3. If allocated memory doesn't need to be pinned, then why allocate it outside of the Java Heap at all? It would be much easier to just let the JVM handle memory with its own GC, at which it is very efficient.
  4. If we are no longer emulating memory management, then ByteBuffer may become a very good option, as it has been tuned for performance in the JVM and offers the required facilities to get/put primitives of different byte sizes (see the sketch after this list).
  5. If we can use ByteBuffer, then we don't need the inefficient copy operations at all.
  6. By patching the Data.Array.Base module to bypass your primitives, we don't actually have to change the primitives right away or ever, so this can easily be experimented with.
  7. It seems to me that using ByteBuffer by patching is likely a worthwhile experiment unless you see some reason it can't work.
  8. Relative performance between ByteBuffer and Java primitive arrays can be determined beforehand without doing any changes/patches to the modules by a set of benchmarks using each.
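
The sketch referenced in point 4: the unboxed reads and writes could presumably bottom out in ordinary java.nio.ByteBuffer calls imported through Eta's FFI, something like the following (the Java methods are the real ByteBuffer API; the exact Eta import spelling is my assumption, and it assumes import Java and import Data.Int (Int64)):

data ByteBuffer = ByteBuffer @java.nio.ByteBuffer
  deriving Class

foreign import java unsafe "@static java.nio.ByteBuffer.allocate"
  allocate :: Int -> Java a ByteBuffer

foreign import java unsafe "getLong"
  getLong :: Int -> Java ByteBuffer Int64

foreign import java unsafe "putLong"
  putLong :: Int -> Int64 -> Java ByteBuffer ByteBuffer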

So, barring a response from you that you know something different, my plan forward is as follows:

  1. Spend a day or so doing relative benchmarks between ByteBuffer and Java primitive arrays including the relative cost of copy operations between different types.
  2. Depending on which gives the best general performance, bypass all of your primitives by patching the array package to use whichever seems best.
  3. The code will almost certainly not use your ByteArray emulation, and will only do low-level pointer copy operations for the "handle" operations (which, again, will never be used, as they are meant for direct interfacing with C-type operations). Since it seems your MemoryManager already deals with ByteBuffer, this should work directly to fulfill that contract (even though it will likely never be used).
  4. When this is finished, including testing and benchmarking, you might consider changing some of your primitives to support this directly. If that were done, the patches required would again go back to the current very minimal set.

rahulmutt commented 5 years ago

The plan seems good. I agree that implementing them directly vs using the primitives is the way to go for performance.

To clarify, I said that using ByteBuffer as a direct representation of Addr# (the primitive underlying Ptr a) caused problems because, to preserve referential transparency in all the Addr# operations, we had to duplicate the ByteBuffer, which amounts to an allocation on every pointer addition, subtraction, or multiplication - clearly unacceptable.

To solve that problem, we changed the underlying representation of Addr# to Java's long, representing a memory address, and worked on a MemoryManager that provides operations to retrieve values given just a long. The MemoryManager even provides an API to retrieve a ByteBuffer that represents the entire allocated region pointed to by the long. Pointer operations became super cheap, at the cost of pointer dereferences becoming slightly less efficient (think a couple of bitwise operations + array accesses). Bulk operations on memory are still just as efficient.

A ByteBuffer is just an interface that allows reading and writing bytes in a structured way. You can wrap a byte[] into a ByteBuffer, or you can wrap native memory into a ByteBuffer via ByteBuffer.allocateDirect. The MemoryManager supports both and has distinct address spaces for heap-managed memory and direct native memory, depending on whether unpinned or pinned memory was requested. It takes care of keeping track of free regions and responding to allocation requests accordingly, reusing free regions when it can. Just think of it as a malloc implementation on the JVM.

GordonBGood commented 5 years ago

@rahulmutt, it looks like using ByteBuffer, with primitive-sized views into it, adds only very little overhead compared to using primitive Java arrays directly, and it makes it easy to support casting an STUArray to another primitive type.

Syncing writes would be complicated if the various STUArray's were based on standard primitive Java arrays, since arrays that have been cast to other primitive types may each be modified at different times, even though such use is likely to be fairly rare in Java. If we had "dirty" flags and a backing store (likely just a ByteBuffer), checking and syncing those stores would add more overhead than we would have gained by not using ByteBuffer in the first place, and the code would be quite complex.

The best scheme I can come up with for using primitive Java arrays directly is to define a wrapping data type replacing ByteArray#: an optional type that holds one of the Java primitive arrays directly until the STUArray is first cast, and switches to a view of a ByteBuffer of the same primitive type once it is cast, with all subsequent casts or copies using the ByteBuffer branch. In the common case the primitive array is used directly; in the less common case of a cast, accesses go through the shared ByteBuffer reference until the array is frozen, at which point the ByteBuffer is copied back to a regular Java primitive array. This shouldn't cost anything when the array is never cast, as the compiler should recognize the unmodified state and apply the direct access methods as usual; there would only be a cost in the uncommon case that the array is cast. I'm worried, though, about avoiding a run-time check on whether the array has been cast before every operation on the data, which would add overhead and cancel the benefit.
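
A sketch of that wrapper type (the names are hypothetical; the idea is one constructor per directly-held Java primitive array, plus a ByteBuffer branch once the array has been cast):

data UArrayStore
  = ViaIntArr  JIntArray    -- direct Java int[]; the common, never-cast case
  | ViaLongArr JLongArray   -- direct Java long[]
  -- ... one constructor per primitive array type ...
  | ViaBuffer  ByteBuffer   -- shared backing store once castSTUArray is used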

This wouldn't affect IArray's, which can't be cast and can therefore use standard Java arrays directly; if the creating STUArray had been cast, the underlying ByteBuffer would already have been copied back to a standard Java primitive array on freeze.

If we went that way, we would have to add documentation advising against using castSTUArray without considering the performance cost. One of the most common reasons for using it is to improve processing speed - for instance, doing popcount 64 bits at a time instead of 8 - and that may or may not still be effective; another is to simulate simple byte-packed record operations without defining a new STUArray instance, and again that may or may not be effective.

I would like to have a "raw" way of doing fast memory copy operations as a static "copymem" method, but although the DotNet CLI/CLR has an unsafe Marshal class that provides this, I don't know of anything equivalent in Java.

In the end, since the use of ByteBuffer views works so well, I say: why go to all of this complexity?

My current plan, once I finish a little more performance testing, is to implement the array patches using only ByteBuffer to see where we stand on speed, and then, once that is working, to consider refactoring the common cases to use direct Java primitive arrays where ByteBuffer isn't needed, retaining the ByteBuffer code for the "cast" cases. If supporting both casting and direct Java array use has a performance cost (or, as I anticipate, little performance gain), we will have to evaluate which road to take then.

This has gone in a considerably different direction than I had thought when I started!

rahulmutt commented 5 years ago

Thanks for the progress update. In the end, we'll let the benchmarks decide which way to go.

In terms of efficient copying, System.arraycopy is what you're looking for. Unfortunately, ByteBuffer doesn't appear to have an equally efficient API.
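
For reference, System.arraycopy(Object src, int srcPos, Object dest, int destPos, int length) could presumably be brought in with a static import in the style shown earlier in the thread (the Eta spelling, and the Object wrapper type for java.lang.Object, are my assumptions):

foreign import java unsafe "@static java.lang.System.arraycopy"
  arraycopy :: Object -> Int -> Object -> Int -> Int -> Java a ()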

GordonBGood commented 5 years ago

@rahulmutt, yes on letting the benchmarks decide...

I had found System.arraycopy, but it can only copy between arrays of the same type, so a cast copy would require manual conversion. The .NET CLI/CLR Marshal class has a very low-level copy-by-bytes from pointers (IntPtr), meant for interop with C code, which drops right down to the lowest level but is very fast.

I think ByteBuffer will work out quite well for the UArray's, without all the complexities: there won't be much copying required, as one can just change the view type, and when copying is required it will be of the whole structure, for which there are quite a few tools.

To get this done, I've had to bypass all the primitives to do with Array#, ByteArray# and MutableByteArray#, which means replacing quite a lot of the functionality of GHC.Arr. In the interest of saving compile time, I've switched to developing the interface outside the array-0.5.2.0 folder, testing it there, and then patching it in using the procedure you recommend. Seeing how buried this is in GHC.Prim makes me realize how much easier it would be if the primops were changed as per this new interface, retaining your emulation layer when working with Addr#, pinned arrays, etc., and using the new path for the better-than-ten-times speed increase otherwise. Anyway, we'll see what you think when it's done.

rahulmutt commented 5 years ago

You can try sun.misc.Unsafe, which is probably the closest equivalent to Marshal. The Eta runtime already includes a utility to fetch an Unsafe instance which you can probably re-use. It unfortunately does not have clear documentation because it's not supposed to be used (though many widely used Java libraries do use it), so you'll have to read the source or browse the web for blog posts.

You could also define a wrapper type around ByteBuffer that keeps track of the underlying element type via an enum or an int, and use that wrapper throughout (ByteBuffer itself can't be subclassed outside java.nio). If the type checks can be pushed to the Java side, it might be more efficient.

Yes, I'm open to adding new primitives into the compiler - I just need to get an idea of the type signatures required, so let's see how that pans out first. In the long run, it will be more efficient, since the compiler will treat each one as an intrinsic and directly write out the bytecode, avoiding the minor overhead of a foreign call.

GordonBGood commented 5 years ago

@rahulmutt, if we were to use the sun.misc.Unsafe class for fast byte copies, isn't there some risk that it might not be available on some platforms, especially more specialized ones such as Dalvik or whatever Google eventually uses to replace it? There is also some risk that it will change across platforms, as it isn't part of the specification. It appears to be difficult to use and I would rather not go that way. So far, hoping the use of ByteBuffer works out, there hasn't been any need for copies that can't be done very quickly just by cloning the backing byte[] store, which is optimized by the JVM itself and thus very efficient.

As to wrapping ByteBuffer to track representations of different types: since I am using ByteBuffer as the backing store only for the "Unboxed" array types, and since there are already instances for each UArray element type, the Eta compiler already knows the type and the parameter types of the functions that work with it, so I don't see a need to add anything. Haskell/Eta have stronger type checking than Java, and since that checking is done at compile time, it won't affect performance (your suggested method could actually cost some performance, checking the buried type tag as if it were a dynamic type). I'll leave this one alone for now, too.

I am already using your primitives when I can: I use the jobjectArray# primitives for the Java Object[] arrays backing boxed Arrays, and would like equivalent primitives for working with ByteBuffer more directly (there will be considerably more primitives needed to support ByteBuffer, as it does so many more things!). I agree with you that this should make it even more efficient.

As to progress, I am coming close to completing work on the Boxed versions of the arrays, but the Unboxed versions may take a little longer since I'll have to invent more things as to the interface for ByteBuffer.

GordonBGood commented 5 years ago

@rahulmutt, as to primitives, the very best way would be to make the "standard" Haskell/Eta primitives just do what we are currently doing on top of them or in bypassing them; that way, the compiler would emit Java bytecode for the whole works and it should definitely run faster. It also has the advantage that more Hackage packages won't need to be patched at all but will work straight out of the box.

GordonBGood commented 5 years ago

@rahulmutt, I am wondering if you can help me:

I've emulated the primitives and implemented arrays using standard Java Object arrays, using your jobjectArrayNew#, jobjectArraySet#, jobjectArrayAt#, and alength# primitives to do it, thinking that they access the low-level bytecodes and should be faster. Everything type checks and compiles except for one thing: when object arrays are first created, they are filled with arrEleBottom, which is purposely untyped so that it can be used with any type of array. With real primitives, type checking is bypassed for this and everything just works, but with my emulated primitives, written in Eta and calling your primitives, everything must type check; these primitives expect definite object types, which leaves ambiguity over what type of object array to create.

There are a few solutions, though none ideal:

  1. Use FFI-called Java code to write the elements out as plain untyped Objects, which would be compatible with any subtype.
  2. Use the reflection interface, through the Array static type, to put this in place on initialization, but I'm worried that reflection might be slow.
  3. Use a Java interface of some sort to fill the Object array with nulls (actually, I believe that's automatic on creation) and then raise the standard error message when a null is detected, but that requires exception processing through Java.

None of these seriously bother the work if they aren't too slow, but they are a concern.

The other thought that just came to me: although unboxed primitive arrays are slow because they go through the memory-allocation emulation interface, there would be no need to emulate that interface for normal boxed Arrays, so they may run at normal speed. I haven't gotten around to testing this, but I'm sure you remember how they work and can advise me.

For now, I'm not going to fill the arrays and just leave them full of nulls by default and will worry about this later.

GordonBGood commented 5 years ago

@rahulmutt, I've done some more testing and believe your "primitives" for boxed arrays just use your Java.Array module, so they should be about the same speed as Java using object arrays, which (of course) is slower than using unboxed primitive arrays - I measure about 6.5 times slower.

So I'll just let normal arrays default to the normal behavior. However, testing has revealed a problem in your Java.Array module: due to the way your Java monad is defined, it breaks the compiler's ability to do proper Tail Call Optimization (TCO), and the resulting stack of thunks causes a stack overflow (SO) in code that would normally run. I'm not sure I'm equipped to handle this as part of my project here, and you said your current focus is improving the FFI, so perhaps you could look into it? Perhaps it is just a lack of rewrite rules for your monad compared to the standard Haskell ones? Or are there differences in using runRW# and needing to run extra transformers when operating under the extra level of the ST monad? I am thinking that the "primitives" shouldn't be dealing with other monads or with runRW# at all, as they already track mutability through the ST monad system.

I think that the entire FFI system needs to be overhauled, but you know that and that is what you are doing.

To cause the SO, just change my primes benchmark to use STArray instead of STUArray, along with runSTArray instead of runSTUArray. The expected behavior is the same result as before, just slower, but I think you'll find an SO.
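
The swap being described is just this (a skeleton, not the actual benchmark):

import Data.Array (Array)
import Data.Array.ST (STArray, newArray, runSTArray)

-- Boxed variant: replacing runSTUArray/STUArray with runSTArray/STArray
-- like this is what triggers the overflow in the full benchmark.
primesArr :: Int -> Array Int Bool
primesArr n = runSTArray $ do
  arr <- newArray (2, n) True
  -- ... sieve loop writing False for the composites ...
  return arr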

I'm going to set this part of the project aside and work only on the unboxed arrays.

rahulmutt commented 5 years ago

@GordonBGood If you can file a separate issue with a minimal reproduction example, I can look into the SOE. The runRW# is there to prevent the compiler from lifting the computation out, which caused weird bugs in the past (#171).

The FFI system is being overhauled right now as #647 is being implemented.

GordonBGood commented 5 years ago

@rahulmutt, re: the SOE occurring when calling Arrays through your primitives that hook into the Java.Array module - I'm starting to think it may be a manifestation of the same problem I had before in issue #934, where the latest version 0.8.6b4 changed strictness behavior and the SOE arose from the stacked evaluation thunks; I'm trying to prove it one way or the other. If it is something new, I'll open a new issue; if it is the same problem, I'll re-open that one. At least you have been made aware that there is something...

Yes, I see the need for something like runRW# but wonder if there might be another way. I'll try to investigate...

For the record, I did test the use of the reflection interface Array.set/get and, as with most uses of reflection, it is dreadfully slow and not useful for our application.

I've concluded that the best option for IArray's is the primitives you already have, and I will just work with you in testing to see what their limitations or problems are.

So this project is now just to improve the Unboxed versions of UArray/STUArray.

GordonBGood commented 5 years ago

@rahulmutt, as to the strictness problem I had before, I can't reproduce it in a small few-line example; it only occurs when I use it as a benchmark where the counting is backed by the construction of a data structure that may optionally be lazy unless forced, as when I mentioned it, so I have held off reporting an issue. In fact, the reason I closed the earlier issue was that the behavior is essentially the same as it would be in Haskell, and therefore not really an issue, just unexpected. I could open another issue mentioning it and posting the code that causes it if you would like, but I don't want to over-report issues that aren't really our concern.

As to this second evaluation problem, where using STArray is less strict than using STUArray: I just checked, and it's the same in GHC Haskell, so we can't fault Eta there either. It does make the STArray type harder to deal with unless we "cheat" and use the Strict extension, though.

So back to the main problem: while I was checking out the above, I did a few more benchmarks, and it turns out that even though you may be using Java Object arrays under the covers of the "primops", Eta with STArray is still slower than Java with Object arrays by a factor of six or more, so it still needs some work. And I reconfirmed that Eta with STUArray is over ten times slower than Java with primitive arrays, which is even worse.

I'm back to the original plan, which may take a little more time than expected...

GordonBGood commented 5 years ago

@rahulmutt, on the subject of your implementation of pinned Addr# using a wrapped version of ByteBuffer: why do you need to emulate a pinned ByteArray, or memory outside of the Java GC'ed heap, at all? The only reason I can see is JNI, for interfacing with C-like languages; that is very slow and seemingly meant for "bulk" compatibility, moving large increments of work or data between Java and those languages, since Java itself has no other provision for direct memory references that I know of. Does Eta support JNI?

The reason I ask is that I am using this project as a test case to try to convince you to replace all the primitives for Array#, ByteArray#, MutableByteArray# (and likely the related SmallArray# and ArrayArray#) with my new faster primitives. The only relationship between Addr# and these is that one can obtain an Addr# with byteArrayContents# and then use it to access the byte array; the note says the byte array should be pinned, but in this case the Addr# is just a byte index into the byte[] data store, so it wouldn't need to be.

As I said, one would need pinned addresses if one were passing them to JNI, plus a means to access "pinned" arrays passed in from JNI (if they weren't copied in), but that would be it, and it could be handled separately by having a couple of kinds of Addr# if necessary - even if that were not so efficient, since JNI isn't very efficient itself.

So, can Eta use JNI?

rahulmutt commented 5 years ago

Honestly, I just wanted to preserve semantics in the beginning and then the goals changed to just doing what's right for Eta and deviating whenever it made sense. This may be one of those cases where it may be useful to deviate. We do want to add support for JNI at some point in the future, but not a huge priority right now.

Try this experiment and see how it affects performance:

You can disable all uses of native memory by changing this line: https://github.com/typelead/eta/blob/master/rts/src/main/java/eta/runtime/io/MemoryManager.java#L59 to

long address = globalManagedHeap.allocateBuffer(n, false, Capability.getLocal());

and then run eta-build rts-clean so that the build system picks up the new RTS. You need to build eta from source to be able to do this, if you haven't already.

GordonBGood commented 5 years ago

@rahulmutt, I know how projects like this go, and how you BDFL(s) get so buried in them that sometimes things are implemented just because they happen. What I bring to projects is the ability to step back, look at the low-level implications, and envision ways to improve them. I'll likely never reach your level of expertise at the high level - type families, type classes, monadic code and all the other sub-variations, the implications of Semigroups, knowledge of all the available GHC extensions, etc., etc. :) - but I have 50 years of experience at the low level, with some (enough, maybe) knowledge of the higher abstractions.

No, I haven't built eta yet but I'll get around to it after this project. Or are you suggesting that building eta without direct memory support might fix all the performance woes and make this project unnecessary? I've cloned the Eta source files on my machine so I have a local copy already...

I'm amazed at what you have accomplished in, what, a year and a half or two? The build system is quite polished, and the documentation isn't bad, although it perhaps needs some updating (it also leans on the Haskell documentation, so you are lucky that way)...

Anyway, my main interest is performance so back to work!

Thinking more about the performance implications of turning direct memory off: although it will make SOME difference (which is confirmed by the Java docs), I don't think it is the major cost; the many nested function calls in the indirection likely are.

My main question for today is whether your Array# primitives use the same emulation as the ByteArray# ones, or whether they use Java Object arrays directly? You can likely answer very quickly, without me spending more time digging through your code.

If you don't use them directly, that probably explains the huge performance loss for Array# as well as ByteArray#. I'm building my mutable primitives inside the ST monad instead of the Java one to avoid the extra indirection; it will be up to the programmer to avoid shared modification of unsafely-thawed arrays from separate threads, just as in Haskell. If I'm right...

rahulmutt commented 5 years ago

We re-use existing stuff from the Haskell community when it doesn't make sense to re-invent the wheel, and we make tweaks to take the Java ecosystem into account. Etlas is just a fork of Cabal, which is why it's so smooth. Another contributing factor is that it is incredibly easy to debug any weird bugs, since the JVM provides great introspection mechanisms. GHC, on the other hand, has to debug native binaries, which is a lot more time consuming.

I meant do a build from source and see whether turning off direct memory causes a sizeable difference. If it does, then we can disable it by default and make a runtime flag for those people who want it.

But yes, I agree that it's more than just the difference of native vs heap that's causing the perf difference.

There are a couple of ways to get a real sense of where the perf loss is coming from: you can have the JVM print the JIT-compiled assembly (via the hsdis plugin), or use JitWatch to see how your source code is translated into bytecode and assembly.

And in general, you can just profile, and use a mixture of all the methods above, to figure out what's going on and where there is room for improvement.

GordonBGood commented 5 years ago

@rahulmutt, I had never used cabal much, just the ghc compiler, so I didn't recognize it in Etlas. I see in the Eta repo that you have been able to reuse the majority of the GHC code. Yesterday and last night I went digging for where the performance losses come from and got lost in the implementation of the GHC.Prim primitives in the STG code; I kind of understand the process, but not enough to see how it emulates the array-access primitives. If this project works out in increasing array performance, I would like to submit a PR changing the primitives for your consideration, but it seems I don't have enough knowledge to do that yet.

I believe you have tried to be too faithful in emulating the relationship between GHC and the underlying C code, which doesn't cross over to the Java interface very well performance-wise, even though it works; I think it is something like your versions circa 0.0.9 using ByteBuffer directly, which you have since bypassed, and I expect the things introduced by my project here to improve matters yet again. I don't propose entirely replacing your direct memory allocation tools or implementation, as that may be the only way to provide JNI capability, but rather to provide other, faster ways when that capability is not required.

I am somewhat familiar with viewing and analyzing STG code, as I once worked with SPJ himself on a couple of GHC issues and he walked me through it.

However, as I mentioned, I'm not really a Java guy; the help you have given in recommending Java/JVM tools has been invaluable, but I'm still learning just enough to get by. Anyway, I'm hoping the benchmarks will tell the tale, and that you will be able to help with any small tweaks the tools might reveal.

BTW, in your last paragraph you talk about making the JVM spit out the assembly with hsdis, a process I had read about before but have never used, as it seems quite involved; but you also mentioned JitWatch, whose website says "To see how your source code is translated into bytecode and assembly." Is that assembly different from the long and involved hsdis process?

Yesterday, I spent quite a lot of time digging into the Eta repo, and also looking into why types defined to call FFI Java classes have "nominal" instead of "representational" roles. I didn't really want to change much in the Eta code base, but in order to bypass the current GHC primitives I am having to deal with GHC.IOArray and especially GHC.Arr, which would otherwise call them, and that means understanding the process at least a little. Currently, I'm just "hooking" these two modules to bypass the parts I need handled by this alternate method.

Anyway, I'm back to the main project again now.

GordonBGood commented 5 years ago

@rahulmutt, I'm sure you can give me some quick advice. I'm testing the module, but it takes quite some time to run "etlas clean" and then "etlas run" when there are changes to packages other than the testing project (using the workflow you taught me), so I've tried to move the Java files and some core functionality outside the "array" package to test them separately, and will add them back in once they are working.

I've successfully created a subfolder, put the Eta test source code into it, and run that successfully just by adding the files to "extra-source-files:" in the project ".cabal" file. I've also created a "java" subfolder, put the Java code to be tested in there, and added a "java-sources:" entry under "executable" in the cabal file. I've tried to model all of this on the basic interop documentation.

First, I'm not sure what package name to put at the top; I've tried "eta.test", but Java complains that the package doesn't match the folder structure. And if I put the file in an acceptable folder structure, how do I import it using the FFI? I've tried to import a static method as "eta.test.Test.doit", where "Test" is the class in the Java file and "doit" is a static method in the class, but it never finds the class, whether or not I use a package matching the folder structure. I've wasted quite a lot of time on this.

Can you give me a short example including folder structure, "cabal" file requirements, and "foreign import" requirement line in order to do this?

GordonBGood commented 5 years ago

@rahulmutt, I've been following up on something you said, along the lines of "startup/warmup time for Eta is longer than for Java". I expected it to be slower, just as Scala is slower due to the time taken to load its quite extensive runtime library, which I assumed Eta's would also be. However, it is really, really slow. Once this slow array thing is fixed, slow startup time may be the thing that prevents people from switching from Java or Kotlin to Eta.

Kotlin startup time on my admittedly slow machine is still something under a second, with Java a little faster. Eta startup time approaches 10 seconds! This is running a stripped "uberjar" directly from Java.

Is there anything that can be done now or in the future to reduce this?

rahulmutt commented 5 years ago

I don't understand why Java complains about the folder structure.

Say you have: java/Utils.java

package eta.array.test;

public class Utils {
   public static void someMethod() {...}
}

You can put java-sources: java/Utils.java in your .cabal file, and for the import you can do:

foreign import java unsafe "@static eta.array.test.Utils.someMethod" someMethod :: Java a ()

One way to reduce the startup time is to switch over to invokedynamic-based code generation to lazily generate classes for thunks and function closures, but invokedynamic has its own startup-time issues which we need to look into. Or maybe there's another factor affecting startup time that we need to take a deep look at. Again, filing an isolated issue with an example that demonstrates this would be useful.

How did you create a stripped "uberjar"? Did you run Proguard yourself? By default, uberjars generated by the compiler or Etlas are not stripped.

GordonBGood commented 5 years ago

@rahulmutt, it may be just my editor's Java plug-in (VSCode), but it flags a statement such as package eta.array.test; as an error unless the "Utils.java" file is inside a java/eta/array/test/ folder structure. I have tried both ignoring the error and building the requested folder structure so the error flag goes away (changing the cabal `java-sources:` entry to match whatever the folder structure is), with the same results, so I think this may be just an unimportant side issue. The interesting thing is that when I view the "Util2.java" file inside the "array" package, the plug-in doesn't flag any error.

Then, in my "Main.hs" file, I put the foreign import statement just as you recommend and call java someMethod in the main function. The program compiles successfully, but when run it reports an "undefined class error: can't find eta.array.test.Util", or something to that effect, no matter what I try. It looks like I'll have to send you a zip of a minimal project to see what you think.

As to startup time: I decided to try the "uberjar" and "stripped" options (by "stripped" I just mean "--enable-executable-stripping") to see whether distribution sizes are reduced, and how they compare to using Clojure, with which I have some experience. I didn't really worry about the "stripped" option, as the output ".jar" files aren't all that large, even as an "uberjar" (that means standalone, right?); the real reason I tried them was to see if startup time would be reduced, which it wasn't. Ever since I started using Eta, startup time has been much longer than an incremental build, and almost as long as a full build cycle after a clean. I'll file an isolated issue, likely with the same sort of minimal project as for the FFI problem described earlier. I'll upload that project here first, for the FFI problem, to see what you think, and then open the other issues as you recommend.

The project compiles successfully, but when run it produces the following error, and it takes almost six seconds to start up on my machine, even without the call to someMethod or the foreign import (i.e., as a simple hello world program):

Exception in thread "main" java.lang.NoClassDefFoundError: eta/array/test/Utils
        at main.Main.$wa(Main.hs:10)
        at main.Main.main1(Main.hs)
        at main.Main$main1.applyV(Main.hs)
        at eta.runtime.exception.Exception.catch_(Exception.java:132)
        at main.Main.main3(Main.hs)
        at main.Main.DZCmain(Main.hs:9)
        at main.Main$DZCmain.applyV(Main.hs:9)
        at eta.runtime.stg.Closures$EvalLazyIO.enter(Closures.java:154)
        at eta.runtime.stg.Capability.schedule(Capability.java:254)
        at eta.runtime.stg.Capability.scheduleClosure(Capability.java:210)
        at eta.runtime.Runtime.evalLazyIO(Runtime.java:397)
        at eta.runtime.Runtime.main(Runtime.java:390)
        at eta.main.main(Unknown Source)
Caused by: java.lang.ClassNotFoundException: eta.array.test.Utils
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:583)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
        ... 13 more

Minimal FFI project file zip attached: eta-ffi-helloworld.zip

rahulmutt commented 5 years ago

I am not able to reproduce that problem with Java 8 - I have a feeling it's the bug Eta has with Java versions > 8. Can you file this as a separate issue so that we can fix it? Thanks.

The suggested solution is to install Java 8 and use an environment manager like jenv in case you want to use more recent Java versions.

rahulmutt commented 5 years ago

--enable-executable-stripping is currently a no-op. We eventually need to implement it using proguard.

GordonBGood commented 5 years ago

@rahulmutt, as to startup time, I think Scala suffers especially because it depends on a particularly complex library, so its startup time may be just as bad as what I am experiencing with Eta - and that may be what the problem is here. Kotlin is better because it doesn't have a very large library and depends mostly on Java.

Thanks for the info on the stripping option; not a worry, I was just curious...

I'll take your advice and install JDK 8; I have no real need for JDK 11 other than liking to try out the latest features, but I don't really "do" Java, and I don't think Kotlin supports the latest either (although it also doesn't crash when run on it). So I'll just uninstall JDK 11 to keep it simple.

And I'll post the issue about it once I confirm that the problem goes away with JDK-8.

EDIT_ADD: JDK 8 did the trick and I'm off. I'll post the new issue about incompatibility with higher-version JDKs within a day.

Thank you so much.

GordonBGood commented 5 years ago

@rahulmutt, I've gotten far enough with the boxed Arrays to show that they are about as fast as in other JVM languages, and thus about 5.5 times faster than they were before (although about six and a half times slower than unboxed arrays, just as in Java). However, they aren't production-ready, as I am fighting the type checker to make them easy to use. As you know, unlike Java, which isn't so strict about types and auto-converts between boxed types and primitives, in Haskell/Eta these are distinct types that must be converted explicitly.

To make this a little easier, I've created a type class and declared instances for Class/Objects and each of the primitive types (Bool, Char, Double, Float, Word, Word8, Word16, Word32, Word64, Int, Int8, Int16, Int32, Int64, Ptr, FunPtr, and StablePtr - about 17 types). That makes for a lot of boilerplate if I cover all the instance implementations. This is easy in regular GHC because the primitives are based on C, which isn't all that type-strict, so one can do static casts between them for the implementation; it may be a little harder in JVM bytecode, but probably not too bad at the low level. In emulating this, I am having to call through the FFI, for which one must use proper type signatures (although one can assign different calling type signatures to the same Java object-array class methods).
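
The shape of the class and its per-type boilerplate is roughly this (the names are illustrative only, not the actual attached code; ObjArr stands in for a hypothetical Object[] wrapper):

class ArrayElem e where
  readElem  :: ObjArr -> Int -> Java a e
  writeElem :: ObjArr -> Int -> e -> Java a ()

-- then one small, nearly identical instance per supported type:
-- instance ArrayElem Int    where ...
-- instance ArrayElem Double where ...
-- (and so on for all ~17 types - the boilerplate in question)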

Anyway, if you think you will use this in production, I may need a code generator for the boilerplate (yes, Template Haskell or other metaprogramming would do it, but would likely take me even longer to get working). But I would rather see this work incorporated as primitives in Eta itself, so that these facilities are available to all and there would be almost no work in patching array-related libraries, yet they would run at Java speeds.

I'll have to go through the same kind of boilerplate for the unboxed primitive arrays, as they cover the same ~17 primitive types, although there I don't have to worry about the boxed case, and the implementation should be a little simpler using ByteBuffer.

So what do you think: should I make everything production-ready, in case it is a while before these get converted to primitive ops, or just do a "proof of concept"?

GordonBGood commented 5 years ago

@rahulmutt, last night I did some digging into the Eta source and I see at last why boxed arrays are so slow. Following your statement in a post above that STG code is the last IR before compiling to Java bytecode, I see that Java boxed object arrays are implemented in terms of the enclosing closures, used directly, without some of the optimizations GHC does in "lambda lifting" to eliminate those closures before they are emitted into the compiled code (native code, after further backend optimization, in GHC). Applying closures is very slow, and turns a simple array access of a few machine clock cycles into many tens of cycles.

So if we want native Java speed handling either object or native primitive arrays in Eta, we need to find a way to eliminate the closures. That won't be easy, as the GHC compiler uses them for STG structure and they are deeply buried in the STG code, so it would be better to deal with them as the Java.Array library does: convert the passed-in closures to plain Java objects on the way in (as for a write), and generate the required closures when necessary on the way out (as for a read).

Of course, the reason the current primitives don't have to deal (much) with the actual types in the object arrays is that the actual types are buried in the containing closure that is actually stored; if we don't use closures anymore, we'll have to deal with handling the different types, just as we must when using object arrays in the Array FFI interface, and as I do in the work I've done so far. It seems this is unavoidable for the speed gains, unless you can see another way.

I'm still thinking about how the interface to the STG code that expects stgClosure could be refactored simply - perhaps just by a simple redefinition of what stgClosure is and of the functions that deal with it in the CodeGen and Runtime? You likely know the answer right away, as it appears you wrote all of this Java code.

As I said before, if we can refactor the Array primitive ops, then everything built on top of them "just works" without all kinds of patching, and even if handling the various potential types gets quite complex, at least it is done in only one place.

What do you think?

GordonBGood commented 5 years ago

@rahulmutt, you might wonder why I want to change the primitive ops for arrays so much. The reason is that while we can "hook" the calls from GHC.Arr (and minor calls from one or two other modules) so that they call my new primitive-op emulations, that only works for a specific package such as "array"; anything else that calls GHC.Arr or the primitive ops directly will still see the current definitions of those ops - I am thinking of packages such as "vector", etc. By replacing all of the primitive array ops with these new faster versions, GHC.Arr (and the other minor hooks for other modules) would not have to be modified at all (one hopes, if we can give Array# and ByteArray# representational roles again), and all of those packages and calls would work automatically with very minor patches, if any. That seems much easier to maintain, although we may experience a few problems at first.

rahulmutt commented 5 years ago

eta.runtime.stg.Closure is the superclass of all lazily-evaluated values generated by Eta. The hierarchy is arranged such that both eta.runtime.stg.Thunk and eta.runtime.stg.Value are subclasses - meaning you can pass a direct value OR a thunk that returns a value wherever an eta.runtime.stg.Closure is expected.

Can you create a repository for your work so that I can take a look at the code you've written so far and give you feedback accordingly? I have no issues with adjusting the primitives - my only concern is breakage. We'd have to rewrite parts of the core libraries if we change the type signatures of the primitives, which I'm not opposed to as long as we get nontrivial performance differences.

GordonBGood commented 5 years ago

@rahulmutt, thanks for the hints on the relationships between eta.runtime.stg.Closure, eta.runtime.stg.Thunk, and eta.runtime.stg.Value; it's reassuring that it is easy to get a direct value rather than a thunk. That investigation bears on what the culmination of this project should be: integration into the Eta primitive ops.

I assure you that the performance differences we will get once this is completed are nontrivial: about 5.5 times faster for boxed number arrays, and over 10 times faster for unboxed primitive arrays.

I am hoping that, if we change the underlying Java for the primitives to accomplish this, we won't have to change the primitive-op signatures at all, so there should be less chance of breakage. The problems I anticipate concern the difference between functions that can deal only with Java classes - such as the Array# primitives, and ByteArray#'s when they aren't pinned or passed to foreign interfaces such as JNI - and those cases when they are. My current implementation of the emulated primitive ops, using only the Haskell/Eta tools I know plus the FFI, causes problems in basic Eta modules such as GHC.Arr: I had to change the roles of the array-containing Array and STArray types from "representational" to "nominal", since that is what FFI wrapper types produce, and the functions that call these emulated primitive ops require class constraints in their type signatures; neither change is required by the "real" primitive ops. So I've had to "hook" a copy of GHC.Arr to call them, which only works for things I can cause to call this new version of GHC.Arr, such as a patched version of the array package. I WOULD like you to look at the code in case I'm missing something. I'm hoping that GHC.Arr won't need to be touched or "hooked" once we have primitive ops that link to the new underlying array model.

I'm having trouble getting what I'm testing now to compile due to these type difficulties, but I'll post a copy of the primitive-ops module and the Java code it calls through FFI for you to look at (attached below); when I get it to compile I'll push it to a GitHub repo. You'll see that I've tried to design functions that look like the primitive ops, but unfortunately calling them requires the changes and constraints outlined above.

The single thing preventing me from compiling currently is the Functor definition in my "hooked" GHC.Arr module, listed as follows:

instance Functor (Array i) where
  fmap :: forall i a b. JObjectArray a => JObjectArray b => (a -> b) -> Array i a -> Array i b
  fmap = amap

where JObjectArray is a type class, and I've had to enable the InstanceSigs extension to get it to accept the instance signature. I get the error:

JavaGHC\Arr.hs:433:11: error:
    No instance for (JObjectArray a)
    Possible fix:
      add (JObjectArray a) to the context of
        the type signature for fmap :: (a -> b) -> Array i a -> Array i b
    When checking that:
        forall i i1 a b.
        (JObjectArray a, JObjectArray b) =>
        (a -> b) -> Array i1 a -> Array i1 b
      is more polymorphic than:
        forall i a b. (a -> b) -> Array i a -> Array i b
    When checking that instance signature for `fmap'
      is more general than its signature in the class
      Instance sig: forall i i1 a b.
                    (JObjectArray a, JObjectArray b) =>
                    (a -> b) -> Array i1 a -> Array i1 b
         Class sig: forall i a b. (a -> b) -> Array i a -> Array i b
    In the instance declaration for `Functor (Array i)'
    |
433 |   fmap :: forall i a b. JObjectArray a => JObjectArray b => (a -> b) -> Array i a -> Array i b
    | 

If I can get this to compile, I can get back to testing...

The primitive emulation Java code and the primitive ops emulation code that calls it through FFI:
Primops.zip

Note that the ObjectArray.java file in the above archive contains a leftover inner Immutable class that I had intended to represent the immutable Array# as opposed to the mutable MutableArray#; it's no longer necessary, as both now just refer to the same ObjectArray under different names. I'll remove it.

GordonBGood commented 5 years ago

@rahulmutt; Whoops, my bad for thinking that your normal Array primitives were slow; I think I got tripped up by partial laziness again, even though I ran the benchmark with the Strict extension! It seems that even with that extension, one still has to add some strictness annotations for structures such as Array that were built to be used lazily (much like lazy lists, I suppose).

I rewrote the benchmarks to compare the speeds of using JObjectArray from your FFI module, calling your current primitives directly, and calling your primitives through the GHC.Arr module (to be sure the rewrite rules are firing correctly). The differences turn out to be fairly inconsequential: the Java FFI is just slightly slower than calling your primitives directly (likely due to the overhead of the Java monad), and calling through GHC.Arr is about 10% slower than the others, likely due to the extra indirection and perhaps some rules that need tweaking. The project proving this is posted here: eta-prims-benchmarks.zip

So I've wasted some time designing array primitives that didn't need to be changed. However, not all is lost: the unboxed array primitives truly are over ten times slower than they need to be, and the knowledge I've gained researching the above should make my recommendations on how to fix them better, as well as make me a better contributor to your overall project(s). Hey, I'm retired, so I've got lots of time; I regret it if I've wasted some of yours.

The next thing to do is to use similar benchmarks as here to reconfirm for everyone's benefit that the Unboxed array primitives truly are slow, which I am about to do. My next post here will verify this.

GordonBGood commented 5 years ago

@rahulmutt, as I've already said, I'm no Java expert. However, analyzing the slowness of the ByteBuffers buried inside the emulated Haskell/Eta ByteArray# and MutableByteArray# made me curious about how this class actually behaves in Java. So last night I thoroughly investigated it...

This was triggered by a strange anomaly: since I regressed to JDK 1.8 after the problems compiling with Eta on higher versions, my use of non-direct ByteBuffers got very slow. I then translated the benchmark I'm using directly into Java using ByteBuffer and discovered the following:

  1. Although Oracle recommends against direct memory allocation in general, citing slower performance, the slowness is only in the increased overhead of allocation and deallocation; actual use of the underlying memory is at least as fast.
  2. For Java 1.8, NOT using direct memory allocation is actually very slow, as the JVM seems to do byte-by-byte manipulation on the underlying backing byte[] array just as we would ourselves; setting the ByteOrder to nativeOrder (recommended, but something the current code doesn't do) helps by a few tens of percent but doesn't solve the major slowdown.
  3. As long as we are stuck with Java 1.8, I recommend that we ALWAYS USE DIRECT MEMORY ALLOCATION, as it can be about four times faster in use than non-direct allocation (along with nativeOrder), especially when using multi-byte primitives.
  4. When accessing bytes as bigger primitives, such as ints or words of more than one byte, doing the byte manipulation oneself is inefficient, and in theory the available view ports should help; however, they are not well optimized in Java 1.8 and are quite slow, as they end up doing the same byte manipulations one would do oneself. In Java 9 and higher they have been optimized immensely (especially when nativeOrder is used) and run much faster than manual manipulation, since the bytecodes access memory directly (even within wrapped byte[] arrays). So our speed problem will largely go away automatically when we upgrade to later Java versions, AS LONG AS WE USE NATIVE ORDER AND VIEW PORTS.
  5. Long term, we would benefit in allocation/deallocation time for smaller sizes by using direct memory only when required and non-direct byte[]-backed memory the rest of the time, but for now, on Java 1.8, we will have to use direct memory allocation exclusively.
  6. We should change the read/write APIs to use view ports: the direct ByteBuffer access methods address by byte and thus need address arithmetic, whereas the view-port APIs account for that automatically and can be translated more directly by the bytecodes into equivalent machine code. This change makes the biggest difference, about a 30% saving, either with the current direct memory allocation or on higher Java versions where effectively direct access is used; the improvement is barely noticeable against the bulk loss of speed with byte[]-backed ByteBuffers on Java 1.8.

In short, the biggest improvement we can make right now, with the fewest changes, for Java 1.8 is to always force direct memory allocation (rather than the other way around, as you previously suggested). This should yield about a four-times speedup in the use of unboxed arrays, with the other suggested changes refining that a little further.
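
In Eta FFI terms, the pattern I'm recommending looks roughly like this (a minimal sketch; the Haskell-side names are illustrative, but the java.nio methods are real):

import Java

data ByteBuffer = ByteBuffer @java.nio.ByteBuffer
  deriving Class

data ByteOrder = ByteOrder @java.nio.ByteOrder
  deriving Class

data LongBuffer = LongBuffer @java.nio.LongBuffer
  deriving Class

-- always allocate direct on JDK 1.8:
foreign import java unsafe "@static java.nio.ByteBuffer.allocateDirect"
  allocateDirect :: Int -> ByteBuffer

foreign import java unsafe "@static java.nio.ByteOrder.nativeOrder"
  nativeOrder :: ByteOrder

foreign import java unsafe order :: ByteBuffer -> ByteOrder -> ByteBuffer

-- the view port is indexed in longs, so no manual byte-address arithmetic:
foreign import java unsafe "asLongBuffer"
  asLongBuffer :: ByteBuffer -> LongBuffer

newDirectLongBuffer :: Int -> LongBuffer
newDirectLongBuffer n =
  asLongBuffer (order (allocateDirect (n * 8)) nativeOrder)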

I think that by implementing these changes, which should only affect a small part of "MemoryManager.java" and "ByteArray.java", we can get an amazing speedup in use of Unboxed arrays, especially when we can support Java 9 and higher!

I could likely handle doing this as a PR if you would like, as it (shouldn't) involve changing Haskell/Eta code at all.

I'm still tuning the Eta benchmarks to show that these discoveries about the Java back end will improve Eta's use of it as expected; I can post the series of Eta/Java benchmarks that clearly show these differences if you're interested.

rahulmutt commented 5 years ago

That Functor instance is not compiling because you're trying to define a Functor only for certain values of a and b (specifically, those that are members of the JObjectArray typeclass), which runs counter to the expectation of Functor: that the transformation applies for all values of a and b. Is there any particular reason you're trying to define a Functor instance?
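
The usual escape hatch is to not define the instance at all and instead export a constrained map (a sketch, reusing the names from your snippet):

-- no Functor instance; callers use this in place of fmap:
amapJ :: (JObjectArray a, JObjectArray b) => (a -> b) -> Array i a -> Array i b
amapJ = amap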

Yes, the Strict extension doesn't change the behavior of already compiled modules that didn't use the Strict extension.
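
For example (a minimal illustration):

{-# LANGUAGE Strict #-}
module Demo where

import GHC.Arr

-- Strict affects only bindings compiled in *this* module; the cells of this
-- array are filled by GHC.Arr code that was compiled without Strict, so each
-- element remains a thunk until it is explicitly forced:
squares :: Array Int Int
squares = listArray (0, 9) [i * i | i <- [0 .. 9]]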

Thank you for your investigation. Feel free to submit a PR that does the opposite of what I suggested earlier.

And yes, please submit the benchmarks if you can, say in a github repo.

I will see if I can fix the support for Java 9+.

GordonBGood commented 5 years ago

@rahulmutt, as to that Functor instance, I'm only having to deal with it because it is in the GHC.Arr code that I am having to bypass in order to support my (hopefully temporary) emulated primitives, and I understand what it does in making array transformations possible as a Functor. My question is: is there a way I can define the JObjectArray type class so that it is seen as fulfilling the "for all a and b" requirement? But I'm thinking this temporary patching isn't the long-term solution anyway, as the patches are so "ugly", as I discuss below.

Thanks for the further explanation on Strict. It seems I've done what needed to be done there.

I have cloned your relevant repos and plan to PR from them, so I haven't cluttered up my repo space with other minor experiments. Do you find it inconvenient to deal with projects that I just upload to this issue, or would you rather have a link to something online that you can view? Let me know, and I'll post relevant work whichever way you prefer.

I'm getting ready to do my first compile of Eta to see the difference from forcing direct memory allocation all the time, but I can already see by examining the code that it isn't going to entirely fix the slowness of array access: even if the array access itself is almost instantaneous (relatively), this code will still be slow compared to direct Java arrays due to nested method-call overhead. Besides the byte-address-to-primitive-length calculation, there are multiple nested calls to class methods in the emulated memory management, adding up to about a hundred machine cycles, and that is where much of the time is going. This is because in your emulation every array memory access is done very "C-like", through the emulated Addr# pointer rather than by simple index operations, and emulating that is what takes all the time. Very clever, but costly in execution.

As I speculated before, it looks like the cleanest way to fix this is to separate the primitives into those that deal with Addr#s and pinned memory and those that don't, and to design a more direct API, NOT THROUGH THE MEMORY MANAGER, for the primitives that don't need them.
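
The split is already visible in the primop types themselves (signatures as in GHC.Exts):

-- Index-addressed ops never see a linear address, so they could map
-- straight onto a ByteBuffer (or view-port) access:
--   readIntArray#  :: MutableByteArray# s -> Int# -> State# s -> (# State# s, Int# #)
--   writeIntArray# :: MutableByteArray# s -> Int# -> Int# -> State# s -> State# s
--
-- Pointer-addressed ops take a raw Addr#, which the JVM can only emulate,
-- so these are the ones that must stay behind the MemoryManager:
--   readIntOffAddr#  :: Addr# -> Int# -> State# s -> (# State# s, Int# #)
--   writeIntOffAddr# :: Addr# -> Int# -> Int# -> State# s -> State# s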

That is a bigger project than I guessed in my last post, and I'm not sure I have enough knowledge (yet, though maybe I'm getting there) to handle it, as I'll need at least some ability to feed code to the CodeGen. It seems to me that we need to use the compiler's knowledge of whether the Addr# type is being used and output the code accordingly, either through the MemoryManager or to direct ByteBuffer methods.

OTOH, I am rather loath to tackle the unboxed-array project by bypassing the current primitives (as I've attempted so far), because it makes the resulting patches "ugly/messy", with various limitations in how it appears to users and various type-signature difficulties that I have not yet overcome (such as the main one I mentioned earlier about adding two constraints to the Functor instance signature); I would much rather fix the source of the slowness, which is mostly the implementation of the current primitives.

Do you have any insight to offer on this?

rahulmutt commented 5 years ago

I completely agree that we need to design primitives that avoid the MemoryManager. I too was never satisfied with the constant need for address lookups; the MemoryManager does cache lookups to cater to the common case of sweeping through a contiguous memory region, but perhaps we can do better.

The problem here is not the primitives, but the fact that the Haskell libraries use pointer operations. We will essentially need to rewrite some of the core libraries like array, vector, bytestring, text, and so on to avoid the use of Ptr entirely. When doing patches, we've already substituted uses of Ptr with FFI to a Java object where it makes sense, especially in the crypto libraries, but doing it for these other libraries will take time. It's worth it in the end, though.
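
To give a flavor of what that substitution looks like (a sketch only, assuming the built-in JLongArray wrapper for long[]; fillArr is an invented example, not code from any of those libraries):

import Java
import GHC.Int (Int64)

-- was:  fill :: Ptr Int64 -> Int -> Int64 -> IO ()
-- now an index-based loop over a Java long[] instead of pointer arithmetic:
fillArr :: JLongArray -> Int -> Int64 -> Java a ()
fillArr arr n x = withObject arr (go 0)
  where
    go i | i >= n    = return ()
         | otherwise = aset i x >> go (i + 1)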

We have some options: 1) See if there are simple adjustments we can make to the MemoryManager to make the performance much more satisfactory; this is the least invasive change. 2) Create a set of new primitives covering all the different types of arrays and replace Ptr with these primitive types and operations in all the core libraries; this requires a lot of work. 3) Maybe add an STG-to-STG pass that can convert Addr# operations to the new array operations; we'd need a well-defined specification for how to go about this. This change is also not very invasive, but it requires quite a lot of thought, and I don't know whether we can do such a transformation in general.

Please do push your work to a GitHub repository even if it's not in a compilable state; that way I can periodically take a look, provide feedback, and see past versions of your work as well. It's well suited for this since there are going to be several iterations, from PoC to production-ready.

GordonBGood commented 5 years ago

@rahulmutt, I hope to enjoy working together on solving this; I'm very much enjoying getting back into Haskell/Eta at a more intense level.

Yes, I see the CachedBlock class and its use, but that still leaves an extra method call or two to check if the block is already in the cache.

I'm now pretty familiar with the array package, and I don't think it uses pointers at all, but I haven't looked at the vector package or the others for strings and text. I'm sure those last do use pointers, as their implementers will have come from a C mindset, and that's how it's done there for speed. We especially can't ignore vector, as it is the "go to" package for those wanting speed beyond even unboxed arrays, so when we are finished with array we need to be sure that what we do can also be applied there.

My first idea today was that we could do everything by modifying only the MemoryManager, as per your point 1), but now I think I see that while we would make some gains, they likely won't be enough, given the levels of nested method calls; the nesting might be reduced, but I don't know that we can go far enough. I will try, though.

Before today, I thought I was working toward your point 2), but the trouble with new non-Haskell primitives is that using them will require all kinds of changes to the Haskell code base, taking the fork further from it and making future upgrades to support GHC 8 and higher much more difficult. Although we might do this as a PoC, I don't know that it's a good long-term solution. It's basically what I was working toward in this project.

I would likely never have come up with your point 3) about STG-to-STG transformations on my own, and therein I reveal my background: I'm a self-taught programmer, educated in an age when the theories embodied in Haskell, or at least their practical application, didn't exist. Not being a Computer Science graduate (I am an Electronics Engineer by education and grew to be a Systems Engineer by occupation), I have never written even a minimal compiler; that is what stopped me from getting more involved with SPJ on Haskell, since the issues I reported, and which he accepted, would require changes to STG and beyond into code generation, with which I have no experience. Perhaps the experience I gain working with you can be applied to what I would really like to do back on Haskell. I really like this solution if it's possible, as it would convert Addr# operations to what suits them and non-Addr# operations to what is best for them, potentially while reducing nested calls for both. I believe that STG retains the type information required to do this, in contrast with Elm (another Haskell offshoot, which compiles to JavaScript), whose author, Evan Czaplicki, had already thrown away the type information in the AST before code generation.

I'll post some of my current work to a repo within the next couple of days; I want it to reflect current thinking, so I'll hold off until I've made a couple of adjustments. I hope you don't mind the ongoing dialog, and if you don't, I'll keep you posted.

GordonBGood commented 5 years ago

@rahulmutt, following is the benchmark code that demonstrates definitively that Eta boxed arrays have adequate speed, comparable to using Java Object[] arrays through FFI, but that Eta unboxed arrays are currently even slower than boxed arrays, while direct-memory-allocated ByteBuffer-backed arrays are almost as fast as using Java primitive arrays through FFI; all timings are for 100 loops:

{-# LANGUAGE MultiParamTypeClasses, ScopedTypeVariables, Strict,
             FlexibleContexts, MagicHash, UnboxedTuples, BangPatterns #-}

module Main where

import Java
import Data.Bits
import GHC.Int
import GHC.ST
import GHC.Base (Object(..), Int(..), Int64#(..))
import GHC.Exts (uncheckedIShiftRL#)
import GHC.Exts as OldPrims
import GHC.Arr as DArrs
import Data.Array.Base as UArrs
import Data.Array.ST (runSTUArray)

foreign import java unsafe "@static java.lang.System.currentTimeMillis"
  currentTimeMillis :: IO Int64

data JLongObjectArray = JLongObjectArray @java.lang.Long[]
  deriving Class

instance JArray JLong JLongObjectArray

primesbenchJObjArr :: Int -> IO Int
primesbenchJObjArr lps = java $ do
  !cmpsts <- arrayFromList $
                (map toJava $ take 2048 $ repeat (0 :: Int64) :: [JLong])
  let
    loop x =
      if x <= 0 then
        let
          cntem i !c =
            if i >= 131072 then return c else do
              v <- withObject cmpsts $ aget (i `shiftR` 6)
              if (fromJava v :: Int64) .&. (1 `shiftL` (i .&. 63)) /= 0
                then cntem (i + 1) c
              else cntem (i + 1) (c + 1)
        in cntem 0 1
      else
        let
          loopi i =
            if i > 254 then loop (x - 1) else do
              v <- withObject cmpsts $ aget (i `shiftR` 6)
              if ((fromJava v) :: Int64) .&. (1 `shiftL` (i .&. 63)) /= 0 then loopi (i + 1) else
                let
                  p = i + i + 3
                  s0 = 2 * i * (i + 3) + 3
                  loopc c =
                    if c >= 131072 then loopi (i + 1) else do
                      let w = c `shiftR` 6
                      v <- withObject cmpsts $ aget w
                      withObject cmpsts $ aset w $ toJava ((fromJava v) .|. ((1 :: Int64) `shiftL` (c .&. 63)))
                      loopc (c + p)
                in loopc s0
        in loopi 0
  loop lps

primesbenchPrims :: Int -> Int64
primesbenchPrims lps = runST $ ST $ \s0# ->
  let
    !(# !s1#, !cmpsts# #) = OldPrims.newArray# 2048# (0 :: Int64) s0#
    loop x sl# =
      if x <= 0 then
        let
          !(# !sl'#, !arr# #) = OldPrims.unsafeFreezeArray# cmpsts# sl#
          go 131072 !c = c
          go i@(I# i#) !c =
            let (# v #) = OldPrims.indexArray# arr# (i# `uncheckedIShiftRL#` 6#) in
            if v .&. (1 `shiftL` (i .&. 63)) /= 0 then go (i + 1) c
            else go (i + 1) (c + 1)
        in (# sl'#, go 0 1 #)
      else
        let
          loopi i@(I# i#) si# =
            if i > 254 then loop (x - 1) si# else
            case OldPrims.readArray# cmpsts# (i# `uncheckedIShiftRL#` 6#) si# of
              !(# !si'#, v #) ->
                if v .&. (1 `shiftL` (i .&. 63)) /= 0 then loopi (i + 1) si# else
                let
                  p = i + i + 3
                  s0 = (i `shiftL` 1) * (i + 3) + 3
                  loopc c@(I# c#) !sc# =
                    if c >= 131072 then loopi (i + 1) sc# else
                    let w# = c# `uncheckedIShiftRL#` 6# in
                    case OldPrims.readArray# cmpsts# w# sc# of
                      !(# !sc'#, !v #) ->
                        let msk = 1 `shiftL` (c .&. 63) in
                        case OldPrims.writeArray# cmpsts# w# (v .|. msk) sc'# of
                          !sc''# -> loopc (c + p) sc''#
                in loopc s0 si'#
            in loopi 0 sl#
  in loop lps s1#

primesbenchArray :: Int -> Int64
primesbenchArray lps = findcnt 0 1 where
  !composites = runST $ do
    !cmpsts <- unsafeThawSTArray $ DArrs.listArray (0,2047) $ repeat (0 :: Int64)
    let
      loop x =
        if x <= 0 then do
          !fcmpsts <- unsafeFreezeSTArray cmpsts
          return fcmpsts else
        let
          loopi i =
            if i > 254 then loop (x - 1) else do
              v <- DArrs.unsafeReadSTArray cmpsts (i `shiftR` 6)
              if v .&. (1 `shiftL` (i .&. 63)) /= 0 then loopi (i + 1) else
                let
                  p = i + i + 3
                  s0 = (i `shiftL` 1) * (i + 3) + 3
                  loopc c =
                    if c >= 131072 then loopi (i + 1) else
                    let w = c `shiftR` 6 in do
                    let msk = 1 `shiftL` (c .&. 63)
                    !v <- DArrs.unsafeReadSTArray cmpsts w -- strictness to the max!!!
                    !dmy <- DArrs.unsafeWriteSTArray cmpsts w (v .|. msk)
                    dmy `seq` loopc (c + p)
                in loopc s0
        in loopi 0
    loop lps
  findcnt i !c =
    if i >= 131072 then c else
    let !v = DArrs.unsafeAt composites (i `shiftR` 6) in
    if v .&. (1 `shiftL` (i .&. 63)) /= 0 then findcnt (i + 1) c
    else findcnt (i + 1) (c + 1)

primesbenchJInt64Arr :: Int -> IO Int
primesbenchJInt64Arr lps = java $ do
  !cmpsts <- arrayFromList $ take 2048 $ repeat (0 :: Int64)
  let
    loop x =
      if x <= 0 then
        let
          cntem i !c =
            if i >= 131072 then return c else do
              v <- withObject cmpsts $ aget (i `shiftR` 6)
              if v .&. (1 `shiftL` (i .&. 63)) /= 0
                then cntem (i + 1) c
              else cntem (i + 1) (c + 1)
        in cntem 0 1
      else
        let
          loopi i =
            if i > 254 then loop (x - 1) else do
              v <- withObject cmpsts $ aget (i `shiftR` 6)
              if v .&. (1 `shiftL` (i .&. 63)) /= 0 then loopi (i + 1) else
                let
                  p = i + i + 3
                  s0 = 2 * i * (i + 3) + 3
                  loopc c =
                    if c >= 131072 then loopi (i + 1) else do
                      let w = c `shiftR` 6
                      v <- withObject cmpsts $ aget w
                      withObject cmpsts $ aset w $ v .|. (1 `shiftL` (c .&. 63))
                      loopc (c + p)
                in loopc s0
        in loopi 0
  loop lps

primesbenchByteArrPrims :: Int -> Int64
primesbenchByteArrPrims lps = runST $ ST $ \s0# ->
  let
    !(# !s1#, !cmpsts# #) = OldPrims.newByteArray# 16384# s0#
    loop x sl# =
      if x <= 0 then
        let
          !(# !sl'#, !arr# #) = OldPrims.unsafeFreezeByteArray# cmpsts# sl#
          go 131072 !c = c
          go i@(I# i#) !c =
            let !v# =
                  OldPrims.indexInt8Array# arr# (i# `uncheckedIShiftRL#` 3#) in
            case v# `andI#` (1# `uncheckedIShiftL#` (i# `andI#` 7#)) of
              0# -> go (i + 1) (c + 1)
              _  -> go (i + 1) c
        in (# sl'#, go 0 1 #)
      else
        let
          loopi i@(I# i#) si# =
            if i > 254 then loop (x - 1) si# else
            case OldPrims.readInt8Array# cmpsts# (i# `uncheckedIShiftRL#` 3#) si# of
              !(# !si'#, !v# #) ->
                case v# `andI#` (1# `uncheckedIShiftL#` (i# `andI#` 7#)) of
                  0# ->
                    let
                      p = i + i + 3
                      s0 = (i `shiftL` 1) * (i + 3) + 3
                      loopc c@(I# c#) !sc# =
                        if c >= 131072 then loopi (i + 1) sc# else
                        let w# = c# `uncheckedIShiftRL#` 3# in
                        case OldPrims.readInt8Array# cmpsts# w# sc# of
                          !(# !sc'#, !v# #) ->
                            let msk# = 1# `uncheckedIShiftL#` (c# `andI#` 7#) in
                            let nv# = v# `orI#` msk# in
                            case OldPrims.writeInt8Array# cmpsts# w# nv# sc'# of
                              !sc''# -> loopc (c + p) sc''#
                    in loopc s0 si'#
                  _  -> loopi (i + 1) si#
        in loopi 0 sl#
    init i@(I# i#) !sn# =
      if i >= 16384 then loop lps sn# else
      case OldPrims.writeInt8Array# cmpsts# i# 0# sn# of
        !sn'# -> init (i + 1) sn'#
  in init 0 s1#

primesbenchUArray :: Int -> Int64
primesbenchUArray lps = findcnt 0 1 where
  composites = runSTUArray $ do
    cmpsts <- newArray (0,16383) (0 :: Int8) :: ST s (STUArray s Int Int8) -- 16384 bytes
    let
      loop x =
        if x <= 0 then return cmpsts else
        let
          loopi i =
            if i > 254 then loop (x - 1) else do
              v <- unsafeRead cmpsts (i `shiftR` 3)
              if v .&. (1 `shiftL` (i .&. 7)) /= 0 then loopi (i + 1) else
                let
                  p = i + i + 3
                  s0 = (i `shiftL` 1) * (i + 3) + 3
                  loopc c =
                    if c >= 131072 then loopi (i + 1) else
                    let w = c `shiftR` 3 in do
                    let msk = 1 `shiftL` (c .&. 7)
                    !v <- UArrs.unsafeRead cmpsts w -- strictness to the max!!!
                    !dmy <- UArrs.unsafeWrite cmpsts w (v .|. msk)
                    dmy `seq` loopc (c + p)
                in loopc s0
        in loopi 0
    loop lps
  findcnt i !c =
    if i >= 131072 then c else
    let !v = UArrs.unsafeAt composites (i `shiftR` 3) in
    if v .&. (1 `shiftL` (i .&. 7)) /= 0 then findcnt (i + 1) c
    else findcnt (i + 1) (c + 1)

data ByteBuffer = ByteBuffer @java.nio.ByteBuffer
  deriving Class

data ByteOrder = ByteOrder @java.nio.ByteOrder
  deriving Class

data Int64Buffer = Int64Buffer @java.nio.LongBuffer
  deriving Class

{-# INLINE newBB #-}
foreign import java unsafe "@static java.nio.ByteBuffer.allocateDirect"
  newBB :: Int -> ByteBuffer

{-# INLINE nativeOrder #-}
foreign import java unsafe "@static java.nio.ByteOrder.nativeOrder"
  nativeOrder :: ByteOrder

foreign import java unsafe "@static @field java.nio.ByteOrder.LITTLE_ENDIAN"
  littleEndian :: ByteOrder

{-# INLINE order #-}
foreign import java unsafe order :: ByteBuffer -> ByteOrder -> ByteBuffer

{-# INLINE makeInt64BB #-}
foreign import java unsafe "asLongBuffer"
  makeInt64BB :: ByteBuffer -> Int64Buffer

{-# INLINE writeInt64BB #-}
foreign import java unsafe "putLong"
  writeInt64BB :: ByteBuffer -> Int -> Int64 -> ByteBuffer

{-# INLINE writeInt64Buffer #-}
{-
writeInt64Buffer :: ByteBuffer -> Int -> Int64 -> ByteBuffer
writeInt64Buffer bb i val = writeInt64BB bb (i `shiftL` 3) val
-}
foreign import java unsafe "put"
  writeInt64Buffer :: Int64Buffer -> Int -> Int64 -> Int64Buffer

{-# INLINE newInt64Buffer #-}
newInt64Buffer :: Int -> Int64 -> Int64Buffer
newInt64Buffer size initval =
  let
    i64bb = makeInt64BB $ order (newBB $ size `shiftL` 3) littleEndian
    loop i =
      if i >= size then i64bb else
      case writeInt64Buffer i64bb i initval of
        !_ -> loop (i + 1)
  in loop 0

{-# INLINE readInt64BB #-}
foreign import java unsafe "getLong"
  readInt64BB :: ByteBuffer -> Int -> Int64

{-# INLINE readInt64Buffer #-}
{-
readInt64Buffer :: ByteBuffer -> Int -> Int64
readInt64Buffer bb i = readInt64BB bb (i `shiftL` 3)
-}
foreign import java unsafe "get"
  readInt64Buffer :: Int64Buffer -> Int -> Int64

primesbenchByteBuffer :: Int -> Int64
primesbenchByteBuffer lps =
  let
    !cmpsts = newInt64Buffer 2048 0
    loop x =
      if x <= 0 then
        let
          go 131072 !c = c
          go i !c =
            let !v = readInt64Buffer cmpsts (i `shiftR` 6) in
            case v .&. (1 `shiftL` (i .&. 63)) of
              0 -> go (i + 1) (c + 1)
              _ -> go (i + 1) c
        in go 0 1
      else
        let
          loopi i =
            if i > 254 then loop (x - 1) else
            case readInt64Buffer cmpsts (i `shiftR` 6) of
              !v ->
                case v .&. (1 `shiftL` (i .&. 63)) of
                  0 ->
                    let
                      p = i + i + 3
                      s0 = (i `shiftL` 1) * (i + 3) + 3
                      loopc c =
                        if c >= 131072 then loopi (i + 1) else
                        let w = c `shiftR` 6 in
                        case readInt64Buffer cmpsts w of
                          !v ->
                            let msk = 1 `shiftL` (c .&. 63) in
                            case writeInt64Buffer cmpsts w (v .|. msk) of
                              !_ -> loopc (c + p)
                    in loopc s0
                  _ -> loopi (i + 1)
        in loopi 0
  in loop lps

main :: IO ()
main = do
  cnt <- primesbenchJObjArr 100 -- warm up to trigger JIT
  print cnt
  strt <- currentTimeMillis
  fcnt <- primesbenchJObjArr 100
  stop <- currentTimeMillis
  print fcnt
  putStrLn $ "Java Object Array:  " ++ show (stop - strt) ++ " milliseconds."
  print $ primesbenchPrims 100 -- warm up to trigger JIT
  strtop <- currentTimeMillis
  let cntop = primesbenchPrims 100
  stopop <- currentTimeMillis
  print cntop
  putStrLn $ "Current primitives:  " ++ show (stopop - strtop) ++ " milliseconds."
  print $ primesbenchArray 100 -- warm up to trigger JIT
  strtarr <- currentTimeMillis
  let cntarr = primesbenchArray 100
  stoparr <- currentTimeMillis
  print cntarr
  putStrLn $ "Current through Array:  " ++ show (stoparr - strtarr) ++ " milliseconds."
  putStrLn "The above is as fast using current Eta Boxed arrays as through FFI"
  jcnt <- primesbenchJInt64Arr 100 -- warm up to trigger JIT
  print jcnt
  strtjarr <- currentTimeMillis
  cntjarr <- primesbenchJInt64Arr 100
  stopjarr <- currentTimeMillis
  print cntjarr
  putStrLn $ "Current through Java Array:  " ++ show (stopjarr - strtjarr) ++ " milliseconds."
  print $ primesbenchByteArrPrims 100 -- warm up to trigger JIT
  strtbap <- currentTimeMillis
  let cntbap = primesbenchByteArrPrims 100
  stopbap <- currentTimeMillis
  print cntbap
  putStrLn $ "Current ByteArray primitives:  " ++ show (stopbap - strtbap) ++ " milliseconds."
  print $ primesbenchUArray 100 -- warm up to trigger JIT
  strtuarr <- currentTimeMillis
  let cntuarr = primesbenchUArray 100
  stopuarr <- currentTimeMillis
  print cntuarr
  putStrLn $ "Current through Unboxed Array:  " ++ show (stopuarr - strtuarr) ++ " milliseconds."
  putStrLn "The above using Eta Unboxxed Array is slower than using Boxed Array and"
  putStrLn "much slower than using FFI - therefore unacceptible!!!"
  print $ primesbenchByteBuffer 100 -- warm up to trigger JIT
  strtbb <- currentTimeMillis
  let cntbb = primesbenchByteBuffer 100
  stopbb <- currentTimeMillis
  print cntbb
  putStrLn $ "Simple proposal using Int64 Byte Buffer:  " ++ show (stopbb - strtbb) ++ " milliseconds."
  putStrLn "The above shows that using ByteBuffer can be almost as fast as directly using Java Arrays!"
  putStrLn "This is the speed we are looking for..."

when run with "etlas run -O2" this produces:

Successfully built 1 module. Great Hustle!
23000
23000
Java Object Array:  877 milliseconds.
23000
23000
Current primitives:  758 milliseconds.
23000
23000
Current through Array:  893 milliseconds.
The above is as fast using current Eta Boxed arrays as through FFI
23000
23000
Current through Java Array:  124 milliseconds.
23000
23000
Current ByteArray primitives:  1249 milliseconds.
23000
23000
Current through Unboxed Array:  1251 milliseconds.
The above using Eta Unboxed Array is slower than using Boxed Array and
much slower than using FFI - therefore unacceptable!!!
23000
23000
Simple proposal using Int64 Byte Buffer:  141 milliseconds.
The above shows that using ByteBuffer can be almost as fast as directly using Java Arrays!
This is the speed we are looking for...

GordonBGood commented 5 years ago

@rahulmutt, I'm in process of building Eta from source using stack as per the included scripts.

Can one build it from source without stack if one already has a working GHC on one's machine? Likely, but why bother: Stack pins a specific GHC version and pulls all the dependencies from a "sandbox", so without it one would have to arrange to download and compile all the dependencies oneself, as well as use that specific GHC version. Does GHC have to be a specific version, then? The answer seems to be yes: currently GHC 7.10.3.

Are the standard Etlas and Eta compiled using the standard GHC native codegen back end? If so, have you ever tried specifying LLVM as the back end? In my experience it can make code with tight loops up to about two times faster, so it could speed up the Etlas/Eta compile process itself, even though that isn't what we are working on here.

GordonBGood commented 5 years ago

@rahulmutt, I finished compiling Eta from source, then modified it to force direct memory allocation and compared that against forcing non-direct allocation, using the benchmarks package from before.

The result is that there isn't much difference, if any, when using byte-sized elements (Int8/Word8), and some improvement when using multi-byte elements (such as Int64/Word64); this isn't the entire answer we are looking for, so we need to go further.

I take it that changing primitives (approach 2) means rebuilding the compiler. Does that mean continually rebuilding the eta executable? What are the command-line commands to do that? I suppose that writing an STG-to-STG transformer (approach 3) would require adding an extra pass to the compiler and using the same rebuild process?

I'm thinking about how one could short-circuit the nested method calls for the ByteArray# and MutableByteArray# primitives (approach 1), and I think we may be able to "short-circuit" a few nested method calls just by modifying the ByteArray class. Doing that requires the following:

  1. We need to add a new private field that is either null when the ByteArray is pinned, indicating that the current scheme will be used, or a directly used Java ByteBuffer when it is not pinned, for direct access.
  2. We would change the ByteArray class methods to behave differently depending on whether this new field is set: the getBuffer methods would just return the contained Java ByteBuffer if it exists (or, more ideally, the appropriate view port, for a bit better performance) and otherwise do as they do currently, with similarly modified methods for all the copyXXX variations.
  3. We would prefer NOT to convert normal byte/multi-byte array indexes into byte-address pointers for unpinned arrays, but rather preserve them intact, since we already know the element type at the point of use and can call the appropriate get/put methods. That may not be possible without modified primitives, as part of the conversion is built into them (the multiplication by the byte size of the primitive type, and the idea that every ByteArray is addressed by its byte-address pointer). If it can't be done (relatively) easily, we could try back-calculating the actual index from the "pointer", but that would lose some of the time gains even if it's possible, which starts to make this approach look like one of little gain.
  4. Given the above, the rest is relatively easy: we could (optionally) change the methods to get/put through view ports rather than directly from the Java ByteBuffer, to avoid doing the index scaling manually, automatically choosing the appropriate overload based on the expected return type; if that proves difficult, it's an extra tuning measure...
  5. However, this is hardly worth doing: we would have eliminated only one or at most two nested method calls while actually adding steps to back-calculate the true index from the byte-address pointer, so we won't get even a factor-of-two better performance.

It looks like we have to move on to approach 2) (my initial approach), which can use the PoC I was developing before (even though the resulting code is somewhat "ugly") to do it without immediately changing the primitive ops, although changing them would be the long-term production solution.

I'm still thinking about the STG-to-STG transformation (approach 3), with which it would seem one could achieve the steps of approach 1) while still eliminating the overheads; if approach 2) works, it may not be necessary. It would have the disadvantage of requiring an extra compiler pass, but that may not cost much compile time.

I'm thinking that my PoC for approach 2) will show the primitives doing essentially what was described for the rejected approach 1), except that the get/put primitives will have to check some sort of pinned/unpinned state and do one of two things accordingly; although potentially much more direct than approach 1), there will still be a branch condition to process.

What approach 3) gives us is avoidance of that branch at runtime: the transformer will already have routed code generation to one of two (or more) equivalent "under the covers" primitives based on things such as pinned status and the type of the contained data (the size of the primitive, etc.). I'm starting to think it wouldn't be that hard, nor carry much risk, as it would be limited to the unboxed primitives. In other words, I don't propose doing this in general (everything other than ByteArray access already has adequate performance), only for this purpose.

The great thing about it is that the PoC for approach 2) becomes the general plan for approach 3): the definition of the programmer-visible primitives won't change at all and will keep their standard Haskell/Eta signatures; only the invisible code-generation primitives will change (and be added).
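
As a sketch of the routing (the Direct#/Pinned# names are purely illustrative, not proposed spellings):

-- The pass leaves the source-level primops alone and rewrites at STG level,
-- choosing the backend primitive from the array's provenance:
--
--   readWord8Array# arr i s
--     ==>  readWord8Direct# arr i s   -- arr known unpinned: straight
--                                     -- ByteBuffer/view-port access
--     ==>  readWord8Pinned# arr i s   -- arr pinned or address-taken:
--                                     -- stays behind the MemoryManager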

Also, since the array package doesn't use Addr# at all, other than as possible contents of arrays, there is no need for approach 3) for that package, nor can I find any uses of Addr# in the vector package; there may be some in the other packages you mentioned, though. Generally, Addr# is only used for FFI, but it may occasionally be used for maximum speed in "C-like" operations; I've used it to good effect in a major loop-unrolling optimization, but even that was only necessary because the code generation wasn't doing a usual C-style optimization (part of the subject of the issue I discussed with SPJ). If I ran that code on Eta, the pointer operations would actually run slower than what we are working toward, by about the ratio we are trying to recover here.

However, if we don't do approach 3), the culmination of approach 2) requires primitive op changes.

So back to my PoC.

GordonBGood commented 5 years ago

@rahulmutt, I am going to have to make major patches to Data.Array.Base in the array package to make the PoC work, in that there will still be a type class used; fortunately, that module doesn't define any new Functor instances, so I shouldn't run into the "show-stopper" problem from before.

Since my benchmarks show that boxed arrays have adequate performance, nearly equal to Java Object[] arrays, I am no longer going to patch GHC.Arr, so that problem goes away.

rahulmutt commented 5 years ago

You could try adding -fllvm to ghc-options in the eta.cabal file and see what happens. AFAIK the LLVM backend as of GHC 7.10.3 is OK most of the time and sometimes not. We'd have to benchmark to see if it's actually better for most workloads.

I don't think you should be that concerned about method calls - the JIT inlines them pretty well. What we should be concerned about is whether the JIT is actually inlining and compiling properly, and we should check the generated assembly to see that it's being compiled well. It's counterintuitive, but sometimes you need to create more sub-methods in order for the JIT to do its work better: if a method's size crosses a certain large threshold, the JIT compiler won't even bother to compile it, and that method will stay in interpreted mode! We need to ensure that all the method calls inside MemoryManager are properly compiled and inlined.

If you just want to check for inlining and compiling status of methods here you go:

export ETA_JAVA_ARGS="-XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining"
etlas run

Make sure you pipe the output to a file or something because the output will be huge.

For checking out the assembly you do need to get hsdis setup properly. I did it a while ago and it wasn't too much trouble.

You don't have to worry about adding new primitives making it harder to backport GHC 8 because it doesn't. It's orthogonal. We already have several primitives that GHC doesn't.

Thanks for the updates.

GordonBGood commented 5 years ago

@rahulmutt, the LLVM question was an aside from the current issue, of course, as it would improve compile times rather than run-time performance. I'll try the experiment eventually. I think that by GHC 7.10.3 the LLVM integration was fairly good; the GHC native code generator is actually pretty primitive in its register allocation (SPJ agrees), and I usually only use it for quick compiles while developing Haskell code. Like I say, I will try the options you recommend - eventually.

I think I will continue the PoC of emulating the current Haskell-style primitives, to show what I think we can do that way, before looking into why the current implementation is so slow; that said, I can't see how this benchmark's speed of about 120 CPU clock cycles per composite-number culling loop could get this bad without a lot of non-inlined method calls, as this loop usually runs in Java in the 10-to-15-clock-cycle range. Even if we knew for certain that lack of method inlining is the problem, there isn't much we could do about it, as I discussed in analyzing approach 1): the methods are already quite short. Eliminating the index math converting between pointers and indexes might make them shorter still, but we will be doing that anyway in approach 2).

I may follow your suggestion of viewing the JIT inlining and compile status, and compile hsdis to look at the generated assembly, but that would be more because I like investigating low-level things than because I expect to find anything that will help this project.

If approach 2) works out as I hope, we won't need any new primitives, just changes to the implementation of the current ones. Does changing current primitives that are used only by the array package this project is improving, and not within the Eta base libraries, require a clean install? I suppose it would, so that recompiling the patched array package sees the changes?

Approach 3) may need new "under the covers" primitives for the few cases where the same Addr# primitive may be used for either pinned or unpinned ByteArrays (although it is recommended that they never be used for unpinned arrays, this isn't checked).

Hopefully I can have a PoC of approach 2) up on a repo within the next 24 hours for your consideration and further discussion.

Once this is done, I may dust off my STG skills and see what we could do with adding approach 3) before wasting to much time finalizing the primitive ops changes required for approach 2). Given that it is fairly simple and unlikely to introduce problems, do you think that adding a compiler pass would be easier or harder than changing primitives? Which do think it would be harder for me to learn to do?