About the new memory-mapped-file benchmark

Apollo3zehn commented 1 year ago

What is the suggestion?

Hi, I came across your library since I am also developing a non-native HDF5 reading library (for .NET) and noticed your new memory-mapped file benchmark. I have some comments on it:

Memory-mapped files work best for random file access, because of reduced number of system calls. However, your benchmark is just copying a large sequential block of data where I guess MMF will not shine.
Both benchmarked methods allocate an array per iteration, so GC overhead will become a signifcant part of the benchmark result.
File system cache will hold the whole file in memory right after is has been written to disk. I.e. your benchmarks are not measuring disk access performance but memory access performance. Maybe this is intended, but I prefer to have it mentioned at least.

Hopefully you find some of the points above useful :-) If not, please ignore this issue, I just wanted to write down my thoughts on this benchmark.

jamesmudd commented 1 year ago

Thanks for looking at jHDF and for looking at the benchmark. PureHDF looks really good, nice to see another project with similar goals.

Agree memory mapping is best suited to random access. However the way jHDF works at the moment it read sections of the file as needed. I started considering this benchmark because of http://pkirkham.github.io/pyrus/parallel-io-improvements/ it shows jHDf is slower at smaller chunk sizes. I suspected this is due to the overhead of setting up memory mapping, as suggested by https://docs.oracle.com/javase/8/docs/api/java/nio/channels/FileChannel.html#map-java.nio.channels.FileChannel.MapMode-long-long- . The results of this benchmark seem to confirm this. I agree this isn't a great way to use memory mapping but to change this would need some rewriting of jHDF to hold a larger segment of the file already mapped.
Agree this could be improved however its the same for both memory-mapped and conventional reads and with JMH handling the repeats the GC should be averaged out, I don't think this would affect the overall results.
Agree again with this, but as the main goal for me was assessing the overhead of setting up the mapping compared to the conventional read I think this is ok. Both the memory mapped and conventional reads would benefit from OS caching.

Thanks again for the comment, and some nice ideas for improvements. If you perform any benchmarks including jHDF I would be very interested to see the results.

Apollo3zehn commented 1 year ago

Thank you for the link, very interesting. Unfortunately I cannot find any graph showing that jHDF is smaller at slower at smaller chunks sizes, I see it only written in the text. Did I oversee something?

Why don't you just map the whole file into memory and use it for all I/O, not only to read the actual data but also for the rest of the file contents? Then you would set it up only once and use it everywhere.

Regarding your second point: Yes it is the same for both benchmarks and the time different between both will not be affected largely but if you calculate the ratios between the benchmark results, they will change. What I mean is this:

Current setup (example numbers)

Benchmark 1: 10 ms/op
Benchmark 2:  9 ms/op

Difference: 1 ms/op, ratio = 0.90

Improved setup (example numbers)

Benchmark 1: 3 ms/op
Benchmark 2: 2 ms/op

Difference: 1 ms/op, ratio = 0.67

I would just create a single buffer and re-use it everywhere since thread-safety is no issue here :-)

Apollo3zehn commented 1 year ago

Ah I forgot to note that I plan to create benchmarks to compare all reimplemented HDF5 libraries (Java, Julia, C#, Python, (JS?)) but that has no high priority right now. I am working hard on preparing my project for a first release and I am happy when that is done as HDF5 is a tough nut :D

jamesmudd commented 1 year ago

Thank you for the link, very interesting. Unfortunately I cannot find any graph showing that jHDF is smaller at slower at smaller chunks sizes, I see it only written in the text. Did I oversee something?

In figure 4 you see the parallel traces (jHDF) have slower speed reads at smaller chunk sizes (where the mapping overhead starts to dominate)

Why don't you just map the whole file into memory and use it for all I/O, not only to read the actual data but also for the rest of the file contents? Then you would set it up only once and use it everywhere.

The main reason is a Java limitation that the maximum map size is ~2GB https://github.com/openjdk/jdk/blob/jdk-11%2B28/src/java.base/share/classes/sun/nio/ch/FileChannelImpl.java#L946 so I would need code to handle mapping multiple 2GB chunks. This is possible but not been a high priority.

Regarding your second point: Yes it is the same for both benchmarks and the time different between both will not be affected largely but if you calculate the ratios between the benchmark results, they will change. What I mean is this:

Current setup (example numbers)
Benchmark 1: 10 ms/op
Benchmark 2:  9 ms/op

Difference: 1 ms/op, ratio = 0.90
Improved setup (example numbers)
Benchmark 1: 3 ms/op
Benchmark 2: 2 ms/op

Difference: 1 ms/op, ratio = 0.67
I would just create a single buffer and re-use it everywhere since thread-safety is no issue here :-)

Agree with this and will look into this improvement to the benchmark but will actually look again at the mapping approach.

Good luck with making a release and hope seeing the jHDF code has been useful.

Apollo3zehn commented 4 months ago

Ah I forgot to note that I plan to create benchmarks to compare all reimplemented HDF5 libraries (Java, Julia, C#, Python, (JS?)) but that has no high priority right now. I am working hard on preparing my project for a first release and I am happy when that is done as HDF5 is a tough nut :D

I have been working on the before-mentioned benchmarking repository and it is now in a usable state. This repository contains benchmarks for jHDF, PureHDF and pyfive as well as the C-library (HDF5 1.10) using a very thin C# wrapper.

The idea is that every commit on that repository triggers a GitHub CI action which runs all benchmarks. Running benchmarks on a CI is not 100% reliable, but still quite good. The results are being commited to the gh-pages branch and the history of results is then rendered there via Chart.js: https://apollo3zehn.github.io/hdf5-benchmark/

I already found it quite useful to be able to track down the most important performance killers (using a performance profiler) and then see the real effects in the graphs.

I am no Java expert and for a benchmark to be meaningful it should be written correctly so they compare well to the benchmarks of the other libraries. So if you are interested, it would be great if could have a look to these benchmarks: https://github.com/Apollo3zehn/hdf5-benchmark/tree/master/java-jhdf/test/src/main/java/org/java_jhdf.

For example, the C-library and PureHDF both allow passing a preallocated array to the read method and thus in case of C#, the garbage collector has much less to do. I am not sure if the same is possible with jHDF. If yes, it should be added to the benchmark :-)

A general problem is caching. On the one hand, caching is an important performance feature but it can disturb the benchmark which reads the same dataset over and over again. So ideally the cache is active during the read operation and then reset for the next loop iteration. This heavily affects the performance of the C-library in a (false) positive way and I am currently trying to clarify this with the HDF group (https://forum.hdfgroup.org/t/reset-chunk-cache/12424).

Please tell me if you have other benchmark ideas :-) I tried to cover the most useful ones for now.

jamesmudd commented 4 months ago

Thanks so much for this work. Its really interesting. Makes me want to do some profiling and see if jHDF could be faster. The benchmarks looks reasonable to me, some of the other do the validation outside the timed section so that is slightly different, and the allocation of the arrays as you mention in included in the java there isn't any way to provide a pre allocated array at the moment but could be something I could look at, would want to make sure the API makes sense though.

Apollo3zehn commented 4 months ago

Thanks for your feedback! The validation is not outside yet because it contributes only a tiny amount of the total time needed for the tests which have been slowed down to be in the milliseconds range instead of microseconds. But I am still working on fine-tuning the benchmarks so I will put it on my todo list :-)

jamesmudd / jhdf

About the new memory-mapped-file benchmark #454