jamesmudd / jhdf

A pure Java HDF5 library
http://jhdf.io
MIT License
137 stars · 37 forks

Add HDF5 writing capability? #354

Closed · zhuam closed this 2 months ago

zhuam commented 2 years ago

As per the title.

Thanks

jamesmudd commented 2 years ago

Thanks for raising the issue. I would like to add writing support and it's currently a work in progress.

I have implemented a few of the prerequisites, e.g. the Jenkins hash, and there is a branch https://github.com/jamesmudd/jhdf/tree/writing which has a test https://github.com/jamesmudd/jhdf/blob/writing/jhdf/src/test/java/io/jhdf/SimpleWritingTest.java that will write an empty file that can be opened; however, there is still lots of work to do.
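
For readers curious what that prerequisite involves: HDF5 checksums its metadata using Bob Jenkins' lookup3 hash. Below is a minimal from-scratch Java sketch of the "hashlittle" variant, for illustration only; it is not jHDF's actual implementation.

public final class JenkinsLookup3 {

    private static int rot(int x, int k) {
        return (x << k) | (x >>> (32 - k));
    }

    /** Bob Jenkins' lookup3 "hashlittle" over a byte array. */
    public static int hash(byte[] k, int initval) {
        int len = k.length;
        // Internal state; the magic constant comes from the reference lookup3.c
        int a = 0xdeadbeef + len + initval;
        int b = a;
        int c = a;
        int i = 0;

        // Mix 12-byte blocks
        while (len > 12) {
            a += word(k, i);
            b += word(k, i + 4);
            c += word(k, i + 8);
            // mix(a, b, c)
            a -= c; a ^= rot(c, 4);  c += b;
            b -= a; b ^= rot(a, 6);  a += c;
            c -= b; c ^= rot(b, 8);  b += a;
            a -= c; a ^= rot(c, 16); c += b;
            b -= a; b ^= rot(a, 19); a += c;
            c -= b; c ^= rot(b, 4);  b += a;
            i += 12;
            len -= 12;
        }

        // Handle the last (possibly partial) block; cases fall through on purpose
        switch (len) {
            case 12: c += (k[i + 11] & 0xff) << 24;
            case 11: c += (k[i + 10] & 0xff) << 16;
            case 10: c += (k[i + 9] & 0xff) << 8;
            case 9:  c += (k[i + 8] & 0xff);
            case 8:  b += (k[i + 7] & 0xff) << 24;
            case 7:  b += (k[i + 6] & 0xff) << 16;
            case 6:  b += (k[i + 5] & 0xff) << 8;
            case 5:  b += (k[i + 4] & 0xff);
            case 4:  a += (k[i + 3] & 0xff) << 24;
            case 3:  a += (k[i + 2] & 0xff) << 16;
            case 2:  a += (k[i + 1] & 0xff) << 8;
            case 1:  a += (k[i] & 0xff);
                break;
            case 0:  return c; // length was a multiple of 12; c is already the hash
        }

        // final(a, b, c)
        c ^= b; c -= rot(b, 14);
        a ^= c; a -= rot(c, 11);
        b ^= a; b -= rot(a, 25);
        c ^= b; c -= rot(b, 16);
        a ^= c; a -= rot(c, 4);
        b ^= a; b -= rot(a, 14);
        c ^= b; c -= rot(b, 24);
        return c;
    }

    // Little-endian 32-bit read
    private static int word(byte[] k, int i) {
        return (k[i] & 0xff)
                | (k[i + 1] & 0xff) << 8
                | (k[i + 2] & 0xff) << 16
                | (k[i + 3] & 0xff) << 24;
    }
}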

As this is a spare-time project (and I don't have much spare time at the moment), I can't commit to a timescale for implementing writing. I will leave the issue open and point to it for people to react to, as a way to gauge interest.

zhuam commented 2 years ago

Thank you @jamesmudd. Could you add some basic writing functions, such as writing attributes or variables, so that we can also participate?

jamesmudd commented 2 years ago

Unfortunately it's not really easy to add "basic" writing functions; currently there is only support for writing a few very limited structures. jHDF will need to add writing for many more structures and then also add support for laying out the file on disk (currently this is hard-coded in the branch). The design of the API also needs to be considered so that it makes sense to use.

jonathanschilling commented 2 years ago

Hi, regarding the basic functionality, I would like to point to @uhoefel's and my work on the Nujan library (uhoefel/nujan), where we tried to slightly modernize it. If writing capabilities were added to this project, we would be happy to retire our Nujan fork in favour of this library :-)

Reissner commented 2 years ago

I also need both reading and writing. My project is an integration of Octave into Java. Reading/writing variables happens via save/load to file formats, i.e. streams. I took over this project from a friend, and he used the textual format. This is slow and also introduces small errors due to decimal/binary conversion. I could use the MATLAB internal format, but this would require reengineering. Read/write access would be very much appreciated.

vedina commented 2 years ago

Read/Write access much appreciated as well

moraru commented 2 years ago

+1

jbfaden commented 1 year ago

+1

jbfaden commented 1 year ago

I've been using a NetCDF reader to read some HDF5 files for years, and I would love to be able to use this library. However, I need to be able to write HDF5 as well.

marcelluethi commented 1 year ago

Like many others, I was looking for a solution to write HDF5 files without adding native libraries to my project. After trying and failing with Nujan (which writes the HDF5 files just fine, but for some reason the files cannot be read with jHDF), I found a workaround which solves my problem for the moment.

I write the files in the hdf5-json format and use the converters provided by the HDF Group to create the HDF5 files. It is reasonably straightforward to write files in the hdf5-json format. I also published my code on GitHub (scalismo-hdf5-json) in case somebody wants to go down the same route.

However, I would also love to see writing support in jHDF. And while I'm at it, I would also like to thank everybody involved in creating jHDF for the awesome work. The project is great, and having a pure Java library to read HDF5 is super helpful.

jamesmudd commented 1 year ago

> Like many others, I was looking for a solution to write HDF5 files without adding native libraries to my project. After trying and failing with Nujan (which writes the HDF5 files just fine, but for some reason the files cannot be read with jHDF), I found a workaround which solves my problem for the moment.

Would you be able to open another issue with an example of a file jHDF cannot open? I might be able to fix that.

> I write the files in the hdf5-json format and use the converters provided by the HDF Group to create the HDF5 files. It is reasonably straightforward to write files in the hdf5-json format. I also published my code on GitHub (scalismo-hdf5-json) in case somebody wants to go down the same route.

> However, I would also love to see writing support in jHDF. And while I'm at it, I would also like to thank everybody involved in creating jHDF for the awesome work. The project is great, and having a pure Java library to read HDF5 is super helpful.

Thanks a lot for the comments. Know there is a lot of interest in writing support and I hope to get some time to work on it soon!

marcelluethi commented 1 year ago

Sorry for the delayed reply. You can find a simple example that showcases the problem in this gist.

It throws the following exception when reading the file: io.jhdf.exceptions.UnsupportedHdfException: Superblock extension is not supported

It is easy to change Nujan so that it sets the superblock extension flag differently. However, after doing that, another error was thrown, which I had no idea how to solve. My knowledge of HDF5 is, unfortunately, extremely limited.

thadguidry commented 9 months ago

Sponsoring you now, specifically to help work on this issue: sent $300. Go, go, go @jamesmudd!

jamesmudd commented 9 months ago

@thadguidry Thanks very much for the sponsorship. I will prioritise working on this in my free time and hopefully give an update soon.

jamesmudd commented 8 months ago

Some good progress: #530 implements much of the required logic. It's a large PR so it still needs quite a lot of tidy-up, but it can write the structure of a file (i.e. only groups, but with any nesting). So it's a pretty big step IMO. I want to try and clean this up a bit and merge it, then I will look at writing datasets. I think I will aim for just int[] and double[] initially and then maybe consider a release. Would be happy to hear any feedback.
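
For anyone wanting a concrete picture of what this enables, writing a nested group structure ends up looking roughly like this (a sketch based on the WriteHdf5.java example linked later in this thread; package and method names are taken from the eventually released API and may differ from the PR):

import io.jhdf.HdfFile;
import io.jhdf.WritableHdfFile;
import io.jhdf.api.WritableGroup;

import java.nio.file.Paths;

public class WriteGroupsSketch {
    public static void main(String[] args) {
        // Closing the file (via try-with-resources) flushes the structure to disk
        try (WritableHdfFile hdfFile = HdfFile.write(Paths.get("groups.hdf5"))) {
            WritableGroup outer = hdfFile.putGroup("outer");
            outer.putGroup("inner"); // groups can be nested arbitrarily
        }
    }
}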

jons-pf commented 8 months ago

Thanks a lot, @jamesmudd (and @thadguidry for the sponsorship)!

thadguidry commented 8 months ago

@jamesmudd Love how you worked on this; just reading your commits shows how you probably spent the first hour just thinking, researching, and writing down an outline of the work to be done. You saw the big chunk of problems, broke it up into small bite-sized pieces, broke those into even tinier pieces, and then began to implement them and write tests. You'd be a great mentor to others. Seriously.

Apollo3zehn commented 8 months ago

@jamesmudd it might be useful for the unit tests to use h5dump to dump files written by jHDF, which helps ensure compatibility with the HDF5 C library.

The following GitHub Actions example shows how to quickly install h5dump:

- name: Download HDF5 installer
  if: steps.cache-primes.outputs.cache-hit != 'true'
  run: wget -q https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.14/hdf5-1.14.1/bin/unix/hdf5-1.14.1-2-Std-ubuntu2204_64.tar.gz

- name: Install
  run: |
    tar -xzf hdf5-1.14.1-2-Std-ubuntu2204_64.tar.gz
    hdf/HDF5-1.14.1-Linux.sh --prefix=hdf --skip-license
    sudo ln -s $(pwd)/hdf/HDF_Group/HDF5/1.14.1/bin/h5dump /usr/bin/h5dump
    h5dump --version

And then use it in the unit tests to compare the output against an expected dump (C# example):

// Requires: using System; using System.Diagnostics; using System.IO; using Xunit;

var actual = DumpH5File(filePath);

var expected = File
    .ReadAllText("DumpFiles/attribute_on_group.dump")
    .Replace("<file-path>", filePath);

Assert.Equal(expected, actual);

// ...

public static string? DumpH5File(string filePath)
{
    var dump = default(string);

    var h5dumpProcess = new Process
    {
        StartInfo = new ProcessStartInfo
        {
            FileName = "h5dump",
            Arguments = filePath,
            UseShellExecute = false,
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            CreateNoWindow = true
        }
    };

    h5dumpProcess.Start();

    // Collect stdout line by line
    while (!h5dumpProcess.StandardOutput.EndOfStream)
    {
        var line = h5dumpProcess.StandardOutput.ReadLine();

        if (dump is null)
            dump = line;

        else
            dump += Environment.NewLine + line;
    }

    // Then append anything h5dump wrote to stderr
    while (!h5dumpProcess.StandardError.EndOfStream)
    {
        var line = h5dumpProcess.StandardError.ReadLine();

        if (dump is null)
            dump = line;

        else
            dump += Environment.NewLine + line;
    }

    return dump;
}

An example h5dump output for a file with one group and two attributes would look like this:

HDF5 "<file-path>" {
GROUP "/" {
   GROUP "group" {
      ATTRIBUTE "attribute 1" {
         DATATYPE  H5T_IEEE_F64LE
         DATASPACE  SCALAR
         DATA {
         (0): 99.2
         }
      }
      ATTRIBUTE "attribute 2" {
         DATATYPE  H5T_IEEE_F64LE
         DATASPACE  SCALAR
         DATA {
         (0): 99.3
         }
      }
   }
}
}

I think this approach makes it much easier to validate HDF5 files compared to using the C library directly (or via a wrapper), because that might become quite difficult for more complex features (e.g. compounds).

When the h5dump call succeeds and the output is as expected, the file is valid.

jamesmudd commented 8 months ago

Thanks @Apollo3zehn, I really like this idea. It would be good to have tests confirming compatibility. Currently I have been doing this manually, but I think this approach could work well. It might be a little trickier for my CI as I'm currently building on all platforms, but it should be possible. I think h5dump supports JSON output, so I might look at parsing that back to do assertions.
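
In case a Java flavour of that helper is useful for jHDF's own tests, here is a minimal sketch of the same idea (it assumes h5dump is on the PATH; the class and method names are illustrative, not part of jHDF):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

public final class H5Dump {

    /** Runs h5dump on the given file and returns its combined stdout/stderr output. */
    public static String dump(Path hdf5File) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("h5dump", hdf5File.toString())
                .redirectErrorStream(true) // merge stderr into stdout so one read captures everything
                .start();
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        process.waitFor();
        return output;
    }
}

A test could then compare dump(path) against a checked-in expected dump, substituting the <file-path> placeholder as in the C# example above.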

jamesmudd commented 8 months ago

#535 is the next PR. Still lots to do, but it successfully writes an int[] dataset which can be read back by jHDF and HDFView. So IMO another milestone. I'm also thinking most of the unknowns have been tackled and now it's a matter of building out support from this POC.

jamesmudd commented 8 months ago

I have now merged #535, which adds basic dataset writing support. The next plan is to make an alpha release so people can try it out, then work on compatibility testing and cleaning up the code so hopefully others can help build out wider support. I'd also be interested in which writing support would be most useful to prioritise.

jons-pf commented 8 months ago

Amazing, thanks a lot!

There are a bunch of test files in the mainline HDF5 repo: https://github.com/HDFGroup/hdf5/tree/develop/test/testfiles. The ultimate goal for jHDF could be to be able to reproduce all of them?

In the shorter term, I would suggest targeting the following functionality:

@uhoefel Wonders still happen!

jamesmudd commented 8 months ago

Have just published v0.7.0-alpha, which includes the initial writing support. It can write group structures and n-dimensional int and double datasets. It should be on Maven Central shortly.

See WriteHdf5.java for example usage.

If anyone tries this out would be great to hear about the results.
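
For convenience, the example boils down to roughly the following (a sketch of the alpha API; see the linked WriteHdf5.java for the authoritative version):

import io.jhdf.HdfFile;
import io.jhdf.WritableHdfFile;

import java.nio.file.Paths;

public class WriteHdf5Sketch {
    public static void main(String[] args) {
        try (WritableHdfFile hdfFile = HdfFile.write(Paths.get("jhdf.hdf5"))) {
            // n-dimensional int and double datasets
            hdfFile.putDataset("ints", new int[] {1, 2, 3, 4});
            hdfFile.putDataset("doubles2d", new double[][] {{1.0, 2.0}, {3.0, 4.0}});
            // plus group structures, as in the earlier sketch
            hdfFile.putGroup("aGroup");
        }
    }
}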

cfoushee commented 7 months ago

I could use this for a project right now if you added the ability to write char[]; I'm assuming that would simply be a byte[] in Java. Also, I would need to be able to write at least two attribute types, like long and string, and associate them with a dataset.

I did try out WriteHdf5.java and it works perfectly for me.

jamesmudd commented 7 months ago

Thanks for giving it a try and great to hear it worked well.

Do you actually want to write char[] or a String dataset?

I think attributes should be possible. I have got a bit sidetracked working on interoperability tests. That's proving harder than I thought, so I think I should leave it for now, work on adding some more support like attributes, and make a first release with writing.

cfoushee commented 7 months ago

I need to write a Java byte[], which I assume maps to an HDF5 char[].

jamesmudd commented 7 months ago

I have just merged support for writing byte[]. I will probably break attributes out into another issue.
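
For anyone following along, a quick sketch of what this enables, assuming the same putDataset entry point used for int[] and double[] (byte[] should come out as an 8-bit integer dataset on the HDF5 side):

try (WritableHdfFile hdfFile = HdfFile.write(Paths.get("bytes.hdf5"))) {
    hdfFile.putDataset("bytes", new byte[] {0x01, 0x02, 0x03});
}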

cfoushee commented 7 months ago

I was successfully able to write a dataset with bytes. Thank you!!

jln-ho commented 6 months ago

+1

jamesmudd commented 4 months ago

Have just released v0.7.0, which adds writing support. Thanks for all the interest in this, and I hope people give it a try. I am aware it's still limited; I intend attributes and string datasets to be the next features added.

thadguidry commented 4 months ago

Cool. One of the things I am planning is for OpenRefine 4.0 to eventually have an HDF5 exporter.

jamesmudd commented 2 months ago

With the v0.8.0 release I'm going to close this issue. Writing HDF5 files is now possible with jHDF. There are still things to add, but I think these are better tracked as new, smaller issues. If you want a writing feature that is not possible at the moment, feel free to open another issue.

Special thanks to @thadguidry for the support.