steven-varga / h5cpp

C++17 templates between [stl::vector | armadillo | eigen3 | ublas | blitz++] and HDF5 datasets
http://h5cpp.org

How to install the h5pp or the h5cpp-compiler on macOS? #78

Closed: xdotli closed this issue 2 years ago

xdotli commented 2 years ago

I'm sorry if this sounds like a stupid question, but I'm very new to C++ development. Since I'm on a Mac, I couldn't follow the Linux commands provided on the download page. I wonder how I could use this library in an arbitrary project working with dlib (for example, which files to copy into the /usr/local/include directory).

steven-varga commented 2 years ago

Hmm... I don't have a macOS machine to work with; we could have a chat about this if there's any interest. It shouldn't be a biggie, as the LLVM toolchain works on macOS.

xdotli commented 2 years ago

@steven-varga Hi! And thank you for your reply.

I can now successfully compile the program with the following verbose command:

g++ -std=c++17 -o test cca.cpp -I /usr/local/include/ -L /usr/local/lib/ -lhdf5 -lhdf5_hl

But when I run the executable it gives an "unable to open dataset" error:

HDF5-DIAG: Error detected in HDF5 (1.12.1) thread 0:
  #000: H5D.c line 285 in H5Dopen2(): unable to open dataset
    major: Dataset
    minor: Can't open object
  #001: H5VLcallback.c line 1910 in H5VL_dataset_open(): dataset open failed
    major: Virtual Object Layer
    minor: Can't open object
  #002: H5VLcallback.c line 1877 in H5VL__dataset_open(): dataset open failed
    major: Virtual Object Layer
    minor: Can't open object
  #003: H5VLnative_dataset.c line 123 in H5VL__native_dataset_open(): unable to open dataset
    major: Dataset
    minor: Can't open object
  #004: H5Dint.c line 1483 in H5D__open_name(): not found
    major: Dataset
    minor: Object not found
  #005: H5Gloc.c line 442 in H5G_loc_find(): can't find object
    major: Symbol table
    minor: Object not found
  #006: H5Gtraverse.c line 837 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #007: H5Gtraverse.c line 613 in H5G__traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #008: H5Gloc.c line 399 in H5G__loc_find_cb(): object 'create then write' doesn't exist
    major: Symbol table
    minor: Object not found
libc++abi: terminating with uncaught exception of type h5::error::io::dataset::open: /usr/local/include/h5cpp/H5Dopen.hpp line#  30 : opening dataset failed...
[1]    71327 abort      ./test

Do you happen to know any common causes of this error? I have been working far too long just to read an HDF5 file. For now I have converted the original file to several JSON files to read into my program. However, the JSON files turn out to be much larger than the HDF5 files; do you think this will cause performance issues?

steven-varga commented 2 years ago

Can you list the version of the file? Would it be possible to try it with libhdf5 v1.10.6? BTW: no need for hdf5_hl. Can you share the file?

Not sure what you are trying to do; JSON is not for HPC and has different properties and use cases. In fact the acronym gives it away: JavaScript Object Notation. HDF5, by contrast, is like an ext4 filesystem with a convenient API and, most importantly, MPI-IO capability.
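
A quick way to see what an HDF5 file actually contains (dataset names, shapes, types) is the command-line tools that ship with libhdf5. A minimal sketch, using the file name that appears later in this thread:

h5ls -r 1000hpa.h5     # recursively list all groups and datasets
h5dump -H 1000hpa.h5   # print object headers only, no raw data

An "unable to open dataset" error is often just a mismatch between the path handed to the reader and the names this listing reports.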

xdotli commented 2 years ago

Sure, I'm using HDF5 1.12.1. The code is below:

#include <iostream>
#include <dlib/matrix.h>
#include <dlib/statistics/cca.h>
#include <h5cpp/all>

using namespace std;
using namespace dlib;

// shorthand for dlib's dynamically sized matrix
template <class T>
using Matrix = dlib::matrix<T>;

int main()
{
  // read the dataset named "create then write" from 1000hpa.h5
  Matrix<short> M = h5::read<Matrix<short>>("1000hpa.h5", "create then write");
  return 0;
}

Regarding the choice of JSON: well, I have worked with JavaScript and Python the most, and I'm simply trying to read a 726×14729 matrix into my program, so I thought dumping the HDF5 data into JSON and reading the JSON into my program might be possible.

By the way, I'm using nlohmann/json, where the whole library is a single header file. Should I delete the JSON object once the data are stored in matrices?
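
In ordinary C++ there is nothing to delete explicitly: a nlohmann::json object releases its storage when it goes out of scope. A minimal sketch of that pattern, assuming the JSON holds an array of rows (the layout and function name are assumptions, not taken from this thread):

#include <fstream>
#include <string>
#include <dlib/matrix.h>
#include <nlohmann/json.hpp>

// parse a JSON array-of-rows into a dlib matrix; the json object is a
// local variable, so its storage is freed automatically on return
dlib::matrix<short> load_matrix(const std::string& path)
{
    std::ifstream is(path);
    nlohmann::json j;
    is >> j;                                  // parse the whole file

    dlib::matrix<short> M(j.size(), j.at(0).size());
    for (long r = 0; r < M.nr(); ++r)
        for (long c = 0; c < M.nc(); ++c)
            M(r, c) = j.at(r).at(c).get<short>();
    return M;                                 // j is destroyed here
}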

xdotli commented 2 years ago

I think I'm not expressing my concern clearly enough. The computation-heavy part of my program is the matrix calculations, so I wonder whether I can assume that the performance of the preceding I/O part is not as relevant?

steven-varga commented 2 years ago

It depends on the size of the matrix and on convenience. Some of us do prototyping on a statistical platform such as Julia/Matlab/R and then save/export the data to HDF5. It is convenient to load it from C++ regardless of its size, then proceed to a fast implementation in C++ with some linear algebra library. This is one use case of H5CPP.

Alternatively, you have datasets of 10GB and up and need efficient, scalable IO. In that case the IO performance could be important.

Overall you can think of this question as walking a Pareto front of implementation | maintenance | IO | runtime cost, which can only be answered (by a constrained-optimisation mathematical program) once you have the values ready.
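
One cheap way to get such values is to time the IO phase against the compute phase. A minimal sketch with std::chrono (the dataset name is a placeholder):

#include <chrono>
#include <iostream>
#include <dlib/matrix.h>
#include <h5cpp/all>

int main()
{
    using clock = std::chrono::steady_clock;

    auto t0 = clock::now();
    auto M  = h5::read<dlib::matrix<short>>("1000hpa.h5", "some dataset"); // placeholder name
    auto t1 = clock::now();

    // ... the matrix computations would go here ...
    auto t2 = clock::now();

    std::cout << "IO:      " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "compute: " << std::chrono::duration<double>(t2 - t1).count() << " s\n";
}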

xdotli commented 2 years ago

Thank you so much for your answer. The matrix is 726×14965, and my output is supposed to be six 726×726 matrices.

If this prototyping-to-production workflow is so prevalent, then it's imperative for me to work out a way to make this library run on my computer: I'm a new big-data research assistant at school, and while the other team members do their prototyping in R/Python, I have to deliver C++ code that implements their algorithms in parallel.

Anyway, I used brew install hdf5 as my last attempt to make h5cpp work for me, but the library still gave me the "unable to open dataset" error. Before that I had errors like Undefined symbols for architecture x86_64: "***". Would you say I have a better chance of making all of this work on a Linux server? That way I wouldn't be trying to install all kinds of libraries everywhere. Thank you very much!

steven-varga commented 2 years ago

It works on POSIX with C++17, and as I mentioned before I am open to a conference call (I don't have a Mac). Avoid the OS package manager; this is HPC, where Spack is more likely to be used. Instead, install the components from source. Here is a laundry list:
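
For libhdf5 itself, such a from-source build typically looks like the following sketch (the version matches the v1.10.6 suggested earlier; the prefix follows the ./configure convention shown below):

tar -xzf hdf5-1.10.6.tar.gz        # source tarball from the HDF Group
cd hdf5-1.10.6
./configure --prefix=/usr/local    # install under /usr/local
make -j4
sudo make install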

I am working on a reference platform: a rental cluster on AWS EC2 with the proper settings and a convenient VS Code front end; it will take a few more weeks to bring it online.

Let me know about the call.

xdotli commented 2 years ago

Before submitting this issue I had actually noticed issue #42, where you are testing h5cpp's compatibility with different compilers. I'd love to go through the laundry list and install these components, but I have a deadline about 12 hours from now. I'll work on this for my next assignment and report back here.

Thanks again for your help! I will also try setting up a server environment to do the job. I'm familiar with vim, so I think I'll test the h5cpp library there before doing any further setup.

xdotli commented 2 years ago

@steven-varga Sorry to bother you again! I tried installing HDF5 1.10.6, but I don't know which folder to put it in. Should I dump the files in /usr/local/include?

steven-varga commented 2 years ago

H5CPP doesn't care where you install HDF5. As for the H5CPP headers: copy them to /usr/local/include/h5cpp, then in your makefiles use gcc -I/usr/local/include (a concrete sketch follows the listing below). It is customary to install local packages under the user-local prefix: ./configure --prefix=/usr/local. Below are the default settings (after configure):

Features:
---------
                   Parallel HDF5: no
Parallel Filtered Dataset Writes: no
              Large Parallel I/O: no
              High-level library: yes
                    Threadsafety: no
             Default API mapping: v110
  With deprecated public symbols: yes
          I/O filters (external): deflate(zlib)
                             MPE: no
                      Direct VFD: no
                         dmalloc: no
  Packages w/ extra debug output: none
                     API tracing: no
            Using memory checker: no
 Memory allocation sanity checks: no
             Metadata trace file: no
          Function stack tracing: no
       Strict file format checks: no
    Optimization instrumentation: no
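
To make the header-copy step concrete, a sketch that assumes the headers come from a fresh clone of this repository:

git clone https://github.com/steven-varga/h5cpp.git
sudo cp -r h5cpp/h5cpp /usr/local/include/    # header-only: copying is the whole install
g++ -std=c++17 -I/usr/local/include -o test cca.cpp -lhdf5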

And there is no need to link against libhdf5_hl.so; instead use the templated h5::append operator on h5::pt_t<T> for packet tables, which is much faster.
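
A rough sketch of that append pattern, modelled on the H5CPP packet-table examples (file name, dataset name, and chunk size are illustrative):

#include <h5cpp/all>

int main()
{
    h5::fd_t fd = h5::create("stream.h5", H5F_ACC_TRUNC);
    // an extendable, chunked dataset backs the packet table
    h5::pt_t pt = h5::create<double>(fd, "measurements",
            h5::max_dims{H5S_UNLIMITED}, h5::chunk{1024});
    for (double v : {1.0, 2.0, 3.0})
        h5::append(pt, v);   // buffered appends, flushed when pt goes out of scope
}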

xdotli commented 2 years ago

It turned out that the "unable to open dataset" error was due to a Python program I had forgotten to shut down, which was still reading the file; that somehow made the dataset unavailable. As soon as I quit the Python script I could use h5cpp to open the file successfully.

Thank you @steven-varga for guiding me through reinstalling the libraries from source and for laying out the workflow around high-performance data I/O and computing. Your library is awesome!