q-posev commented 1 year ago

Hello Psi4! First of all, thank you for your hard work on improving Psi. This is not a bug report but rather a possible enhancement: we have recently developed a wave function format called TREXIO with a focus on self-consistency and I/O performance. The format comes with a C library and an API to interact with the data: it has text (ASCII-based) and HDF5 (binary) back ends for I/O. Bindings in Python, Fortran and OCaml are available. Sparse data like 2e integrals or CI coefficients are stored in a sparse data representation (similar to FCIDUMP), which significantly improves the performance of the I/O (especially in the HDF5 case). The source code can be found here and a detailed description of the format and the API can be found here.
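To make the sparse layout concrete, here is a minimal Python sketch of an FCIDUMP-like (indices, value) representation. It is not TREXIO code: the function name `to_sparse` and the threshold are assumptions chosen for illustration, and the dense array is a plain nested list.

```python
from itertools import product


def to_sparse(dense, n, threshold=1e-12):
    """Return (indices, values) for the entries of an n^4 nested list
    whose magnitude exceeds the threshold (hypothetical helper)."""
    indices, values = [], []
    for i, j, k, l in product(range(n), repeat=4):
        v = dense[i][j][k][l]
        if abs(v) > threshold:
            indices.append((i, j, k, l))
            values.append(v)
    return indices, values


n = 2
# Dense 2x2x2x2 array filled with zeros, then two non-zero entries.
dense = [[[[0.0] * n for _ in range(n)] for _ in range(n)] for _ in range(n)]
dense[0][0][0][0] = 0.75
dense[0][1][1][0] = -0.25

idx, val = to_sparse(dense, n)
```

Only the two non-zero entries end up in `idx` and `val`, which is why this layout pays off for large, mostly empty arrays.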

Would it be interesting to have TREXIO as an alternative I/O back end in Psi4? The library is packaged for both PyPI and conda (via conda-forge). I am one of the core developers and can help/contribute.

JonathonMisiewicz commented 1 year ago

Hi, I'm one of the core developers for Psi. One of my personal objectives for this year is to replace our current I/O system, which has not been significantly redesigned since its original creation in the '90s. While it fulfilled its original design goal of simplifying the API compared to its predecessor, its steep learning curve and hard-to-understand error messages are obstacles to continued Psi development, given that our graduate-student-led development inherently has high turnover.

I can tell you now that your project is of immediate interest. I'll look over the project in detail and have more detailed thoughts for you by this time next week.

q-posev commented 1 year ago

Glad to hear that we are right on time :-)

The error handling is something central for us too as we aimed to make trexio user-friendly. Btw, if you notice something needed for Psi4 which is currently missing in the trexio format - feel free to open an issue in the repo or ping me.

loriab commented 1 year ago

Aside from I/O backend, trexio could join fchk, fcidump, qcschema wfn, molden as an export format.

q-posev commented 1 year ago

This will be particularly interesting for the Quantum Monte-Carlo users of TREXIO since some of them use TREXIO files as an input (the format was initiated within the QMC community).

scemama commented 1 year ago

Hi! I am also a developer of TREXIO, and of the Quantum Package software. In Quantum Package, we can import and export data from TREXIO files. We use this to export wave functions for QMC codes, to send one- and two-body RDMs to the GammCor code for SAPT calculations, and to export integrals for FCIQMC calculations with NECI. It would be nice for us to also be able to exchange data with Psi4 (in both directions). I am also willing to help if needed!

JonathonMisiewicz commented 1 year ago

First batch of questions:

1. Can TREXIO be extended to support other C++ types, such as complex or double?
2. If a section has no data, e.g., Psi doesn't use a cell or periodic boundary calculations, I assume that consumes no memory?
3. Is there a way for us to store an intermediate with an arbitrary name? For example, let's say that we have a coupled cluster code that needs to store on disk not only the T2 amplitudes but an amplitude called W. Can we do that?

q-posev commented 1 year ago

Hi @JonathonMisiewicz

> Can TREXIO be extended to support other C++ types, such as complex or double?

double type is fully supported (it is the default for floats or can be explicitly accessed by using the _64 suffix in the API). In fact, we fully support 32- and 64-bit integers and floats as well as strings and arrays of strings. The complex type is supported implicitly, namely the real and imaginary parts can be written in two independent calls to the TREXIO API.
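As a rough illustration of the "two independent calls" idea for complex data, here is a small Python sketch. The dictionary `store` is a hypothetical stand-in for a file, and the key names `amp_real` and `amp_imag` are assumptions; no TREXIO function is called.

```python
def split_complex(values):
    """Split a list of complex numbers into separate real and imaginary lists."""
    real = [z.real for z in values]
    imag = [z.imag for z in values]
    return real, imag


def join_complex(real, imag):
    """Rebuild the complex list from its two real-valued halves."""
    if len(real) != len(imag):
        raise ValueError("real and imaginary parts differ in length")
    return [complex(r, i) for r, i in zip(real, imag)]


store = {}  # hypothetical stand-in for the file on disk
amplitudes = [1 + 2j, 0.5 - 1j]

# Two independent "writes": one per real-valued array.
store["amp_real"], store["amp_imag"] = split_complex(amplitudes)

# Two independent "reads", then recombination on the caller's side.
restored = join_complex(store["amp_real"], store["amp_imag"])
```

The reading code has to know that the two arrays belong together, which is the price of the implicit support.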

> If a section has no data, e.g., Psi doesn't use a cell or periodic boundary calculations, I assume that consumes no memory?

Exactly, it is up to the user to decide which data to store.

> Is there a way for us to store an intermediate with an arbitrary name? For example, let's say that we have a coupled cluster code that needs to store on disk not only the T2 amplitudes but an amplitude called W. Can we do that?

I think you would need to modify the format for that: only items listed in trex.org (trex.json) can be written, since the source code of the library is auto-generated from the format specification. @scemama please correct me if I am wrong.

JonathonMisiewicz commented 1 year ago

Okay, great. If the answer to the third question is "no", then I'd be interested to have Psi4 support TREXIO, but it's not going to solve our "big picture" I/O problem.

scemama commented 1 year ago

Hi! thank you, this is an interesting point and I am sure other code developers will raise the same issue.

One of the main goals of TREXIO is to make it easy for different codes to exchange data. However, I understand that Psi4 may have specific needs that may not be included in the current version of TREXIO. One solution could be to use HDF5 for temporary files and store the final results with TREXIO.

But: a better option is to fork TREXIO and extend it to fit Psi4's specific needs. You could add your personal temporary arrays in a specific group named psi4 for example, and link your own library with psi4. Additionally, if any modifications prove to be useful to other codes, they can be submitted as pull requests to the official library.

The good thing with this strategy is that as long as you don't remove anything from the trex.org file, the files that you will produce will be detected as valid TREXIO files. Of course, there will be no way to access your specific data with the official library, but the files will be compatible with both the official and the custom library.

@q-posev : We could think of a mechanism to generalize the possibility to extend the library for private data. Instead of reading only trex.json, we could let the script handle multiple json files to allow users to extend the library with custom groups. It could probably be integrated at the level of the configure script. In this way, @JonathonMisiewicz would only need to keep a JSON file in the git repo of psi4 to extend TREXIO instead of maintaining a fork of TREXIO and keeping his fork in sync with the official one.
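A minimal sketch of what merging several JSON specification files could look like, assuming a hypothetical helper `merge_specs` that refuses to redefine an existing group. The two inline JSON snippets are illustrative only and are not copied from the real trex.json.

```python
import json


def merge_specs(official, *extensions):
    """Merge extension specs into the official one, rejecting group name clashes."""
    merged = dict(official)
    for ext in extensions:
        for group, attrs in ext.items():
            if group in merged:
                raise ValueError(f"group '{group}' is already defined")
            merged[group] = attrs
    return merged


# Illustrative stand-ins for the official spec and a Psi4-side extension file.
official = json.loads('{"mo": {"num": ["dim", []]}}')
psi4_ext = json.loads('{"psi4": {"w": ["float", ["mo.num", "mo.num"]]}}')

merged = merge_specs(official, psi4_ext)
```

Rejecting duplicate group names at merge time keeps the official groups untouched, so files written with the extended library would still contain the standard data in the standard place.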

JonathonMisiewicz commented 1 year ago

Thanks for the response. Psi is no stranger to forking our dependencies to fit our needs.

Remember that one of our requirements is ease of use. For example, let's take our dfmp2 code. We need to store over 18 different intermediates on disk, most of which are meaningless outside of the context of DFMP2. Needing a JSON file that lists all intermediates, and making sure that the intermediate names of different modules don't clash, are problems we don't have in the current code, and introducing them would make the code harder to use. That said, I see the merit in having a unified listing of all intermediates, at least on a per-module level. I'll think more about this, and of course, I can only speak for myself, not all Psi core developers.

scemama commented 1 year ago

OK, I understand. In that case, maybe writing your own wrapper around HDF5 for temporary files would be a better option, because you would be able to pass strings to functions to specify the data you manipulate, while in TREXIO we have different functions for different data. So creating a new intermediate in the code would be straightforward.

JonathonMisiewicz commented 1 year ago

Just to make sure I understand how the library works: During the installation procedure, there will be a trex.json file added. The contents of this file change the groups and variables available within each group. (For Psi devs, this is equivalent to libpsio file and libpsio entry name.) So by editing the file and then re-compiling (make, make check and then make install?), we can edit the entries available to trexio.

Is that all right?

scemama commented 1 year ago

What you say seems correct. Just to be sure: when you are in "developer mode" (you get the library from the GitHub repo, not the tar.gz), running make has Emacs parse an org-mode file and create a JSON file from the tables. Then, this JSON file is read by a Python script to generate the C functions and headers, and the Fortran and Python interfaces. The names of the functions are trexio_write_<group>_<attribute>.

So you can edit the trex.org file to add extra info to the JSON. There are 2 possibilities:

You create at the top of the file, just before the Metadata section a block like:

#+begin_src python :tangle trex.json
   "psi4mp2": {
        "w" : [ "float sparse", [ "mo.num", "mo.num", "mo.num", "mo.num" ]],
        "t1" : [ "float", [ "mo.num", "mo.num" ]]
   },
   "psi4ccsd": {
        "w" : [ "float sparse", [ "mo.num", "mo.num", "mo.num", "mo.num" ]]
   },
#+end_src

Or you create a section in the Org-mode syntax like


* Psi4
This section documents the temporary arrays specific to psi4

** DFMP2 (psi4dfmp2 group)
Here, we specify the data for DFMP2....

\[ t = \sum_{ij} ... \]
\[ W = \sum_{ijab} ... \]

#+NAME: psi4dfmp2
| Variable   | Type           | Dimensions                        | Description                 |
|------------+----------------+-----------------------------------+-----------------------------|
| ~w~        | ~float sparse~ | ~(mo.num,mo.num,mo.num,mo.num)~   | W in the equation above     |
| ~t~        | ~float~        | ~(mo.num,mo.num)~                 | t in the equation above     |

#+CALL: json(data=psi4dfmp2, title="psi4dfmp2")

** CCSD (psi4ccsd group)
Here, we specify the data for CCSD....

#+NAME: psi4ccsd
| Variable   | Type           | Dimensions                        | Description                 |
|------------+----------------+-----------------------------------+-----------------------------|
| ~w~        | ~float sparse~ | ~(mo.num,mo.num,mo.num,mo.num)~   | W in the equation above     |

#+CALL: json(data=psi4ccsd, title="psi4ccsd", last=1)


Now if in Emacs you execute "Ctrl-C Ctrl-C" when your cursor is on the line "CALL:json ...", it will automatically generate the JSON code from the data of the table and put it in the file, similarly to what happens when you are using a Jupyter Notebook and you evaluate a cell.

Note: the `last=1` argument handles the presence/absence of a comma in the generated JSON. So `last=1` should be present only in the very last JSON block of the file. 
When you compile the library, this will generate the functions `trexio_[read|write|has]_psi4mp2_w` and `trexio_[read|write|has]_psi4ccsd_w`.
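To show how a group/attribute table maps onto function names, here is a small Python sketch. It is not the real TREXIO generator: `generated_names` is a hypothetical helper that only reproduces the `trexio_[read|write|has]_<group>_<attribute>` naming pattern quoted above, and the real generator may emit further variants.

```python
def generated_names(spec):
    """List the read/write/has function names implied by a group -> attribute spec."""
    names = []
    for group, attrs in spec.items():
        for attr in attrs:
            for op in ("read", "write", "has"):
                names.append(f"trexio_{op}_{group}_{attr}")
    return names


# Same groups and attributes as the JSON block in the first option.
spec = {
    "psi4mp2": {
        "w": ["float sparse", ["mo.num", "mo.num", "mo.num", "mo.num"]],
        "t1": ["float", ["mo.num", "mo.num"]],
    },
    "psi4ccsd": {
        "w": ["float sparse", ["mo.num", "mo.num", "mo.num", "mo.num"]],
    },
}

names = generated_names(spec)
```

Because the group name is part of every function name, two modules can both define an attribute called `w` without clashing, as long as their group names differ.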

JonathonMisiewicz commented 1 year ago

Great. As a proof-of-concept, I'll see if I can migrate our dfmp2 code over trexio. I have other priorities, but I'm hoping I can get this by month's end.

scemama commented 1 year ago

If you need any help, feel free to ask :-)

JonathonMisiewicz commented 1 year ago

The month proved busier than expected.

As I'm thinking this over, one of our current I/O system's capabilities is letting the user choose the name of the file to read/write. Through the Matrix class, this is part of our end user API. Admittedly, it's an obscure one.

Is there any reasonable chance of trexio supporting this?

q-posev commented 1 year ago

The Psi4 user/developer can choose the name of the trexio file. It is the first argument of the trexio_open function (in C/Fortran) or of the trexio.File class constructor (in Python). So it can be propagated to your I/O back end from the Psithon front end. Does this answer your question? I am not sure I understood it.

One can even manipulate different TREXIO files at the same time if needed (e.g. reading data from different files and aggregating it somehow to produce a new TREXIO file).

JonathonMisiewicz commented 1 year ago

Psi and TREXIO use the word "file" differently, so let me reword.

Right now, a user can do the equivalent of saying to save a matrix as a variable in a group, as long as the group is pre-defined. The user can create a completely new variable in an existing group if they so choose. Is there a way for us to retain that functionality with TREXIO?

q-posev commented 1 year ago

Now I see, thanks. No, this is not possible with the current version of trexio, which is tightly coupled to the corresponding format defined in trex.json. There is no way to write an arbitrary variable unless it is defined in the format (for example, we have internal consistency checks on the sizes of the matrices in order to prevent inconsistent data).
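The kind of size check mentioned here can be sketched in a few lines of Python. This is an illustration under assumptions, not the library's actual code: `check_shape` is a hypothetical helper, and the spec entry follows the `[type, [dimensions]]` layout used in the JSON examples earlier in this thread.

```python
def expected_shape(dims, stored_dims):
    """Resolve symbolic dimensions such as 'mo.num' to their stored integer values."""
    return tuple(stored_dims[d] for d in dims)


def check_shape(name, spec_entry, shape, stored_dims):
    """Raise ValueError if an array's shape disagrees with the format specification."""
    _dtype, dims = spec_entry
    want = expected_shape(dims, stored_dims)
    if tuple(shape) != want:
        raise ValueError(f"{name}: expected shape {want}, got {tuple(shape)}")


stored_dims = {"mo.num": 4}              # dimension already written to the file
entry = ["float", ["mo.num", "mo.num"]]  # spec entry for a two-index array

check_shape("t1", entry, (4, 4), stored_dims)  # consistent, passes silently
```

A check like this is only possible when the dimensions of every variable are declared ahead of time, which is why a variable that is not in the format cannot be validated.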

@scemama We could probably add functionality that allows writing an arbitrary variable in e.g. an "external" group via a generic trexio_write|read_(file, variable-str, datatype-str, size-max). I can implement it easily for the HDF5 back end, but the TEXT one is more tricky.
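For discussion purposes, a string-keyed generic read/write pair could behave roughly like the Python sketch below. Everything in it is an assumption: `write_external` and `read_external` are hypothetical names, the dictionary stands in for the file, and only int/float arrays are handled.

```python
from array import array

# Map the datatype string to a fixed-width binary layout (64-bit int / double).
TYPECODES = {"int": "q", "float": "d"}


def write_external(store, name, datatype, data):
    """Store a numerical array under an arbitrary string name (hypothetical API)."""
    store[name] = (datatype, array(TYPECODES[datatype], data).tobytes())


def read_external(store, name, datatype, size_max):
    """Read back at most size_max elements, checking the requested datatype."""
    saved_type, raw = store[name]
    if saved_type != datatype:
        raise TypeError(f"{name} was written as {saved_type}, not {datatype}")
    values = array(TYPECODES[datatype])
    values.frombytes(raw)
    return list(values[:size_max])


store = {}  # hypothetical stand-in for the "external" group of a file
write_external(store, "W", "float", [1.0, 2.5, -3.0])
first_two = read_external(store, "W", "float", 2)
```

The variable name is plain data here, so adding a new intermediate needs no change to any specification file; the trade-off is that no shape check against the format is possible.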

scemama commented 1 year ago

@q-posev Good idea! I created an issue on TREXIO where we can discuss that: https://github.com/TREX-CoE/trexio/issues/112

JonathonMisiewicz commented 1 year ago

Great. I'll continue on the dfmp2 proof-of-concept, probably over the weekend.

scemama commented 1 year ago

Hi @JonathonMisiewicz, Have you heard about the ESCDF library? It might be better adapted to what you want to do. It is also based on HDF5, but it is more low-level and flexible than TREXIO. See https://th.fhi-berlin.mpg.de/site/uploads/Publications/Oliveira_The_CECAM_electronic.pdf section G page 153. I have never tried it, so I have no opinion on how easy it is to use.

JonathonMisiewicz commented 1 year ago

No, I'm not familiar with it. I'll give it a look, thanks!

I want to improve our TDDFT capabilities a bit before getting back into the I/O problem.

q-posev commented 1 year ago

@JonathonMisiewicz @scemama I finally got some time for the proof-of-concept implementation of generic I/O in TREXIO. Only numerical (int/float) arrays for now and only for the HDF5 back end, but this should be enough to get an idea whether TREXIO is a suitable candidate for the I/O back end of Psi4.

The PR for the associated add-external-group branch is here: https://github.com/TREX-CoE/trexio/pull/117

The API calls are slightly different from the conventional TREXIO API. A few examples are:

Let me know if you have any questions or comments.

psi4 / psi4

Interface with TREXIO #2847
