Grid and matrices output #124

pfebrer commented 1 year ago

I would like to create an AiiDA database where the most important things are the sparse matrices of each SIESTA calculation. I guess for now these are kept as files, but this makes it more difficult to interact with them. I would like them to be integrated as calculation outputs.

With sisl, it is very easy to convert those files into Python data. I have seen that there is no built-in data type for sparse matrices in AiiDA, and I have also found no plugin that includes one (?). So I guess we would have to create a data type. Mimicking the sisl sparse matrices would be the most convenient, I would say.

The grids can also be read from sisl and I guess those could be stored as ArrayData?
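For readers unfamiliar with why a sparse matrix maps naturally onto plain arrays, here is a minimal sketch of the compressed sparse row (CSR) idea in pure Python. It only illustrates the general technique; the function names are hypothetical, and sisl's own sparse matrix classes may lay out their arrays differently.

```python
def dense_to_csr(rows):
    """Return (data, indices, indptr) for a dense matrix given as a list of rows."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # non-zero value
                indices.append(col)   # its column index
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr


def csr_to_dense(data, indices, indptr, ncols):
    """Rebuild the dense list-of-rows matrix from the three CSR arrays."""
    rows = []
    for i in range(len(indptr) - 1):
        row = [0.0] * ncols
        for k in range(indptr[i], indptr[i + 1]):
            row[indices[k]] = data[k]
        rows.append(row)
    return rows


matrix = [
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 2.5],
    [0.0, 3.0, 0.0],
]
data, indices, indptr = dense_to_csr(matrix)
restored = csr_to_dense(data, indices, indptr, ncols=3)
```

The three flat arrays are all that needs to be stored, which is why an array-based data type is a plausible home for such matrices.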

pfebrer commented 1 year ago

By the way, I could do it myself, I would just need to know whether it makes sense and some pointers maybe on things to take into account for the implementation.

bosonie commented 1 year ago

Hi Pol. The data/ion.py in this repo implements IonData, which is not only a Data class (it can be stored and so on) but also derives directly from SinglefileData, which gives access to various functionalities for managing files. Following a similar path would solve the basic task of making a data class for the file containing sparse matrices. However, in order to be more useful, this class should also implement, as attributes, all the quantities that will be important for data queries. In IonData we implemented element, name and atomic_number as attributes. You can implement as many as you wish, and use sisl to extract the quantities you want. Hope this helps. Write back if you have any doubts.

pfebrer commented 1 year ago

Hi, thanks Emanuele!

The point is that I don't think there is any advantage in keeping it as a file, since the matrix and grid files that come as outputs from SIESTA calculations are basically arrays of floats. Storing them in the database as arrays instead of files should be much better for any task that you can imagine, no?

SinglefileData seems like a good patch for files that have no simple representation in Python (or whose representation has not been implemented yet), but for other things it seems unnecessarily obscure. E.g. you could also store output energies as SinglefileData, but this would make everything worse.

bosonie commented 1 year ago

Ok, then. Just make sure that the amount of data in the database is not too big. Regarding the implementation, you can make a Data class that derives from ArrayData, and uses sisl to parse the data. You can also implement methods of the class using sisl. It should be very easy.

pfebrer commented 1 year ago

> Just make sure that the amount of data in the database is not too big.

So SinglefileData does not store the data inside the database?

The database will inevitably get large if you store the matrices. I don't mean to make it a compulsory output, just an optional thing. If you need the matrices of course you have to pay this price. The other alternative is to store the matrices in directories, which also occupies a lot of space. I was thinking of Aiida for this task to have everything in a compact database, which then is much easier to copy and move around (instead of having to zip a directory with thousands of runs).

Or are there problems inherently related to the size of the database?

bosonie commented 1 year ago

AiiDA stores data in two places: a PostgreSQL database and a disk-objectstore repository. Read here: https://aiida.readthedocs.io/projects/aiida-core/en/latest/internals/storage/psql_dos.html#internal-architecture-storage-psql-dos

Any data that is defined as an attribute of your Data class is stored in the PostgreSQL database (the data must be JSON-serializable). SinglefileData, instead, helps you store files in the disk-objectstore repository. The data (both in the database and in the repository) are always accessible from AiiDA. The only difference is that the query system implemented in AiiDA makes it much easier to build queries on data in the database. To put it even more simply: it is very easy to find in the database a Data node that has name as the value of attribute key and name_2 as the value of key_2, while it is more difficult to query for Data nodes whose associated file contains the words blablabla and blablabla2.

Apart from that, to my knowledge, there is no other relevant point to consider when choosing whether particular data should go in the repository or in the database, except that the disk-objectstore has been designed exactly to host big files, so I imagine that for performance reasons big data should go there. However, we can ask @chrisjsewell for additional info if required.

Regarding the way the data will be exported, you can read here: https://aiida.readthedocs.io/projects/aiida-core/en/latest/internals/storage/sqlite_zip.html#internal-architecture-storage-sqlite-zip
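The split described above (small, queryable attributes in the database; bulk data in the repository) can be sketched without AiiDA at all. The snippet below is only an illustration under stated assumptions: the helper name and attribute keys are hypothetical, and the raw little-endian float64 payload is just one possible binary encoding, not what any AiiDA data class actually writes.

```python
import json
import struct


def split_for_storage(shape, values):
    """Separate a matrix into JSON-safe metadata and a binary payload."""
    # Small summary that could live as node attributes and be queried.
    attributes = {"shape": list(shape), "nnz": len(values), "dtype": "float64"}
    json.dumps(attributes)  # raises TypeError if something is not JSON-serializable

    # Bulk numbers packed as little-endian doubles, suitable for a file-like store.
    payload = struct.pack(f"<{len(values)}d", *values)
    return attributes, payload


attributes, payload = split_for_storage((3, 3), [1.0, 2.5, 3.0])
roundtrip = json.loads(json.dumps(attributes))
unpacked = struct.unpack("<3d", payload)
```

With such a split, queries like "all matrices with more than N non-zero elements" would only touch the attributes, while the payload is read only when the numbers themselves are needed.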

pfebrer commented 1 year ago

I'm starting to understand how things work. I see that the SiestaCalculation class instructs which files should be retrieved here: https://github.com/siesta-project/aiida_siesta_plugin/blob/76d04250504638d8edc88a2f915690af9ff3796b/aiida_siesta/calculations/siesta.py#L700-L720

So I should add a line to retrieve the DM, right?

From the AiiDA documentation, I understand that AiiDA retrieves all these files and stores them in the local repository. Then, parsing the data from files is implemented here: https://github.com/siesta-project/aiida_siesta_plugin/blob/develop/aiida_siesta/parsers/siesta.py, so this is where I should implement the parsing, right?

My question is: If the files have already been stored to the repository as outputs of the CalcNode and then you go and parse them to create a parsed output (e.g. you parse the siesta.bands file to create a BandsData node as output) isn't the data then stored twice? (I have to say I don't understand at which point the parser is run).

By the way I realized that some things in the parser are already implemented in sisl, so the code could be simplified by using sisl.

chrisjsewell commented 1 year ago

> However we can ask @chrisjsewell for additional info if required.

Heya, just a quick note: it's better to store it in the repository, i.e. as a binary object, which will get loaded by the data node, since PostgreSQL is not efficient at storing such objects. In theory, AiiDA is designed to have pluggable storage backends, so you could think of a more efficient storage technology for such arrays, but yeah, that is not something currently available.

pfebrer commented 1 year ago

Thanks! I already saw that ArrayData actually stores npy files in the repository.

I would then actually just store the file without parsing, if it weren't a Fortran unformatted binary file. sisl parses these files using Fortran directly, which can only be done by providing a file path; I can't use the byte stream that AiiDA gives you. Is copying the contents to a temporary file the only solution, or can you think of a better one?
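For concreteness, the temporary-file workaround mentioned here could look roughly like the sketch below. It uses only the standard library; `call_with_temp_path` and `fake_path_only_reader` are hypothetical names, and the fake reader merely stands in for a path-only parser such as the Fortran-backed ones in sisl. The original file name is kept on the assumption that the reader may pick its parser from the file extension.

```python
import io
import shutil
import tempfile
from pathlib import Path


def call_with_temp_path(stream, reader, filename="siesta.DM"):
    """Copy a binary stream to a temporary file and call `reader(path)` on it."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = Path(tmpdir) / filename
        with open(path, "wb") as handle:
            shutil.copyfileobj(stream, handle)  # chunked copy, no full read in memory
        # The reader must finish with the file before the directory is cleaned up.
        return reader(str(path))


def fake_path_only_reader(path):
    """Stand-in for a parser that only accepts a file path."""
    with open(path, "rb") as handle:
        return len(handle.read())


n_bytes = call_with_temp_path(io.BytesIO(b"\x00" * 16), fake_path_only_reader)
```

The cost is one extra copy of the file per read, which is exactly the overhead being questioned in this comment.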

bosonie commented 1 year ago

> So I should add a line to retrieve the DM, right?

Yes

> From the AiiDA documentation, I understand that AiiDA retrieves all these files and stores them in the local repository. Then, parsing the data from files is implemented here: https://github.com/siesta-project/aiida_siesta_plugin/blob/develop/aiida_siesta/parsers/siesta.py, so this is where I should implement the parsing, right?

Correct

> My question is: If the files have already been stored to the repository as outputs of the CalcNode and then you go and parse them to create a parsed output (e.g. you parse the siesta.bands file to create a BandsData node as output) isn't the data then stored twice? (I have to say I don't understand at which point the parser is run).

I do not remember if the 'retrieved' files are also stored in the repository. I would say no, but maybe I'm wrong. In any case, the point of parsing the data is to create the Data objects. As I explained, usually they are simple data in the database. Only by using SinglefileData (or its child classes) do we associate the Data node directly with a file. For instance, I believe that BandsData is not associated with any file. The data are stored in arrays, and BandsData gives you possibilities that are very difficult to reach starting only from the SIESTA .bands file. In fact, it explicitly stores all the k-points associated with the bands and allows easy plotting, not to mention the HUGE benefit of interoperability among different DFT codes.

> By the way I realized that some things in the parser are already implemented in sisl, so the code could be simplified by using sisl.

It might be that that parser came before sisl existed. Feel free to open a PR with simplifications if you feel it is useful AND you are sure that sisl is the direction where SIESTA is going (did you convince at least a few more people to embrace sisl? :smile: )

bosonie commented 1 year ago

> Thanks! I already saw that ArrayData actually stores npy files in the repository.
>
> I would then actually just store the file without parsing, if it weren't a Fortran unformatted binary file. sisl parses these files using Fortran directly, which can only be done by providing a file path; I can't use the byte stream that AiiDA gives you. Is copying the contents to a temporary file the only solution, or can you think of a better one?

I believe you can just store the file, using SinglefileData. Returning the content is not mandatory, and you can implement other things. Just make a child class of SinglefileData and implement whatever you want! I'm sure you will realize that some things can be improved on the sisl side too, but since you can modify both codes, I would consider this an easy task, unless there is something I'm missing.

pfebrer commented 1 year ago

> It might be that that parser came before sisl existed.

True haha

> Just make a child class of SinglefileData and implement whatever you want! I'm sure you will realize that some things can be improved on the sisl side too, but since you can modify both codes, I would consider this an easy task, unless there is something I'm missing.

It is not so easy, because these files are read from Fortran, and Fortran needs to read from a file system (you must provide a path). I have not found any example on the internet of interaction between a Python BytesIO and Fortran. Copying the contents to a temporary file is not a good option because I need fast access to this data 😅
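As a point of reference, a Fortran unformatted sequential file can in principle be read from a byte stream in pure Python, without going through Fortran. The sketch below assumes the common convention of a 4-byte little-endian record-length marker before and after each record; this is compiler- and platform-dependent, so treat it as an assumption. The record contents are made up for the example and do not reproduce the actual layout of any SIESTA file, and this is not how sisl reads these files.

```python
import io
import struct


def write_record(stream, payload, marker="<i"):
    """Write one record framed by leading and trailing length markers."""
    head = struct.pack(marker, len(payload))
    stream.write(head + payload + head)


def read_record(stream, marker="<i"):
    """Read one framed record; return None at a clean end of stream."""
    size = struct.calcsize(marker)
    head = stream.read(size)
    if not head:
        return None
    if len(head) < size:
        raise ValueError("truncated record marker")
    (length,) = struct.unpack(marker, head)
    payload = stream.read(length)
    tail = stream.read(size)
    if len(payload) != length or len(tail) != size:
        raise ValueError("truncated record")
    if struct.unpack(marker, tail)[0] != length:
        raise ValueError("leading and trailing markers disagree")
    return payload


# Build a tiny two-record stream in memory, then read it back.
buffer = io.BytesIO()
write_record(buffer, struct.pack("<2i", 4, 2))           # e.g. two integer dimensions
write_record(buffer, struct.pack("<3d", 0.5, 1.5, 2.5))  # e.g. three float64 values
buffer.seek(0)

dims = struct.unpack("<2i", read_record(buffer))
values = struct.unpack("<3d", read_record(buffer))
end = read_record(buffer)
```

Whether this is fast enough, or worth maintaining next to the existing Fortran readers, is a separate question that the thread does not settle.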

bosonie commented 1 year ago

Ok, so two points for me to understand:

1. Can you save in the AiiDA repository the binary files produced by SIESTA? I guess yes. In that case, the 'parsing' part should be solved. Just make a SinglefileData and you are done.
2. How do you access the content of this SinglefileData? I know that the standard AiiDA way returns a Python BytesIO, but probably there is some hack to access the file location directly, no? Maybe @chrisjsewell can help.

pfebrer commented 1 year ago

1. Yes
2. Judging by AiiDA's documentation, there isn't:

> As a result, the files cannot be accessed directly using file system tools, despite the fact that they are stored somewhere on the local file system. Instead, you should interact with the repository through the API.

ahkole commented 6 months ago

Since AiiDA v2.5.0 it is now possible to get a path to files stored in the repository, see https://aiida.readthedocs.io/projects/aiida-core/en/stable/reference/_changelog.html#repository-interface-improvements . Would this maybe help your use case @pfebrer ? I have not yet installed this version and tried it myself so I don't know what kind of filepath you get and if it can be parsed by the Fortran parsers from sisl.
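A rough sketch of how this could be used, assuming the node exposes an `as_path()` context manager that yields a filesystem path, as the linked changelog describes for SinglefileData. The helper `read_via_path` and the `FakeSinglefileNode` stand-in are hypothetical and exist only so the snippet runs without an AiiDA profile; it has not been tried against a real AiiDA installation or with sisl's Fortran readers.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


def read_via_path(node, reader):
    """Hand a real filesystem path for the node's file to a path-only reader."""
    # Assumption: `node.as_path()` yields a pathlib.Path valid inside the block.
    with node.as_path() as filepath:
        return reader(str(filepath))


class FakeSinglefileNode:
    """Hypothetical stand-in that mimics only the assumed `as_path()` behaviour."""

    def __init__(self, content, filename):
        self._content = content
        self._filename = filename

    @contextmanager
    def as_path(self):
        with tempfile.TemporaryDirectory() as tmpdir:
            path = Path(tmpdir) / self._filename
            path.write_bytes(self._content)
            yield path


def file_size(path):
    """Stand-in for a reader that needs a path, such as a Fortran-backed parser."""
    return Path(path).stat().st_size


size = read_via_path(FakeSinglefileNode(b"abcd", "siesta.DM"), file_size)
```

If the real `as_path()` copies the content to a temporary location, as the stand-in does, the speed concern raised earlier in the thread would still apply.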

pfebrer commented 6 months ago

Hmm yeah it looks like it could work, I don't have the time to try now though :sweat_smile:
