Open pfebrer opened 1 year ago
By the way, I could do it myself, I would just need to know whether it makes sense and some pointers maybe on things to take into account for the implementation.
Hi Pol.
The `data/ion.py` module in this repo implements `IonData`, which is not only a `Data` class (it can be stored and so on) but also derives directly from `SinglefileData`, which gives access to various functionalities for managing files.
Following a similar path would solve the basic task of making a data class for the file containing sparse matrices. However, to be more useful, this class should also expose as attributes all the quantities that will be important for data queries. In `IonData` we implemented `element`, `name` and `atomic_number` as attributes. You can implement as many as you wish, and use `sisl` to extract the quantities you want.
Hope this helps. Write back if you have any doubts.
Hi, thanks Emanuele!
The point is that I don't think there is any advantage in keeping it as a file, since the matrix and grid files that SIESTA calculations output are basically arrays of floats. Storing them in the database as arrays instead of files should be much better for any task that you can imagine, no?
`SinglefileData` seems like a good patch for files that have no simple representation in Python (or whose representation has not been implemented yet), but for other things it seems unnecessarily obscure. E.g. you could also store output energies as `SinglefileData`, but this would make everything worse.
Ok, then. Just make sure that the amount of data in the database is not too big.
Regarding the implementation, you can make a `Data` class that derives from `ArrayData` and uses `sisl` to parse the data. You can also implement methods of the class using `sisl`. It should be very easy.
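As a rough illustration of why an array-based node fits sparse matrices: in compressed sparse row (CSR) form, a sparse matrix reduces to a few flat arrays plus a shape, and each of those could in principle be stored as a named array. The sketch below uses plain Python lists only; the function names `dense_to_csr` and `csr_to_dense` are hypothetical, and this is neither the `sisl` nor the AiiDA API.

```python
def dense_to_csr(rows):
    """Convert a dense matrix (list of lists) into flat CSR arrays."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # non-zero values, row by row
                indices.append(col)   # column index of each stored value
        indptr.append(len(data))      # where the next row starts in `data`
    n_cols = len(rows[0]) if rows else 0
    return {"data": data, "indices": indices, "indptr": indptr,
            "shape": [len(rows), n_cols]}


def csr_to_dense(csr):
    """Rebuild the dense matrix from the flat CSR arrays."""
    n_rows, n_cols = csr["shape"]
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for k in range(csr["indptr"][i], csr["indptr"][i + 1]):
            dense[i][csr["indices"][k]] = csr["data"][k]
    return dense


matrix = [[0.0, 1.5, 0.0],
          [0.0, 0.0, 0.0],
          [2.0, 0.0, 3.5]]
csr = dense_to_csr(matrix)
restored = csr_to_dense(csr)
```

Only the three flat arrays and the shape need to be persisted; the dense form can be rebuilt on demand.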
> Just make sure that the amount of data in the database is not too big.

So `SinglefileData` does not store the data inside the database?
The database will inevitably get large if you store the matrices. I don't mean to make it a compulsory output, just an optional thing. If you need the matrices of course you have to pay this price. The other alternative is to store the matrices in directories, which also occupies a lot of space. I was thinking of Aiida for this task to have everything in a compact database, which then is much easier to copy and move around (instead of having to zip a directory with thousands of runs).
Or are there problems inherently related to the size of the database?
AiiDA stores data in two places, a PostgreSQL database and a disk-objectstore repository.
Read here: https://aiida.readthedocs.io/projects/aiida-core/en/latest/internals/storage/psql_dos.html#internal-architecture-storage-psql-dos
Any data defined as an attribute of your `Data` class is stored in the PostgreSQL database (the data must be JSON-serializable). `SinglefileData`, instead, helps you store files in the disk-objectstore repository.
The data (both in the database and in the repository) are always accessible from AiiDA. The only difference is that the query system implemented in AiiDA makes it much easier to build queries on data in the database. To put it even more simply: it is very easy to find in the database a `Data` node that has `name` as the value of attribute `key` and `name_2` as the value of `key_2`, while it is more difficult to query for `Data` nodes whose associated file contains the words `blablabla` and `blablabla2`.
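The difference can be pictured with a toy model. This is only a sketch using the standard-library `sqlite3` module: the table layout, the `repository` dictionary and both function names are made up for illustration and do not reflect AiiDA's actual schema or its query API. Attribute lookups are a single indexed-style query, whereas a content search has to read every stored file.

```python
import sqlite3

# Toy "database": one row per (node, attribute key, attribute value).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attributes (node_id INTEGER, key TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO attributes VALUES (?, ?, ?)",
    [(1, "element", "Si"), (1, "name", "Si_pbe"),
     (2, "element", "O"), (2, "name", "O_pbe")],
)

# Toy "repository": opaque file contents keyed by node id.
repository = {
    1: b"ion file for silicon",
    2: b"ion file for oxygen",
}


def find_by_attribute(key, value):
    """Query the database directly: cheap and expressive."""
    rows = conn.execute(
        "SELECT node_id FROM attributes WHERE key = ? AND value = ? "
        "ORDER BY node_id",
        (key, value),
    )
    return [row[0] for row in rows]


def find_by_file_content(word):
    """Scan every stored file: the database cannot help here."""
    return sorted(nid for nid, content in repository.items() if word in content)


silicon_nodes = find_by_attribute("element", "Si")
oxygen_nodes = find_by_file_content(b"oxygen")
```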
Apart from that, to my knowledge, there is no other relevant point to consider when choosing whether particular data should go in the repository or in the database, except of course that the disk-objectstore has been designed exactly to host big files, so I imagine that for performance reasons big data should go there.
However we can ask @chrisjsewell for additional info if required.
Regarding the way the data will be exported, you can read here: https://aiida.readthedocs.io/projects/aiida-core/en/latest/internals/storage/sqlite_zip.html#internal-architecture-storage-sqlite-zip
I'm starting to understand how things work. I see that the `SiestaCalculation` class specifies which files should be retrieved here: https://github.com/siesta-project/aiida_siesta_plugin/blob/76d04250504638d8edc88a2f915690af9ff3796b/aiida_siesta/calculations/siesta.py#L700-L720
So I should add a line to retrieve the DM, right?
From the `aiida` documentation, I understand that AiiDA retrieves all these files and stores them in the local repository. Then, parsing the data from files is implemented here: https://github.com/siesta-project/aiida_siesta_plugin/blob/develop/aiida_siesta/parsers/siesta.py, so this is where I should implement the parsing, right?
My question is: if the files have already been stored in the repository as outputs of the `CalcNode`, and then you go and parse them to create a parsed output (e.g. you parse the `siesta.bands` file to create a `BandsData` node as output), isn't the data then stored twice? (I have to say I don't understand at which point the parser is run.)
By the way, I realized that some things in the parser are already implemented in `sisl`, so the code could be simplified by using `sisl`.
> However we can ask @chrisjsewell for additional info if required.

Heya, just a quick note: it's better to store it in the repository, i.e. as a binary object, which will get loaded by the data node, since PostgreSQL is not efficient at storing such objects. In theory, AiiDA is designed to have pluggable storage backends, so you could think of a more efficient storage technology for such arrays. But yeah, that is not something currently available.
Thanks! I already saw that `ArrayData` actually stores `npy` files in the repository.
I would actually then just store the file without parsing, if it weren't for the fact that it's a Fortran unformatted binary file. `sisl` parses these files using Fortran directly, which can only be done by providing a file path; I can't use the byte stream that AiiDA gives you. Is the only solution to copy the contents to a temporary file, or can you think of a better solution?
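For reference, the temporary-file fallback mentioned here can be sketched with the standard library alone. The helper `stream_as_path` and the stand-in parser `parse_from_path` are hypothetical names, not part of AiiDA or `sisl`; the `.DM` suffix is only an example. Note that every access pays for a full copy of the file.

```python
import io
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def stream_as_path(stream, suffix=""):
    """Copy a binary stream to a temporary file and yield its path.

    The file is removed when the context exits, even on error.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            shutil.copyfileobj(stream, handle)
        yield path
    finally:
        os.remove(path)


def parse_from_path(path):
    """Hypothetical stand-in for a parser that only accepts a file path."""
    with open(path, "rb") as handle:
        return len(handle.read())


# Stand-in for the byte stream handed back by the storage layer.
payload = io.BytesIO(b"\x00\x01\x02\x03")

with stream_as_path(payload, suffix=".DM") as tmp_path:
    n_bytes = parse_from_path(tmp_path)
    existed_inside = os.path.exists(tmp_path)
```

The context manager keeps the cleanup in one place, so the path-based parser never needs to know the data came from a stream.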
> So I should add a line to retrieve the DM, right?

Yes.

> From the `aiida` documentation, I understand that AiiDA retrieves all these files and stores them in the local repository. Then, parsing the data from files is implemented here: https://github.com/siesta-project/aiida_siesta_plugin/blob/develop/aiida_siesta/parsers/siesta.py, so this is where I should implement the parsing, right?

Correct.

> My question is: if the files have already been stored in the repository as outputs of the `CalcNode`, and then you go and parse them to create a parsed output (e.g. you parse the `siesta.bands` file to create a `BandsData` node as output), isn't the data then stored twice? (I have to say I don't understand at which point the parser is run.)

I do not remember whether the 'retrieved' files are also stored in the repository. I would say no, but maybe I'm wrong. In any case, the point of parsing the data is to create the `Data` objects. As I explained, usually they are simple data in the database. Only when using `SinglefileData` (or its child classes) do we associate the `Data` node directly with a file. For instance, I believe that `BandsData` nodes are not associated with any file. The data are stored in arrays, and `BandsData` gives you possibilities that are very difficult to reach starting only from the siesta .bands file. In fact, it explicitly stores all the kpoints associated with the bands and allows easy plotting, without forgetting the HUGE benefit of interoperability among different DFT codes.

> By the way, I realized that some things in the parser are already implemented in `sisl`, so the code could be simplified by using `sisl`.

It might be that that parser came before sisl existed. Feel free to open a PR with simplifications if you feel it is useful AND you are sure that sisl is the direction where Siesta is going (did you convince at least a few more people to embrace sisl? :smile: )
> Thanks! I already saw that `ArrayData` actually stores `npy` files in the repository.
> I would actually then just store the file without parsing, if it weren't for the fact that it's a Fortran unformatted binary file. `sisl` parses these files using Fortran directly, which can only be done by providing a file path; I can't use the byte stream that AiiDA gives you. Is the only solution to copy the contents to a temporary file, or can you think of a better solution?

I believe you can just store the file, using `SinglefileData`. Returning the content is not mandatory, and you can implement other things. Just make a child class of `SinglefileData` and implement whatever you want! I'm sure you will realize that some things can be improved on the sisl side too, but since you can modify both codes, I would consider this an easy task, unless there is something I'm missing.
> It might be that that parser came before sisl existed.

True haha

> Just make a child class of `SinglefileData` and implement whatever you want! I'm sure you will realize that some things can be improved on the sisl side too, but since you can modify both codes, I would consider this an easy task, unless there is something I'm missing.

It is not so easy, because these files are read from Fortran, and Fortran needs to read from a file system (you must provide a path). I have not found any example on the internet of interaction between a Python `BytesIO` and Fortran. Copying the contents to a temporary file is not a good option because I need fast access to this data 😅
Ok, so two points to understand for me:
- [...] `SinglefileData` and you are done.
- [...] `SinglefileData`? I know that the standard AiiDA way returns a Python `BytesIO`, but probably there is some hack to access the file location directly, no? @chrisjsewell maybe can help.

> As a result, the files cannot be accessed directly using file system tools, despite the fact that they are stored somewhere on the local file system. Instead, you should interact with the repository through the API.
Since AiiDA v2.5.0 it is now possible to get a path to files stored in the repository, see https://aiida.readthedocs.io/projects/aiida-core/en/stable/reference/_changelog.html#repository-interface-improvements . Would this maybe help your use case @pfebrer ? I have not yet installed this version and tried it myself so I don't know what kind of filepath you get and if it can be parsed by the Fortran parsers from sisl.
Hmm yeah it looks like it could work, I don't have the time to try now though :sweat_smile:
I would like to create an AiiDA database where the most important things are the sparse matrices of each SIESTA calculation. I guess for now these are kept as files, but this makes it more difficult to interact with them. I would like them to be integrated as calculation outputs.
With `sisl`, it is very easy to convert those files into Python data. I have seen that there is no built-in data type for sparse matrices in AiiDA, and I have also found no plugin that includes one (?). So I guess we would have to create a data type. Mimicking the `sisl` sparse matrices would be the most convenient, I would say.
The grids can also be read with `sisl`, and I guess those could be stored as `ArrayData`?