hspark1212 / MOFTransformer

Universal Transfer Learning in Porous Materials, including MOFs.
https://hspark1212.github.io/MOFTransformer/

Bandgap finetune data #139

Closed · jiali1025 closed this issue 1 year ago

jiali1025 commented 1 year ago

Thanks for these wonderful studies. I wonder whether the json file for the band gap fine-tune data could be provided, since at the provided data link https://figshare.com/articles/dataset/MOFTransformer/21155506?file=37511755 I can only find the data for H2 uptake (100 bar) and H2 diffusivity.

Thanks!

jiali1025 commented 1 year ago

In addition, I find that for PMTransformer, the COF fine-tune dataset is smaller than the original dataset size given in the reference. Were any criteria used to filter the data?

hspark1212 commented 1 year ago

Hi @jiali1025!

Because the fine-tune data for band gap comes from the QMOF database, we decided not to include it due to copyright concerns. I would recommend creating the data json file from the QMOF database on your own.
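For reference, a minimal sketch of building such a json. The file names and the `outputs.pbe.bandgap` field are assumptions about one QMOF release, so check them against your download and match the output format to the provided H2 uptake json:

```python
import json

# Build a {cif_id: target} fine-tune json from the QMOF property file.
# NOTE: "qmof.json", the "outputs.pbe.bandgap" field, and the output file
# name are assumptions -- verify them against your QMOF release and against
# the format of the provided H2 uptake json.
with open("qmof.json") as f:
    qmof_entries = json.load(f)

bandgap = {}
for entry in qmof_entries:
    cif_id = entry["qmof_id"]  # must match the cif file names (qmof_id.cif)
    gap = entry["outputs"]["pbe"]["bandgap"]
    if gap is not None:
        bandgap[cif_id] = gap

with open("raw_bandgap.json", "w") as f:
    json.dump(bandgap, f)
```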

hspark1212 commented 1 year ago

> In addition, I find that for PMTransformer, the COF fine-tune dataset is smaller than the original dataset size given in the reference. Were any criteria used to filter the data?

As you mentioned, we specifically utilized the calculated band gap values from a subset of 400 data points within the CURATED COF database. (The overall database contains a larger number of registered structures, approximately 600 in total.) However, when we obtained the data from the reference group, we were provided with only 400 data points, which included both the structure files (cif files) and the corresponding calculated band gaps. So we didn't apply any filter.

If you have any further questions, please feel free to ask.

jiali1025 commented 1 year ago

> In addition, I find that for PMTransformer, the COF fine-tune dataset is smaller than the original dataset size given in the reference. Were any criteria used to filter the data?
>
> As you mentioned, we specifically utilized the calculated band gap values from a subset of 400 data points within the CURATED COF database. (The overall database contains a larger number of registered structures, approximately 600 in total.) However, when we obtained the data from the reference group, we were provided with only 400 data points, which included both the structure files (cif files) and the corresponding calculated band gaps. So we didn't apply any filter.
>
> If you have any further questions, please feel free to ask.

Thanks for the reply. Actually, I mean the Table 2 results. I got the data from references 41 and 42; they have more than 68,000 structures and properties. However, in Table 2 you wrote that 39,304 are used, and I am not sure how you reduced from 68,000 to 39,304. In addition, I now find the moftransformer pip package cannot be used; I think the pip release is deprecated, which causes a lot of problems.

hspark1212 commented 1 year ago

The decrease in the number of fine-tuning data points shown in Table 2 can be attributed to our pre-processing code, which is implemented in prepare_data.py. When preparing data for the MOFTransformer, we have several constraints to consider, such as the maximum number of atoms in a primitive cell and the lengths of the cell vectors.

To assist us in addressing the issue, it would be beneficial if you could provide details about the specific problems you encountered.
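As an illustration, a pre-check along these lines can be run before prepare_data.py. This is a minimal sketch only; the thresholds are illustrative, and the actual limits live in prepare_data.py:

```python
from pymatgen.core import Structure

# Illustrative thresholds only -- the actual limits are set in prepare_data.py.
MAX_NUM_ATOMS = 1000     # atoms in the primitive cell
MAX_CELL_LENGTH = 60.0   # angstrom

def passes_constraints(cif_path: str) -> bool:
    """Mimic the kind of size constraints applied during data preparation."""
    structure = Structure.from_file(cif_path).get_primitive_structure()
    if len(structure) > MAX_NUM_ATOMS:
        return False
    if max(structure.lattice.abc) > MAX_CELL_LENGTH:
        return False
    return True
```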

jiali1025 commented 1 year ago

> The decrease in the number of fine-tuning data points shown in Table 2 can be attributed to our pre-processing code, which is implemented in prepare_data.py. When preparing data for the MOFTransformer, we have several constraints to consider, such as the maximum number of atoms in a primitive cell and the lengths of the cell vectors.
>
> To assist us in addressing the issue, it would be beneficial if you could provide details about the specific problems you encountered.

Hi, the issue is that pip install does not work for me, but pip3 install is fine.

jiali1025 commented 1 year ago

> The decrease in the number of fine-tuning data points shown in Table 2 can be attributed to our pre-processing code, which is implemented in prepare_data.py. When preparing data for the MOFTransformer, we have several constraints to consider, such as the maximum number of atoms in a primitive cell and the lengths of the cell vectors.
>
> To assist us in addressing the issue, it would be beneficial if you could provide details about the specific problems you encountered.

I am a bit confused about the MOF cif names. Did you remove the "clean" tag from the CoRE MOF cif names? Also, did you replace . with +? Or does the downstream fine-tune data's cif naming differ from the CoRE MOF and QMOF naming conventions?

Yeonghun1675 commented 1 year ago

> The decrease in the number of fine-tuning data points shown in Table 2 can be attributed to our pre-processing code, which is implemented in prepare_data.py. When preparing data for the MOFTransformer, we have several constraints to consider, such as the maximum number of atoms in a primitive cell and the lengths of the cell vectors. To assist us in addressing the issue, it would be beneficial if you could provide details about the specific problems you encountered.
>
> Hi, the issue is that pip install does not work for me, but pip3 install is fine.

Hi, @jiali1025!

We installed it with pip in a fresh virtual environment and it worked fine. Could you please let us know what your Linux environment is, and make sure your default Python is not a 2.XX version?

Yeonghun1675 commented 1 year ago

> The decrease in the number of fine-tuning data points shown in Table 2 can be attributed to our pre-processing code, which is implemented in prepare_data.py. When preparing data for the MOFTransformer, we have several constraints to consider, such as the maximum number of atoms in a primitive cell and the lengths of the cell vectors. To assist us in addressing the issue, it would be beneficial if you could provide details about the specific problems you encountered.
>
> I am a bit confused about the MOF cif names. Did you remove the "clean" tag from the CoRE MOF cif names? Also, did you replace . with +? Or does the downstream fine-tune data's cif naming differ from the CoRE MOF and QMOF naming conventions?

The cif_name is the name of a cif file in root_dataset or root_cifs minus the .cif extension. For example, if your cif file is IRMOF-1.cif, the cif_name is IRMOF-1. Note that the + in the example names comes from hMOF generation (using PORMAKE) and is not related to how cif_name is derived.
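In other words, the json keys can be generated directly from the file names (a minimal sketch; the directory name is illustrative):

```python
from pathlib import Path

# Derive cif_name keys for the fine-tune json from the files in root_cifs.
# The directory name "root_cifs" is illustrative.
cif_names = [p.stem for p in Path("root_cifs").glob("*.cif")]
# e.g. IRMOF-1.cif -> "IRMOF-1"
```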

Yeonghun1675 commented 1 year ago

> The decrease in the number of fine-tuning data points shown in Table 2 can be attributed to our pre-processing code, which is implemented in prepare_data.py. When preparing data for the MOFTransformer, we have several constraints to consider, such as the maximum number of atoms in a primitive cell and the lengths of the cell vectors. To assist us in addressing the issue, it would be beneficial if you could provide details about the specific problems you encountered.
>
> Thanks for your information! On the pip problem: I cannot use pip install, but pip3 install works. Now I have run into another problem after managing to install from source: when I run prepare_data.py, it gets killed after processing about 50 COF structures. I don't know what the problem is. Is it because my computer's resources are not enough? It is very hard to debug since there is no error message. The output is below:
>
> ```
> 0%| | 0/69840 [00:00<?, ?it/s]
> /home/pengfei/anaconda3/envs/mof_trans/lib/python3.9/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
>   warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
> /home/pengfei/anaconda3/envs/mof_trans/lib/python3.9/site-packages/ase/io/cif.py:401: UserWarning: crystal system 'Monoclinic' is not interpreted for space group Spacegroup(1, setting=1). This may result in wrong setting!
>   warnings.warn(
> 0%| | 2/69840 [00:03<35:24:57, 1.83s/it]
> /home/pengfei/anaconda3/envs/mof_trans/lib/python3.9/site-packages/ase/io/cif.py:401: UserWarning: crystal system 'Tetragonal' is not interpreted for space group Spacegroup(1, setting=1). This may result in wrong setting!
>   warnings.warn(
> 0%| | 19/69840 [00:24<18:51:12, 1.03it/s]
> /home/pengfei/anaconda3/envs/mof_trans/lib/python3.9/site-packages/ase/io/cif.py:401: UserWarning: crystal system 'Triclinic' is not interpreted for space group Spacegroup(1, setting=1). This may result in wrong setting!
>   warnings.warn(
> 0%| | 47/69840 [00:52<24:27:04, 1.26s/it]
> Killed
> ```
>
> I think I have fixed the error by adding more swap space. However, this is quite time-consuming; it would be very nice if you could provide pre-embedded COF files! Thanks!

In this case, if a particular cif has a very large structure, or if there is a problem with the structure, it demands a lot of resources during the prepare_data process. We have found that this happens mainly when there are overlapping atoms. If you have such a structure, we recommend running prepare_data without it.
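Such structures can be screened out beforehand, for example with pymatgen. This is a minimal sketch; the 0.5 Å cutoff is an illustrative assumption, not a value taken from MOFTransformer:

```python
import numpy as np
from pymatgen.core import Structure

def has_overlapping_atoms(cif_path: str, cutoff: float = 0.5) -> bool:
    """Flag structures with any pair of atoms closer than `cutoff` angstrom.

    The 0.5 angstrom default is an assumption for illustration only.
    """
    structure = Structure.from_file(cif_path)
    dists = structure.distance_matrix  # pairwise distances, PBC-aware
    np.fill_diagonal(dists, np.inf)    # ignore self-distances
    return bool((dists < cutoff).any())
```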

jiali1025 commented 1 year ago

> The decrease in the number of fine-tuning data points shown in Table 2 can be attributed to our pre-processing code, which is implemented in prepare_data.py. When preparing data for the MOFTransformer, we have several constraints to consider, such as the maximum number of atoms in a primitive cell and the lengths of the cell vectors. To assist us in addressing the issue, it would be beneficial if you could provide details about the specific problems you encountered.
>
> Hi, the issue is that pip install does not work for me, but pip3 install is fine.
>
> Hi, @jiali1025!
>
> We installed it with pip in a fresh virtual environment and it worked fine. Could you please let us know what your Linux environment is, and make sure your default Python is not a 2.XX version?

Thanks so much for your prompt reply! My computer is now preparing the data for the COFs; I will try to reproduce the error for you after it finishes running. It may be due to my university's VPN.

jiali1025 commented 1 year ago

> The decrease in the number of fine-tuning data points shown in Table 2 can be attributed to our pre-processing code, which is implemented in prepare_data.py. When preparing data for the MOFTransformer, we have several constraints to consider, such as the maximum number of atoms in a primitive cell and the lengths of the cell vectors. To assist us in addressing the issue, it would be beneficial if you could provide details about the specific problems you encountered.
>
> I am a bit confused about the MOF cif names. Did you remove the "clean" tag from the CoRE MOF cif names? Also, did you replace . with +? Or does the downstream fine-tune data's cif naming differ from the CoRE MOF and QMOF naming conventions?
>
> The cif_name is the name of a cif file in root_dataset or root_cifs minus the .cif extension. For example, if your cif file is IRMOF-1.cif, the cif_name is IRMOF-1. Note that the + in the example names comes from hMOF generation (using PORMAKE) and is not related to how cif_name is derived.

Thanks so much! I now fully understand the dataset: you have structure data from several databases and pre-embed them in separate folders with separate json files. I just want to avoid storing a lot of duplicated pre-embeddings, as they are quite large.

jiali1025 commented 1 year ago

> The decrease in the number of fine-tuning data points shown in Table 2 can be attributed to our pre-processing code, which is implemented in prepare_data.py. When preparing data for the MOFTransformer, we have several constraints to consider, such as the maximum number of atoms in a primitive cell and the lengths of the cell vectors. To assist us in addressing the issue, it would be beneficial if you could provide details about the specific problems you encountered.
>
> Thanks for your information! On the pip problem: I cannot use pip install, but pip3 install works. Now I have run into another problem after managing to install from source: when I run prepare_data.py, it gets killed after processing about 50 COF structures. I don't know what the problem is. Is it because my computer's resources are not enough? It is very hard to debug since there is no error message. The output is below:
>
> ```
> 0%| | 0/69840 [00:00<?, ?it/s]
> /home/pengfei/anaconda3/envs/mof_trans/lib/python3.9/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
>   warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
> /home/pengfei/anaconda3/envs/mof_trans/lib/python3.9/site-packages/ase/io/cif.py:401: UserWarning: crystal system 'Monoclinic' is not interpreted for space group Spacegroup(1, setting=1). This may result in wrong setting!
>   warnings.warn(
> 0%| | 2/69840 [00:03<35:24:57, 1.83s/it]
> /home/pengfei/anaconda3/envs/mof_trans/lib/python3.9/site-packages/ase/io/cif.py:401: UserWarning: crystal system 'Tetragonal' is not interpreted for space group Spacegroup(1, setting=1). This may result in wrong setting!
>   warnings.warn(
> 0%| | 19/69840 [00:24<18:51:12, 1.03it/s]
> /home/pengfei/anaconda3/envs/mof_trans/lib/python3.9/site-packages/ase/io/cif.py:401: UserWarning: crystal system 'Triclinic' is not interpreted for space group Spacegroup(1, setting=1). This may result in wrong setting!
>   warnings.warn(
> 0%| | 47/69840 [00:52<24:27:04, 1.26s/it]
> Killed
> ```
>
> I think I have fixed the error by adding more swap space. However, this is quite time-consuming; it would be very nice if you could provide pre-embedded COF files! Thanks!
>
> In this case, if a particular cif has a very large structure, or if there is a problem with the structure, it demands a lot of resources during the prepare_data process. We have found that this happens mainly when there are overlapping atoms. If you have such a structure, we recommend running prepare_data without it.

Thanks! I have increased my swap space to 50 GB and it runs now.

hspark1212 commented 1 year ago

Nice to hear that it works well! Good luck!

jiali1025 commented 1 year ago

> Nice to hear that it works well! Good luck!

Hi, sorry to bother you again. I found that after processing about 4k COFs it hits an OOM problem again. Maybe there are some pathological structures. Would it be possible for you to share the code to filter out these structures? Thanks!

hspark1212 commented 1 year ago

Hi @jiali1025, can you tell me when you encountered the OOM problem: while running prepare_data.py or while running run.py?

When we ran prepare_data.py, the OOM error didn't occur.

jiali1025 commented 1 year ago

> Hi @jiali1025, can you tell me when you encountered the OOM problem: while running prepare_data.py or while running run.py?
>
> When we ran prepare_data.py, the OOM error didn't occur.

I mean prepare_data.py. This OOM is not a CUDA OOM; it is the RAM running out. As you mentioned, I think some of the COFs have a lot of atoms or are otherwise strange. However, I am not sure how you filtered these structures out, since I can get about 70,000 structures from the reference you provided but you only use about 40,000. I just want to align better with your studies.

Many thanks!

hspark1212 commented 1 year ago

Thank you for your comment!

We just dropped the COF structures for which prepare_data.py failed. So we didn't apply any filter beyond prepare_data.py itself.
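One way to reproduce this "drop whatever fails" strategy is to process one cif at a time in a child process, so that a crash or an OOM kill only loses that structure. This is a minimal sketch: run_prepare.py is a hypothetical one-cif driver around prepare_data, and the paths and timeout are illustrative:

```python
import shutil
import subprocess
from pathlib import Path

# Process one cif per child process so an OOM kill only loses that structure.
# "run_prepare.py" is a hypothetical one-cif driver around prepare_data;
# directory names and the timeout are illustrative.
src = Path("all_cifs")
failed = []
for cif in sorted(src.glob("*.cif")):
    tmp = Path("tmp_cif")
    shutil.rmtree(tmp, ignore_errors=True)
    tmp.mkdir()
    shutil.copy(cif, tmp / cif.name)
    try:
        result = subprocess.run(
            ["python", "run_prepare.py", str(tmp), "root_dataset"],
            timeout=600,  # give up on pathological structures
        )
        ok = result.returncode == 0
    except subprocess.TimeoutExpired:
        ok = False
    if not ok:  # crashed, killed by the OOM killer, or timed out
        failed.append(cif.name)

print(f"skipped {len(failed)} structures")
```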

jiali1025 commented 1 year ago

> Thank you for your comment!
>
> We just dropped the COF structures for which prepare_data.py failed. So we didn't apply any filter beyond prepare_data.py itself.

Thanks for the information. Could you let me know your hardware requirements? Does prepare_data.py accumulate data in RAM?

Yeonghun1675 commented 1 year ago

Hi @jiali1025, prepare_data does not accumulate data in RAM as it runs. The most demanding part is probably the GRIDAY program that generates the energy grid data (it is written in C++). When you run prepare_data you are left with an energy grid log, so you can check which cif file was being processed when the OOM happened (probably the one right after the last success, in order).

We used to have code that limited the total number of atoms, but we dropped it when we switched to PMTransformer. If you look at the file that triggers the OOM and a high atom count is the cause, try pre-filtering by atom count before running prepare_data, for example as sketched below.
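A minimal sketch of such a pre-filter with ASE (the 1,000-atom threshold and the directory name are assumptions, not values from MOFTransformer):

```python
from pathlib import Path
from ase.io import read

# Pre-filter cifs by atom count before running prepare_data.
# MAX_ATOMS and the directory name are illustrative assumptions.
MAX_ATOMS = 1000

kept, dropped = [], []
for cif in Path("all_cifs").glob("*.cif"):
    try:
        n_atoms = len(read(cif))
    except Exception:
        dropped.append(cif.name)  # unparsable cif
        continue
    if n_atoms <= MAX_ATOMS:
        kept.append(cif.name)
    else:
        dropped.append(cif.name)

print(f"kept {len(kept)}, dropped {len(dropped)}")
```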