Some EMD IDs in the train split do not exist

jianlin-cheng / Cryo2Struct

Deep learning tools for converting cryo-EM density maps to protein structures

MIT License


Some EMD IDs in the train split do not exist #2

Open RodenLuo opened 5 months ago

RodenLuo commented 5 months ago

Hi,

I downloaded the splits info from here and the full dataset from here.

Some EMD IDs (full list below) exist in the train split but not in the dataset. Did I make a mistake, or were those removed later on?

Thanks

RodenLuo commented 5 months ago

Similarly, for valid:

Also, I could not find any of the test split IDs in the dataset.

nabingiri commented 5 months ago

Hello RodenLuo, I have updated the 'metadata' file here. Could you please use it?

RodenLuo commented 4 months ago

Hi Nabin, Sorry for the late reply. Was traveling to several conferences.

The problem still exists on my side. I'm using the previously downloaded EMD folder with the new metadata file. This time I noticed two kinds of issues. First, some IDs are missing entirely: e.g., "2278" is the first entry in the TEST tab, but it is not in the EMD folder. Second, some IDs differ only by a leading zero: e.g., "903" is in the VALID tab, but only "0903" is in the EMD folder.

I attached the output of ls EMD > EMD_list.txt and the IDs in each split on my end for your reference.

EMD_list.txt split_valid_new.txt split_train_new.txt split_test_new.txt
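For anyone who wants to reproduce this check, here is a minimal sketch of how the two kinds of mismatch could be separated. The function name `classify_split_ids` and the report keys are hypothetical and not part of Cryo2Struct. It assumes the split IDs and the folder names have already been read into lists of strings, for example one ID per line from the attached text files.

```python
def classify_split_ids(split_ids, folder_names, width=4):
    """Sort split IDs into present, padding-only mismatches, and missing.

    Hypothetical helper for illustration only; `width=4` assumes
    folder names use four-digit IDs such as "0903".
    """
    folder = set(name.strip() for name in folder_names)
    report = {"present": [], "padding_mismatch": [], "missing": []}
    for raw_id in split_ids:
        emd_id = raw_id.strip()
        if emd_id in folder:
            # Exact match with a folder name.
            report["present"].append(emd_id)
        elif emd_id.zfill(width) in folder:
            # Matches only after restoring leading zeros (e.g. "903" vs "0903").
            report["padding_mismatch"].append(emd_id)
        else:
            # No folder at all, even after padding (e.g. "2278").
            report["missing"].append(emd_id)
    return report


if __name__ == "__main__":
    # Toy data modelled on the two examples in this comment.
    result = classify_split_ids(["903", "2278", "0903"], ["0903", "1234"])
    for kind, ids in result.items():
        print(kind, ids)
```

Keeping the padding-only mismatches in their own bucket makes it easier to tell a formatting problem in the metadata apart from entries that are really absent from the folder.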

nabingiri commented 4 months ago

Hello @RodenLuo,

First issue: The EMD IDs in the TEST tab of the metadata.xlsx file are the IDs of the test data. These test data were filtered out of the Full Dataset, which is used only for training and validating the models. The test data files are available in a separate repository: https://doi.org/10.7910/DVN/2GSSC9 .

Second issue: The number 903 shown in the VALID tab should actually be 0903. The leading zeros were accidentally removed in the Excel sheet. This is now fixed, so every EMD ID has at least four digits. The fixed Excel sheet is available here: https://doi.org/10.7910/DVN/JMN60H .
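
If you are still working from an older copy of the sheet, the leading-zero fix described above can be approximated locally. This is only a sketch under assumptions: `to_emd_id` is a hypothetical helper, not something shipped with Cryo2Struct, and it assumes IDs may have been read as integers, floats, or strings after a spreadsheet stripped the zeros.

```python
def to_emd_id(value, width=4):
    """Return an EMD ID as a zero-padded string of at least `width` digits.

    Hypothetical helper: accepts 903, 903.0, "903", or "0903" and
    returns "0903". IDs longer than `width` digits are left unchanged.
    """
    # int(float(...)) tolerates values a spreadsheet reader may hand back
    # as floats (903.0) as well as strings with or without leading zeros.
    number = int(float(value))
    return str(number).zfill(width)


if __name__ == "__main__":
    for raw in [903, 903.0, "903", "0903", "2278"]:
        print(repr(raw), "->", to_emd_id(raw))
```

Downloading the fixed sheet linked above remains the safer option; the helper only restores padding and cannot recover IDs that were altered in any other way.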