gnina / libmolgrid

Comprehensive library for fast, GPU accelerated molecular gridding for deep learning workflows
https://gnina.github.io/libmolgrid/
Apache License 2.0

The ExampleProvider populate part is not working properly. #105

Closed: drorhunvural closed this issue 1 year ago

drorhunvural commented 1 year ago

I have a train.types file with five labels (1, 2, 3, 4, and 5), as shown below. I'm trying to create an ExampleProvider and populate it from traintypes. The populate step seems to work, judging by the size printed by the code below. My question is: why does e.num_labels() return 4? It should be 5. Really weird.

4 96.57854128835454 35.182186111609546 47.35858675758119 4zsl_protein_nowat.gninatypes
1 15.384651055685152 0.1055870566421437 -7.609890663317186 5c7n_protein_nowat.gninatypes
2 -36.32161417155034 33.20023494938732 11.454502077783696 4bdb_protein_nowat.gninatypes
2 56.45320141777738 3.60688178113592 10.212373461244589 6fo5_protein_nowat.gninatypes
3 19.59233057392693 83.04969463802207 4.276684761862489 3zcl_protein_nowat.gninatypes
1 7.175476163847336 -10.649742206067383 12.110129595973715 5izc_protein_nowat.gninatypes
5 -8.74497007306576 1.012669402462216 14.819772695425884 2vvv_protein_nowat.gninatypes
...
import molgrid
import numpy as np
import torch

gninatypes_folder = '/allgninatypesfolder'
traintypes = '/train.types'
testtypes = '/test.types'
molcache_file = '/train.molcache'

molgrid.set_random_seed(0)
torch.manual_seed(0)
np.random.seed(0)

batch_size = 50

e = molgrid.ExampleProvider(data_root=gninatypes_folder, shuffle=True, recmolcache=molcache_file, stratify_receptor=True)
e.populate(traintypes)
print("Size: ", e.size())
print("Num_Types: ", e.num_types())
print("Num Labels: ", e.num_labels())

Output:
Size: 2374
Num Types: 28
Num Labels: 4 (Wrong!)

dkoes commented 1 year ago

There are four labels in the file you provide above. The receptor structure is not a label.

drorhunvural commented 1 year ago

Doesn't the first column represent the label, or am I wrong? The first column of the file I mentioned above holds the label, and its values are 1, 2, 3, 4, and 5. The full file actually has 3000 lines, and each label accounts for about 20% of them.

dkoes commented 1 year ago

Each line starts with four numbers. Those are the labels. There are four of them. If what you want is the number of unique values for a given label, you will need to compute that yourself by iterating over the dataset.
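A minimal sketch of that counting step, assuming the class label sits in the first whitespace-separated column of each line. It reads the .types lines directly in plain Python rather than going through molgrid; the helper name count_label_values is hypothetical, and the sample lines are the ones quoted earlier in this thread.

```python
from collections import Counter


def count_label_values(lines, column=0):
    """Count how often each value occurs in one column of .types-style lines.

    `column` is the zero-based index of the label column (assumed to be 0 here).
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        counts[fields[column]] += 1
    return counts


# Sample lines copied from the train.types excerpt in this issue.
sample = [
    "4 96.57854128835454 35.182186111609546 47.35858675758119 4zsl_protein_nowat.gninatypes",
    "1 15.384651055685152 0.1055870566421437 -7.609890663317186 5c7n_protein_nowat.gninatypes",
    "2 -36.32161417155034 33.20023494938732 11.454502077783696 4bdb_protein_nowat.gninatypes",
    "2 56.45320141777738 3.60688178113592 10.212373461244589 6fo5_protein_nowat.gninatypes",
    "3 19.59233057392693 83.04969463802207 4.276684761862489 3zcl_protein_nowat.gninatypes",
    "1 7.175476163847336 -10.649742206067383 12.110129595973715 5izc_protein_nowat.gninatypes",
    "5 -8.74497007306576 1.012669402462216 14.819772695425884 2vvv_protein_nowat.gninatypes",
]

counts = count_label_values(sample)
num_classes = len(counts)  # number of distinct values in the first column
print(num_classes, dict(counts))
```

For a real file, the same function can take an open file handle, for example `with open(path) as f: count_label_values(f)`, since it only iterates over lines.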

drorhunvural commented 1 year ago

"Each line starts with four numbers." I don't understand what you mean by this sentence. I have uploaded a small sample, which I claim has five different labels. The ".types" file has a total of 104 lines.

Should I understand from your sentence that molgrid only allows up to 4 different labels unless I do something extra?

Edit: My problem is a classification task, using a CNN, on a dataset that has 5 different labels. I put each label in the first column of the types file, and the values range from 1 to 5.

train_try.zip

drorhunvural commented 1 year ago

In the answer you gave in #96, you said that the first column represents the label. I am saying that I have increased the number of label values to 5 (1, 2, 3, 4, 5), but you say that I have 4 labels. I do not understand this. :)

dkoes commented 1 year ago

In that instance the first column was the binary classification label. It wasn't the only label, nor did I say it was. Each line in your input file is an example and each example has four labels.

4 96.57854128835454 35.182186111609546 47.35858675758119 4zsl_protein_nowat.gninatypes

4 - first label
96.57854128835454 - second label
35.182186111609546 - third label
47.35858675758119 - fourth label

Hence, num_labels is 4.
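The rule described above can be illustrated with a short sketch that counts the numeric columns before the first non-numeric field (the structure file name). This is an assumption-laden illustration of the idea, not libmolgrid's actual parser; the helper name count_leading_labels is hypothetical.

```python
def count_leading_labels(line):
    """Count the numeric fields at the start of a .types-style line.

    Heuristic only: a field counts as a label if float() accepts it, and
    counting stops at the first field that does not parse as a number.
    """
    n = 0
    for field in line.split():
        try:
            float(field)
        except ValueError:
            break  # first non-numeric field, taken to be the file name
        n += 1
    return n


# Example line copied from this thread.
example = (
    "4 96.57854128835454 35.182186111609546 47.35858675758119 "
    "4zsl_protein_nowat.gninatypes"
)
labels_per_example = count_leading_labels(example)
print(labels_per_example)
```

Under this reading, the three coordinate-like values after the class column count as labels too, which is why the count is per example rather than per distinct class.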

drorhunvural commented 1 year ago

Thank you very much for the information. I thought e.num_labels() was showing the number of different labels in my first column, because until today we were always dealing with 4 different classification problems. It's good to know this before publishing our paper and referencing the molgrid paper. It is very important that you answer our questions here. I appreciate you guiding us in the right direction.