`train_smiles` and `novel_smiles` can contain invalid SMILES strings, whereas `np_fps` has no entry for them: if a molecule is found to be invalid, it is skipped (line 49).
This wasn't taken into account when creating the labels with `labels = [1] * len(train_smiles) + [0] * len(novel_smiles)`, so `labels` could end up longer than `np_fps` and misaligned with it. Fixed by appending to `labels` inside the same loop that appends to `np_fps`.
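
For reviewers, here is a minimal sketch of the pattern the fix follows. It is not the actual code from this PR: `toy_fingerprint` and `build_dataset` are hypothetical names, and the validity check is a stand-in for whatever the real featurizer does when it rejects a molecule. Only `train_smiles`, `novel_smiles`, `np_fps`, and `labels` come from the change itself.

```python
def toy_fingerprint(smiles):
    """Hypothetical stand-in for the real featurizer.

    Returns None for input it treats as invalid, mirroring the
    "skip invalid molecules" behaviour described above.
    """
    if not smiles or "?" in smiles:
        return None
    # Toy fixed-length feature vector: counts of a few atom symbols.
    return [smiles.count(ch) for ch in "CNO"]


def build_dataset(train_smiles, novel_smiles, fingerprint=toy_fingerprint):
    """Build fingerprints and labels together so they stay aligned."""
    np_fps, labels = [], []
    for label, smiles_list in ((1, train_smiles), (0, novel_smiles)):
        for smi in smiles_list:
            fp = fingerprint(smi)
            if fp is None:
                # Invalid molecule: skip the fingerprint AND its label.
                continue
            np_fps.append(fp)
            labels.append(label)
    return np_fps, labels


# Each list contains one entry the toy featurizer rejects.
train_smiles = ["CCO", "??", "CCN"]
novel_smiles = ["c1ccccc1", "", "CC(=O)O"]

np_fps, labels = build_dataset(train_smiles, novel_smiles)
old_labels = [1] * len(train_smiles) + [0] * len(novel_smiles)
```

With these inputs, `old_labels` has six entries while `np_fps` has four, which is the mismatch described above. Appending the label in the same branch as the fingerprint keeps the two lists the same length by construction.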