W-OK-E opened this issue 6 months ago
Rephrasing the issue properly:
The obj_to_graph function takes forever to generate graphs for some molecules, even though they are the same size as the molecules it can randomly sample. It takes around 50 minutes to generate a graph, and even then it returns NoneType.
Not all molecules can be expressed by the default fragment set used in the code base. Normally the search stops after a while if that's the case, but I don't recall it taking up to 50 minutes. If you can give more specific examples I may be able to give better insights.
Yes, the _recursive_decompose function in frag_mol_env.py is meant to stop after more than 1000 recursive iterations, but I removed that limit in my fork because it was being a bottleneck, and even then, after about an hour, I get NoneType for the molecules I am trying to convert.
Here's how I was trying to generate graphs:
import rdkit.Chem
# First, generate RDKit molecules from the SMILES in my dataframe
objs = [rdkit.Chem.MolFromSmiles(x) for x in df["SMILES"]]
# Then build the graphs, where ctx is a FragMolBuildingEnvContext object
graphs = [ctx.obj_to_graph(x) for x in objs]
Here's how the molecules from my dataset look (structure images omitted), and here are the corresponding SMILES values:
"Nc1nnc(-c2ccc([N+](=O)[O-])o2)o1"
"O=C(C[C@H]1CC[C@@H]2C@Hcc2)O1)NCCc1ccncc1"
"CC(C)CC@@HC(=O)N[C@H]1C(=O)NC@@HC(=O)N[C@H]2C(=O)N[C@H]3C(=O)NC@Hc4cc(O)cc(O)c4-c4cc3ccc4O)[C@H]"
"(O)c3ccc(c(Cl)c3)Oc3cc2cc(c3O[C@@H]2OC@HC@@HC@H[C@H]2O[C@H]2CC@(N)C@HC@HO2)Oc2ccc(cc2Cl)[C@H]1O"
"O=C1CN(/N=C/c2ccc(N+[O-])o2)C(=O)N1"
"Cn1cnc(CCNC(=O)C[C@@H]2CC[C@@H]3C@Hcc3)O2)c1"
"O=C(C[C@@H]1CC[C@H]2C@@Hcc2)O1)NCCc1ccncc1"
"Cc1nnc(SCC2=C(C(=O)O)N3C(=O)C@@H[C@H]3SC2)s1"
"CCC@HC@HC1=NC@HC(=O)N[C@@H]2CCC(=O)O[Zn]OC(=O)C[C@H]3NC(=O)C@HNC(=O)C@@HNC(=O)C@HNC(=O)C@@HNC(=O)C@HNC3=O)NC(=O)C@HNC2=O)CS1"
Now, is it because these molecules are macrocycles, i.e. too big?
Also, I found that it is possible to build their graphs using the MolBuildingEnvContext class. Will the two produce different results? I was able to train my model that way, but the predictions don't seem right.
Also, after you are done training a model, to get the model to generate new molecules, we just do the following, right?
model.eval()
algo.create_training_data_from_own_samples(model, 8)
Or is there some other method to get the model to generate molecules post training?
It does look like these molecules contain subgraphs which are not within the default fragment set, so indeed converting them would not work. You may be able to generate your own fragment set which could express most of your compounds. I've been able to train fragment GFNs which were using 1000+ fragments.
I haven't published a proper script to do this fragment generation yet, but you should be able to reuse the code here in the original project from which GFlowNet was born.
An alternative is indeed to use the MolBuildingEnvContext, which instead of expressing molecules as graphs of fragments, expresses them as graphs of atoms. This is of course much harder because you're now training a model to generate something atom-by-atom, but it can in theory express "any" molecule.
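For illustration, here is a minimal sketch of the atom-level route, assuming the class lives in gflownet.envs.mol_building_env and exposes the same obj_to_graph interface as the fragment context; the import path and the constructor defaults are assumptions that may differ between versions:
# Hedged sketch: module path, constructor defaults and method names are assumptions.
import rdkit.Chem
from gflownet.envs.mol_building_env import MolBuildingEnvContext

atom_ctx = MolBuildingEnvContext()  # atom-by-atom environment context
mol = rdkit.Chem.MolFromSmiles("O=C1CN(/N=C/c2ccc([N+](=O)[O-])o2)C(=O)N1")
graph = atom_ctx.obj_to_graph(mol)  # expressible even when the fragment set cannot cover the molecule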
Also, after you are done training a model, to get the model to generate new molecules, we just do the following, right?
model.eval()
algo.create_training_data_from_own_samples(model, 8)
Yes this should be sufficient.
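For completeness, here is a hedged sketch of turning those samples back into SMILES; the "traj" key, its (state, action) layout, and the graph_to_obj method name are assumptions that may be named differently (e.g. graph_to_mol) depending on the version:
import rdkit.Chem

# model, algo and ctx are the trained model, algorithm and environment context from above.
model.eval()
samples = algo.create_training_data_from_own_samples(model, 8)
for s in samples:
    final_graph = s["traj"][-1][0]       # last state of the sampled trajectory (assumed layout)
    mol = ctx.graph_to_obj(final_graph)  # convert the environment graph back to an RDKit molecule
    print(rdkit.Chem.MolToSmiles(mol))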
Thanks a lot for the clarification. Will try creating appropriate fragments and putting them to use.
I created fragments specific to my dataset using BRICS from rdkit, but I have a bunch of questions:
The fragments have a numbered dummy atom (e.g. [14*]) attached at the position where the bond was cleaved (the atom-numbered renderings are omitted here). Since the fragment set needs the SMILES of each fragment along with the atom indices where other fragments can be connected, what I have done is remove the [14*] part from the fragment and take its immediate neighbour as the connecting atom. For instance, two of the raw fragment SMILES look like: '[14*]c1ccc([N+](=O)[O-])o1' and '[14*]c1nnc(N)o1'
After removing the [14*] part, the SMILES of the respective fragments look like: 'c1ccc([N+](=O)[O-])o1' and 'c1nnc(N)o1'. We can see that the carbon atoms that had [14*] attached to them have now taken over its index position, so is it right if finally I consider the fragments and their respective atom_idx to be:
('c1ccc([N+](=O)[O-])o1', [0]) and ('c1nnc(N)o1', [0])
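For concreteness, here is a minimal sketch of that stripping step using rdkit's BRICSDecompose; the helper name and the index remapping at the end are illustrative assumptions, not the project's own fragment script:
from rdkit import Chem
from rdkit.Chem import BRICS

def brics_fragments_with_attachments(smiles):
    # Return (fragment_smiles, attachment_atom_indices) pairs for one molecule.
    mol = Chem.MolFromSmiles(smiles)
    out = []
    for frag_smi in BRICS.BRICSDecompose(mol):
        rw = Chem.RWMol(Chem.MolFromSmiles(frag_smi))
        # BRICS dummy atoms ([14*] etc.) have atomic number 0; their neighbours
        # are the attachment points, so tag them before deleting the dummies.
        dummies = [a.GetIdx() for a in rw.GetAtoms() if a.GetAtomicNum() == 0]
        for d in dummies:
            for nb in rw.GetAtomWithIdx(d).GetNeighbors():
                nb.SetBoolProp("attachment", True)
        for d in sorted(dummies, reverse=True):  # delete from the highest index down
            rw.RemoveAtom(d)
        Chem.SanitizeMol(rw)
        frag_out = Chem.MolToSmiles(rw)
        # Canonicalisation may reorder atoms, so re-express the attachment
        # indices in the atom order of the output SMILES.
        canon = Chem.MolFromSmiles(frag_out)
        match = canon.GetSubstructMatch(rw)  # match[i] = index in canon of rw atom i
        attach = sorted(match[a.GetIdx()] for a in rw.GetAtoms() if a.HasProp("attachment"))
        out.append((frag_out, attach))
    return out
If you instead store the stripped SMILES string exactly as it is (without re-canonicalising), the attachment carbon simply keeps its position in that string, which is what gives index 0 in the example above.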
is it right if finally I consider the fragments and their respective atom_idx to be:
Yes, that's more or less how we did it. We then look for fragments that are used more than once, and if across their repetitions they use different attachment atoms, we add those atoms to the list.
so the molecule itself ends up being the fragment
This is a bit trickier, but a reasonable heuristic is to find carbon atoms with free hydrogens/valence and use those as attachment atoms.
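As an rdkit illustration of that heuristic (the helper name is hypothetical):
from rdkit import Chem

def candidate_attachment_atoms(frag_smiles):
    # Carbons that still carry hydrogens have free valence and can accept a new bond.
    mol = Chem.MolFromSmiles(frag_smiles)
    return [a.GetIdx() for a in mol.GetAtoms()
            if a.GetSymbol() == "C" and a.GetTotalNumHs() > 0]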
@bengioe has something changed in the code base? The Colab kernel seems to be crashing at the algo.create_training_data... call. At first I thought there might be an issue with my fork or my data, but even the getting-started Colab notebook linked in the implementation notes throws the same error.
Even if that error turns out to be exclusively on my side, I had one more issue prior to it. I was able to construct the graphs after feeding in my own fragments for the molecules that I have, and I have the corresponding reward values as well. Here's the file: https://drive.google.com/file/d/15VZ3ixBYKoWy_OUnEeuwnmsdPSZiTCdQ/view?usp=sharing But while training, I get NaN losses and the model doesn't learn anything. I tried to trace where it shoots off to NaN but couldn't. Any chance you could help with this?
PS: The pickle file in the drive link contains a dictionary with a 'Graphs' key containing the graphs and a 'Rewards' key containing the corresponding rewards.
Alright, I was able to resolve the kernel-crashing error by going to Tools -> Command palette -> Use fallback runtime, as it seems Colab has gone through some updates, but I still can't resolve the NaN loss issue.
You're right, Colab updated their torch version, which broke the torch_geometric pip install command. I updated it; it should now be:
!pip install torch_geometric torch_sparse torch_scatter rdkit gitpython omegaconf wandb --find-links https://data.pyg.org/whl/torch-2.3.0+cu121.html
I can't really help with the specific files you're sharing, but one thing to make sure of is that your reward is always strictly positive, so that the log-reward is not NaN (or inf). Otherwise, when does it become NaN? After a single gradient step, or many? Does lowering the learning rate slow this down?
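As a concrete sanity check along those lines (illustrative; raw_rewards stands in for whatever array of rewards you feed the trainer):
import torch

rewards = torch.as_tensor(raw_rewards, dtype=torch.float)
assert torch.isfinite(rewards).all() and (rewards > 0).all(), "rewards must be finite and strictly positive"
log_rewards = rewards.clamp(min=1e-30).log()  # keeps the log-reward finite even for tiny rewards
assert torch.isfinite(log_rewards).all()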
Thank you for the torch-geometric clarification.
And on the NaN values: yes, I am making sure that the reward values are all positive, and the training steps all look like this (loss screenshot omitted): the loss is NaN right from the start. Initially, I remember getting real values for the first two epochs or so, after which it shot off to inf or NaN, but now it's NaN everywhere.
Following your suggestion, I tried increasing and decreasing the learning rate, but the NaN values remained.
Alright @bengioe, I think I know why these NaN values are there: there seems to be a mismatch in the way I have indexed the fragments, and I need to look deeper into that. Thanks a lot for your guidance; I will get back to you once I have improved the fragments. Thanks again.