recursionpharma / gflownet

GFlowNet library specialized for graph & molecular data
MIT License
211 stars 41 forks source link

FragMolBuildingEnvContext function obj_to_graph unable to generate Graph for certain molecules #133

Open W-OK-E opened 6 months ago

W-OK-E commented 6 months ago

The function obj_to_graph takes forever to generate graphs for some molecules even though the are of the same size as the molecules that it can randomly sample. So it takes like 50 minutes to generate the graph and even then it returns Nonetype

W-OK-E commented 6 months ago

Rephrasing the issue properly

bengioe commented 6 months ago

Not all molecules can be expressed by the default fragment set used in the code base. Normally the search stops after a while if that's the case, but I don't recall it taking up to 50 minutes. If you can give more specific examples I may be able to give better insights.

W-OK-E commented 6 months ago

Yes, the _recursive_decompose function in frag_mol_env.py is meant to stop if it takes more than 1000 recursive iterations, but I removed it in my fork as it was being a bottle neck, and even then after about an hour, I get Nonetype for the molecules that I am trying to convert.

Here's how I was trying to generate graphs:

objs = [rdkit.Chem.MolFromSmiles(x) for x in df["SMILES"]]

//So first generating a bunch of molecules from the Smiles that I have in my dataframe

graphs = [ctx.obj_to_graph(x) for x in objs], where ctx is FragMolBuildingEnvContext object

Now here's how my molecules from the dataset look like: image And here are the corresponding Smiles Values: "Nc1nnc(-c2ccc(N+[O-])o2)o1"

"O=C(C[C@H]1CC[C@@H]2C@Hcc2)O1)NCCc1ccncc1"

"CC(C)CC@@HC(=O)N[C@H]1C(=O)NC@@HC(=O)N[C@H]2C(=O)N[C@H]3C(=O)NC@Hc4cc(O)cc(O)c4-c4cc3ccc4O)[C@H]"

"(O)c3ccc(c(Cl)c3)Oc3cc2cc(c3O[C@@H]2OC@HC@@HC@H[C@H]2O[C@H]2CC@(N)C@HC@HO2)Oc2ccc(cc2Cl)[C@H]1O"

"O=C1CN(/N=C/c2ccc(N+[O-])o2)C(=O)N1"

"Cn1cnc(CCNC(=O)C[C@@H]2CC[C@@H]3C@Hcc3)O2)c1"

"O=C(C[C@@H]1CC[C@H]2C@@Hcc2)O1)NCCc1ccncc1"

"Cc1nnc(SCC2=C(C(=O)O)N3C(=O)C@@H[C@H]3SC2)s1"

"CCC@HC@HC1=NC@HC(=O)N[C@@H]2CCC(=O)O[Zn]OC(=O)C[C@H]3NC(=O)C@HNC(=O)C@@HNC(=O)C@HNC(=O)C@@HNC(=O)C@HNC3=O)NC(=O)C@HNC2=O)CS1"

Now is it because, these molecules are MacroCycles i.e. too big?

Also, I found that it is possible to build their graphs if using MolBuildingEnvContext class, will the two produce different results? Because I was able to train my model, but the predictions don't seem right.

Also after you are done training a model, to get the model to generate new molecules , we just do the following right model.eval() algo.create_training_data_from_own_samples(model, 8)

Or is there some other method to get the model to generate molecules post training?

bengioe commented 6 months ago

It does look like these molecules contain subgraphs which are not within the default fragment set, so indeed converting them would not work. You may be able to generate your own fragment set which could express most of your compounds. I've been able to train fragment GFNs which were using 1000+ fragments.

I haven't published a proper script to do this fragment generation yet, but you should be able to reuse the code here in the original project from which GFlowNet was born.

An alternative is indeed to use the MolBuildingEnvContext, which instead of expressing molecules as graphs of fragments, expresses them as graphs of atoms. This is of course much harder because you're now training a model to generate something atom-by-atom, but it can in theory express "any" molecule.

Also after you are done training a model, to get the model to generate new molecules , we just do the following right model.eval() algo.create_training_data_from_own_samples(model, 8)

Yes this should be sufficient.

W-OK-E commented 6 months ago

Thanks a lot for the clarification. Will try creating appropriate fragments and putting them to use.

W-OK-E commented 6 months ago

I created fragments specific to my dataset using BRICS from rdkit, but I have a bunch of questions:

  1. Take for instance this molecule: image When BRICS generates the fragments, they look something like: image

i.e. They have a number in asterisk attached to the region where the bond was cleaved, now if we number the atoms, they look like: image Now, since for the fragments set, we need to have the SMILES of the molecules along with the atom index where other fragments can be connected, what I have done is to remove the [int] from the molecules and take it's immediate neighbor as the connecting node, so for instance, for the above fragments, the unchanged smiles look like: '[14]c1ccc(N+[O-])o1', '[14*]c1nnc(N)o1'

After removing the [14] part, the smiles and their respective fragments look like: 'c1ccc(N+[O-])o1' , 'c1nnc(N)o1' image Now, we can see that the Carbon atoms that had 14 attached to them have now taken upon their index positions, so is it right if finally I consider the fragments and their respective atom_idx to be:

('c1ccc(N+[O-])o1',[0]) and ('c1nnc(N)o1',[0])

  1. For some molecules in my dataset, BRICS doesn't find any fragments, so the molecule itself ends up being the fragment. So in that case naturally the atom_idx list is empty, will that hinder the learning process in any way. e.g. 'Cc1ncc[nH]1' image
bengioe commented 6 months ago

is it right if finally I consider the fragments and their respective atom_idx to be:

Yes, that's more or less how we did it. We then look for fragments that are used more than once, and if in their repetitions they are using different attachment atoms we add those atoms to the list.

so the molecule itself ends up being the fragment

This is a bit trickier, but a reasonable heuristic is to find carbon atoms with free hydrogens/valence and use those as attachment atoms.

W-OK-E commented 5 months ago

@bengioe has something changed in the code base? Because the colab kernel seems to be crashing at the algo.create_training_data... function. At first I thought, there might be some issues with my fork or the data, but even the getting started colab notebook linked in the implementation notes throws the same issue.

If the error is exclusively on my side, then I had one more issue prior to this. I was able to construct the graphs after feeding the model my own fragments for the molecules that I have, and I have the corresponding reward values as well. Here's the file: https://drive.google.com/file/d/15VZ3ixBYKoWy_OUnEeuwnmsdPSZiTCdQ/view?usp=sharing But while training, I get nan losses and the model doesn't learn anything. I tried to trace where it might be shooting off to nan but couldn't. Any chance you could help with this?

PS: The picke file in the drive link contains a dictionary with 'Graphs' key containing the graphs, and 'Rewards' Key containing the corresponding rewards

W-OK-E commented 5 months ago

Alright, I was able to resolve the kernel crashing error by going to tools->command palette->Use Fallback Runtime as it seems that Colab has gone through some updates, but still can't resolve the null value error

bengioe commented 5 months ago

You're right, colab updated their torch version which broke the torch_geometric pip install command. I updated it, should now be !pip install torch_geometric torch_sparse torch_scatter rdkit gitpython omegaconf wandb --find-links https://data.pyg.org/whl/torch-2.3.0+cu121.html.

bengioe commented 5 months ago

I can't really help with the specific files you're sharing, but maybe one thing to make sure of is that your reward is always strictly positive so that the log-reward is not NaN (or inf). Otherwise, when does it NaN? After a single gradient step, or many? Does the learning rate slow this down?

W-OK-E commented 5 months ago

Thank you for the torch-geometric clarification.
And on the null values, yes, I am making sure that the reward values are all positive, and the training steps all look something like this: image

So it outright starts with null values. Initially, I remember getting real values for the first two epochs or so after which it shoots off to inf or Nan, but now it's just null everywhere.

From your suggestion, I tried inc/dec the learning rate, but the null values remained.

W-OK-E commented 5 months ago

Alright, @bengioe I think I know why these null values are there, there seems to be some mismatch in the way I have indexed the fragments, I need to look deeper into that. Thanks a lot for your guidance, will get back to you with the improvements in the fragments. Thanks again.