openbabel / openbabel

Open Babel is a chemical toolbox designed to speak the many languages of chemical data.
http://openbabel.org/
GNU General Public License v2.0
1.09k stars 416 forks source link

smiles to xyz conversion isn't repeatable #1934

Open yurivict opened 5 years ago

yurivict commented 5 years ago

With this x.smi file: O=C(O)CCCCC I run this command:

obabel -i smi x.smi -o xyz -O x1.xyz --gen3d

The produced outputs are a little bit different with every run:

$ diff x1.xyz x2.xyz 
3,22c3,22
< O          2.80897        0.95953       -0.55185
< C          2.31033       -0.02262       -0.02710
< O          0.97006       -0.19149        0.06313
< C          3.07414       -1.17334        0.58772
< C          4.58550       -0.97549        0.48201
< C          5.34900       -2.14439        1.10655
< C          6.86037       -1.94108        0.99792
< C          7.62678       -3.09920        1.61670
< H          0.70739       -1.02022        0.50592
< H          2.78449       -2.09272        0.06679
< H          2.78449       -1.25134        1.64155
< H          4.87040       -0.04029        0.98021
< H          4.87040       -0.86979       -0.57232
< H          5.06672       -3.07791        0.60475
< H          5.06672       -2.24615        2.16150
< H          7.14793       -1.01060        1.50124
< H          7.14793       -1.84227       -0.05533
< H          7.38728       -3.20534        2.67958
< H          8.70461       -2.93105        1.52686
< H          7.38728       -4.04175        1.11413
---
> O          2.79827        0.94419       -0.78732
> C          2.30225        0.05508       -0.11491
> O          0.96238       -0.10012        0.00253
> C          3.06876       -0.98420        0.67100
> C          4.57978       -0.80310        0.53395
> C          5.34512       -1.85925        1.33266
> C          6.85617       -1.67396        1.19240
> C          7.62428       -2.72048        1.98393
> H          0.70195       -0.85007        0.56965
> H          2.78027       -1.97412        0.30044
> H          2.78035       -0.89719        1.72445
> H          4.86364        0.19878        0.87975
> H          4.86359       -0.86301       -0.52423
> H          5.06357       -2.86003        0.98314
> H          5.06370       -1.79534        2.39085
> H          7.14294       -0.67605        1.54402
> H          7.14291       -1.74056        0.13642
> H          7.38553       -2.65956        3.05050
> H          8.70190       -2.56728        1.86813
> H          7.38557       -3.73020        1.63495

The actual molecules are the same, typical RMSD differences are 0.00190, 0.00102, 0.00067.

OpenBabel should never use random numbers to randomize molecules, because this is bad for testing frameworks that expect repeatability.

n-yoshikawa commented 5 years ago

We use random offset because we don't want to plant all base atoms at exactly the same spot in the process of coordinate generation (Edited): https://github.com/openbabel/openbabel/blob/master/src/builder.cpp#L1141

Maybe we should introduce a new option to fix random seed.

yurivict commented 5 years ago

But xyz coordinates after gen3D are supposed to be all different, no?

n-yoshikawa commented 5 years ago

Yes. Random numbers are used in the process of 3D coordinate generation.

yurivict commented 5 years ago

Yes. Random numbers are used in the process of 3D coordinate generation.

But this isn't needed because coordinates should be all different. No need to add random numbers.

n-yoshikawa commented 5 years ago

SMILES format does not include the information of 3D coordinates, so we must predict them. The coordinates should be different in the end of the coordinate generation, but they are unknown in the beginning. Random offset is used in this prediction and we had problems without random numbers.

yurivict commented 5 years ago

gen3D plugin should erase randomness, and rewrite them with new values it generates. Why does it keep random coordinates?

n-yoshikawa commented 5 years ago

Because newly generated values have randomness because of the nature of the coordinate generation algorithm which uses random numbers inside.

We will update the coordinate generation algorithm in the near future, and I want to make it reproducible. In my opinion, introducing a new option to set random seed is a good option.

baoilleach commented 5 years ago

I agree that the same coordinates should be generated each time. Also, we should not use random numbers. The original problems for which they were introduced can be solved without random numbers.

yurivict commented 4 years ago

@baoilleach @ghutchis

This is still a problem in OpenBabel-3.1.1

Is there a workaround to get rid of randomness?

n-yoshikawa commented 4 years ago

I totally agree that 3D structure generation should be repeatable, but randomness is sometimes useful in order to generate multiple conformations. Especially, distance geometry, a new 3D structure generation algorithm introduced in Open Babel 3.1.0, needs to use random numbers to generate multiple conformations.

I believe introducing a new argument to specify the random seed is the best solution. However, I'm wondering how to fix the seed, since OBRandom is removed from the public API in Open Babel 3.0.0 (#1954).

yurivict commented 4 years ago

It's a very bad idea to merge operations that don't belong together. In this case there seem to be two operations: 1. SMILES->COORDS convertor, 2. coordinate randomizer, that are merged into one.

Some people only need the convertor, some people only need the randomizer, and some people need both, but OpenBabel merged these two into one and forces everybody to have both together.

Just split them into two separate functions, and this would solve problems like this one.

e-kwsm commented 4 years ago

I'm thinking to add seed parameters, defaulted to e.g. epoch, to functions that use OBRandom (will be added to #2241).

e-kwsm commented 4 years ago

I added Seed method to the classes that has prng, which enables a user to manually feed a seed before using random numbers.

At present I have no modifications on functions and executables which indirectly uses random numbers, e.g., obabel.

baoilleach commented 4 years ago

Regarding the original question, we shouldn't be using random number generators in 3D coordinate generation (outside the distance geometry). There is no need for them, and they should be removed. In 3.0 I removed OBRandom from the public API so that this change could be made more easily, and to avoid future uses of random numbers. PRs accepted to fix this problem.

n-yoshikawa commented 4 years ago

@e-kwsm Thank you for your work. I will take a look.

@baoilleach I understand that randomness should be removed. I would like to work on it, but I don't have a good idea about how to avoid using random numbers. For example, this procedure adds random offset to avoid planting atoms at the same place. How can I give a good offset which is deterministic but (almost) always different? https://github.com/openbabel/openbabel/blob/4709b0752109db593f1bee6298ef3bfa718d260d/src/builder.cpp#L1290

Another problem is how to realize trial and error without random numbers like this: https://github.com/openbabel/openbabel/blob/4709b0752109db593f1bee6298ef3bfa718d260d/src/builder.cpp#L315

I would like to know your idea about how to avoid using random numbers in these situations.

ghutchis commented 4 years ago

@baoilleach - while it's good to minimize randomness through the use of designed seeds, generating 3D geometries require stochastic methods. Beyond the basic builder, there's the conformer searching, distance geometry, …

I asked @e-kwsm to allow the seed to be set from the command-line, which seems like it would solve this bug.

baoilleach commented 4 years ago

I think we agree that when we need random numbers we should use them. My opinion is that we shouldn't use them otherwise.

@n-yoshikawa In the first case, if you look at the original bug where this was the fix, I think shifting the salt along the x-axis past all of the existing atoms will fix this. In the second case, iterating over a fixed list of 9 vectors that cover the sphere is sufficient - we just need to find one that is not aligned or othogonal to the query vector (or something like this).

e-kwsm commented 4 years ago

It is hard to change all APIs related to PRNG, so I added OB_RANDOM_SEED environment variable (#2241).

$ export OB_RANDOM_SEED=42
$ diff <(obabel -:CCC --gen3D -oxyz) <(obabel -:CCC --gen3D  -oxyz)
1 molecule converted
1 molecule converted
jasondbiggs commented 3 years ago

@ghutchis and @e-kwsm what is the status of this? It would be great to be able to pass in an option to obabel giving a seed so that the --gen3d option always returns the same geometry. The last commit here gives a workable solution I think, where you would have to set the seed as an environment variable before calling obabel, but it hasn't been merged.

e-kwsm commented 3 years ago

2241 is ready.

byte-for-byte commented 3 years ago

Hi, I would also be very interested in the suggested solution to provide the env var OB_RANDOM_SEED. I this change usable at all at this point? thanks

jasondbiggs commented 2 years ago

@byte-for-byte it is indeed usable. I have built from the branch in that pull request and everything works as expected.