Identical molecule found for num_molecules > 1

hoelzerC commented 1 month ago

It seems that always the same molecule is found when choosing num_molecules > 1.

Can easily be reproduced, taking the default config.

hoelzerC commented 1 month ago

Same result for CLI usage: mindlessgen --num-molecules 2

hoelzerC commented 1 month ago

Ok, I resolved the issue. It seems to be a cross-issue with other packages installed in my virtual environment. Having a plain mindlessgen environment allows for generation of multiple molecules. 🎉️

As a remark, maybe include some sanity check that not the same hash is used for multiple runs. E.g. check the molecule ID written to mindless.molecules.

marcelmbn commented 1 month ago

Great to hear! 🎉 Do you know the exact reason for np.random giving always the same result? So that we can include it to a Known Issues section?

hoelzerC commented 1 month ago

Hard to say, especially because the numpy.random seed was different at each run. Guess it was an exotic cross-effect. Hopefully, this should not happen to too many users. As mentioned above, checking for identical output might be a good sanity check and worth raising a notification.

XinChenQC commented 1 month ago

My solution is to add a random number seed associated with time like this, It's not an elegant way, but works

in /src/mindlessgen/molecules/generate_molecule.py

357     def generate_random_coordinates(at: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
358         """
359         Generate random coordinates for a molecule.
360         """
361         # import the time module
362         import time
363
364         # get the current time in seconds since the epoch
365         seconds = time.time()
366
367         atilist: list[int] = []
368         xyz = np.zeros((sum(at), 3))
369         numatoms = 0
370         np.random.seed(int(seconds)+1929)

marcelmbn commented 1 month ago

My solution is to add a random number seed associated with time like this, It's not an elegant way, but works

in /src/mindlessgen/molecules/generate_molecule.py

357     def generate_random_coordinates(at: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
358         """
359         Generate random coordinates for a molecule.
360         """
361         # import the time module
362         import time
363
364         # get the current time in seconds since the epoch
365         seconds = time.time()
366
367         atilist: list[int] = []
368         xyz = np.zeros((sum(at), 3))
369         numatoms = 0
370         np.random.seed(int(seconds)+1929)

Thanks for your feedback! But shouldn't the np.random.seed be not fixed in standard applications and clean environments? Do you have an idea how to check for this issue before starting a calculation or how to get rid of it?

EDIT: The np.random.seed(<value>) assignment acts as a global variable and can therefore be modified in other code pieces or previous numpy executions. Might contained in this blog post, in which the current practice in mindlessgen is considered as legacy best practice: https://builtin.com/data-science/numpy-random-seed The correct way to do it would probably be the following (setting up an own random number generator instance): https://numpy.org/doc/stable/reference/random/index.html

XinChenQC commented 1 month ago

Thanks for your reply, My guess is the 'random number generator' in the multiprocessing pool generates the same seeds for each subprocess. Just like discussion in this question, https://stackoverflow.com/questions/29854398/seeding-random-number-generators-in-parallel-programs

grimme-lab / MindlessGen

Identical molecule found for num_molecules > 1 #64