Issue loading trajectory with pyemma.coordinates.source when topology file changes

AnjaConev commented 2 years ago

Hello,

Thanks for this amazing package - it has been very fun to work with and very reliable!

I have encountered a strange issue when loading the trajectory with pyemma.coordinates.source. (PyEmma version = 2.5.7 on Debian GNU/Linux 9)

Here is the example of what is happening:

I have a trajectory and a reference PDB file (X.xtc and X.pdb). They both have 10 atoms. I load the trajectory for the first time with:

pyemma.coordinates.source("./X.xtc", top="./X.pdb")

Everything works as expected.

Later in my workflow I have new files with the same names: X.xtc, X.pdb but they now refer to a new trajectory with 18 atoms. I make the same call to load this new trajectory:

pyemma.coordinates.source("./X.xtc", top="./X.pdb")

It now fails with the error: ValueError: xyz must be shape (Any, 10, 3). You supplied (1, 18, 3)

If I reload the package and call source again with the new files it works.

It seems like the pdb file name gets cashed somewhere and pyemma.coordinates.source still thinks that we are dealing with the old X.pdb. I was able to work around this issue without reloading the package by running: mdtraj_top = mdtraj.load("./X.pdb").topology pyemma.coordinates.source("./X.xtc", top= mdtraj_top)

The code for reproducing this example and the corresponding files are in example.zip example.zip

I just wanted to report this as it might be an unexpected behavior and I was stuck for a while on this issue.

My python environment: pip_list.txt

Full error message: error_message

thempel commented 2 years ago

Hi, I can reproduce this error locally. I believe it's a problem with the topology cache. (This cache is there because it accelerates loading large sets of trajectories with the same topology.) As far as I see, it is implemented using an LRU-cache here: https://github.com/markovmodel/PyEMMA/blob/8c2bc84dc1f0230202f368b5799908c16ebd611b/pyemma/coordinates/util/patches.py#L40-L52

From this code it also becomes clear why your workaround works. Unfortunately I don't think that the cache can be turned off as there is no option in the config. Maybe @clonker knows how to fix the cache?

clonker commented 2 years ago

I'll take a look at it. :slightly_smiling_face: Thanks for reporting this @AnjaConev!

AnjaConev commented 2 years ago

Thank you both for a quick response! @thempel that explanation makes sense. Maybe you can clear the cache when a new call is made to the source api? I'm not sure if this would mess up other things though :)

clonker commented 2 years ago

Definitely sounds like a potential fix! Thanks :rocket:

clonker commented 2 years ago

Opted for extending the lru cache key by hash of the first MB of the file contents plus last modified and creation date - that should be unique enough 🙂 Although I am pretty sure things should be fixed now, perhaps you (@AnjaConev) can check your script against the devel branch to see if it works for you too.

AnjaConev commented 2 years ago

Nice approach 😄 I tested it the new version on the examples that were previously failing for me and everything worked well!! Thanks for your help!

markovmodel / PyEMMA

Issue loading trajectory with pyemma.coordinates.source when topology file changes #1541