markovmodel / adaptivemd

A python framework to run adaptive Markov state model (MSM) simulation on HPC resources
GNU Lesser General Public License v2.1

mongodb file size limitations #26

Closed thempel closed 7 years ago

thempel commented 7 years ago

I just came across a pymongo.errors.DocumentTooLarge error on a very small dataset, probably because I accidentally produced a huge transition matrix. We might need to solve this problem sooner or later, especially because the discrete trajectories become larger and larger with time...

jhprinz commented 7 years ago

Agreed, I need to think about the best way to do that.

  1. Store the complete file, but then you cannot search or do cool stuff like with the other objects in it.
  2. Break the large object down into separate parts. That already works, but you need to know how to use subobjects in the Model you return. This also allows easy access to subparts. Example: we create a DiscretizedTrajectory object and, instead of writing one array of n_traj x length, write n_traj separate objects and then only store references. Could still be too small...
  3. Run the picking of frames on the cluster and only return the new frames. That should also be possible already, but you need to write a function that does that.
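
For illustration only, option 2 could look roughly like the sketch below. It is not adaptivemd's actual storage code: a plain dict stands in for a database collection, `store_split`, `load_trajectory` and `approx_size` are hypothetical helper names, and the size check uses the JSON length as a crude proxy for the real BSON size. The 16 MiB figure is MongoDB's documented per-document limit.

```python
import json
import uuid

# MongoDB's documented maximum BSON document size (16 MiB).
MAX_DOCUMENT_BYTES = 16 * 1024 * 1024


def approx_size(document):
    """Rough size proxy: length of the JSON encoding, not the real BSON size."""
    return len(json.dumps(document).encode("utf-8"))


def store_split(collection, dtrajs):
    """Store each discrete trajectory as its own document and return a
    small parent document that holds only references to them.

    `collection` is a plain dict standing in for a database collection
    (assumption made for this sketch).
    """
    refs = []
    for traj in dtrajs:
        doc = {"_id": uuid.uuid4().hex, "states": list(traj)}
        if approx_size(doc) > MAX_DOCUMENT_BYTES:
            # Splitting per trajectory does not help if one part is too big.
            raise ValueError("a single trajectory is still too large")
        collection[doc["_id"]] = doc
        refs.append(doc["_id"])
    parent = {"_id": uuid.uuid4().hex, "n_traj": len(refs), "traj_refs": refs}
    collection[parent["_id"]] = parent
    return parent


def load_trajectory(collection, parent, index):
    """Resolve one reference without touching the other trajectories."""
    return collection[parent["traj_refs"][index]]["states"]


collection = {}
dtrajs = [[0, 1, 1, 2], [2, 2, 0], [1, 0, 0, 0, 1]]
parent = store_split(collection, dtrajs)
second = load_trajectory(collection, parent, 1)
```

The parent document stays small regardless of trajectory length, but the number of stored objects grows with every model, which is the trade-off raised in the reply that follows.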

thempel commented 7 years ago

Hmm. What about option 1 with additionally copying the file into the working directory on the user's machine? I assume it wouldn't be usable within the DB, but it could be loaded into the script/notebook the user is using as a NumPy array. About option 2, I'm a bit sceptical because in my experience, there will be a lot of (potentially useless...) MSMs which we don't really need to store. So chopping everything up and storing it in the DB might just artificially blow things up.
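
A minimal sketch of that idea, assuming the large object is written to a file and the database only keeps a small stub pointing at it. Everything here is hypothetical: `store_as_file` and `load_from_stub` are made-up names, a dict again stands in for the DB, and JSON is used only to keep the example free of dependencies. With NumPy one would more likely use `numpy.save` and `numpy.load`.

```python
import json
import os
import tempfile


def store_as_file(collection, name, matrix, directory):
    """Write `matrix` to a file and keep only a small stub in the database.

    `collection` is a plain dict standing in for the DB (assumption).
    """
    path = os.path.join(directory, name + ".json")
    with open(path, "w") as handle:
        json.dump(matrix, handle)
    # The stub is searchable metadata; the bulky data never enters the DB.
    stub = {"name": name, "path": path, "shape": [len(matrix), len(matrix[0])]}
    collection[name] = stub
    return stub


def load_from_stub(stub):
    """Load the full object back into the user's script or notebook."""
    with open(stub["path"]) as handle:
        return json.load(handle)


workdir = tempfile.mkdtemp()  # stands in for the user's working directory
collection = {}
transition_matrix = [[0.9, 0.1], [0.2, 0.8]]
stub = store_as_file(collection, "msm_0", transition_matrix, workdir)
restored = load_from_stub(stub)
```

The stub can still be queried by name or shape, while the contents of the matrix are only reachable by loading the file.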

jhprinz commented 7 years ago

> What about option 1 with additionally copying the file into the working directory on the user's machine?

That would be no problem, I guess. Good idea. It will require some thinking about the implementation; you might not even have to write it to disk.

Actually I just checked. This is really super simple...