markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0
307 stars 118 forks source link

[source] should be allowed to produce output with very high stride #480

Closed gph82 closed 7 years ago

gph82 commented 9 years ago

I'm aware that the entire transformation is jut to huge for memory, but with a (ridiculously) high enough stride I should be able to get 4124 frames x 57 dimensions, right?

print my_SOURCE.dimension()
print my_SOURCE.n_frames_total()
mystride = int(1e3)
print my_SOURCE.n_frames_total()/mystride
#57 ndim
#4124963 n_original frames
#4124 n_frames after stride

However:


D = my_SOURCE.get_output(stride=mystride)

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-52-b6755ba4bb48> in <module>()
----> 1 D = my_SOURCE.get_output(stride=mystride)

/home/mi/gph82/programs/PyEmma/pyemma/coordinates/transform/transformer.pyc in get_output(self, dimensions, stride)
    630                                'getting output of ' + self.__class__.__name__)
    631 
--> 632         for itraj, chunk in self.iterator(stride=stride):
    633             if itraj != last_itraj:
    634                 last_itraj = itraj

/home/mi/gph82/programs/PyEmma/pyemma/coordinates/transform/transformer.pyc in next(self)
     60         last_itraj = self._transformer._itraj
     61         if self._lag == 0:
---> 62             X = self._transformer._next_chunk(lag=self._lag, stride=self._stride)
     63             return (last_itraj, X)
     64         else:

/home/mi/gph82/programs/PyEmma/pyemma/coordinates/data/feature_reader.pyc in _next_chunk(self, lag, stride)
    166         :return: a feature mapped vector X, or (X, Y) if lag > 0
    167         """
--> 168         chunk = self._mditer.next()
    169         shape = chunk.xyz.shape
    170 

/home/mi/gph82/programs/PyEmma/pyemma/coordinates/util/patches.pyc in iterload(filename, chunk, **kwargs)
    113             while True:
    114                 if extension not in _TOPOLOGY_EXTS:
--> 115                     traj = f.read_as_traj(topology, n_frames=chunk*stride, stride=stride, atom_indices=atom_indices, **kwargs)
    116                 else:
    117                     traj = f.read_as_traj(n_frames=chunk*stride, stride=stride, atom_indices=atom_indices, **kwargs)

xtc.pyx in xtc.XTCTrajectoryFile.read_as_traj (mdtraj/formats/xtc/xtc.c:3584)()

xtc.pyx in xtc.XTCTrajectoryFile.read (mdtraj/formats/xtc/xtc.c:4134)()

xtc.pyx in xtc.XTCTrajectoryFile._read (mdtraj/formats/xtc/xtc.c:5552)()

MemoryError: 

Perhaps I'm wrong, but 4124*1./1024 .ca 4 KB, right?

gph82 commented 9 years ago

Is this an mdtraj thing? Nope, but mdtraj notices it:

By reading "n_frames=chunk * stride", the actual chunksize parsed to mdtraj gets bigger, with larger strides. Kind of counter-intuitive, since as the user is not aware of this, he may drive the stride up trying to alleviate memory problems, while effectively creating them. As a matter of fact,

D = my_SOURCE.get_output(stride=10)

works fine

rmcgibbo commented 9 years ago

Iterload is designed to yield the entire trajectory chunk by chunk. If you want only every nth frame from the trajectory, you're much better off getting the file handle, reading a frame, seeking forward, and then repeating.

handle = md.open('trajectory.xtc')
for _ in range(10):
    yield handle.read_as_traj(topology, n_frames=1)   # get a single frame
    handle.seek(100, whence=1)     # advance 100 frames forward in the trajectory
gph82 commented 9 years ago

Thanks @rmcgibbo. I noticed only afterwards it was not mdtraj. It's really the fact that larger strides result in mdtraj trying to load larger chunks (which is a problem of ours, courtesy of chunk*stride). Guess we'll implement your suggestion soon.

rmcgibbo commented 9 years ago

np

fabian-paul commented 9 years ago

Hmm, @rmcgibbo I don't want to be nitpicking, but shouldn't the stride argument of iterload just work as Guillermo was expecting it naively (principle of least astonishment)? If I understand you correctly, you are telling us, that we should implement our own optimized version of iterload if we want it to be memory efficient. IMHO this optimization should be part of mdtraj. Would you accept a pull request if we were to implement this directly in mdtraj?

rmcgibbo commented 9 years ago

Sorry, I didn't mean to come off too strong. It would clearly be better if mdtraj behaved better in this situation. Extremely long strides (relative to memory) weren't really on my mind when writing the code, which is why it's not optimized for this use case (yet). See e.g. https://github.com/mdtraj/mdtraj/blob/master/mdtraj/formats/xtc/xtc.pyx#L351-L359, where you can tell _read just reads everything into the buffer, which is sliced aferwards.

Would you accept a pull request if we were to implement this directly in mdtraj?

Yeah, absolutely.

rmcgibbo commented 9 years ago

I can also help write it.

franknoe commented 9 years ago

Just a high-level comment: I would love to see if we start contributing things to mdtraj. We are using mdtraj extensively and it would be very useful if one or two people in my lab had the expertise needed to tweak and extend it if we run into and limitations.

Am 04/08/15 um 19:23 schrieb Robert T. McGibbon:

Sorry, I didn't mean to come off too strong. It would clearly be better if mdtraj behaved better in this situation. Extremely long strides (relative to memory) we're really on my mind when writing the code, which is why it's not optimized for this use case (yet). See e.g. https://github.com/mdtraj/mdtraj/blob/master/mdtraj/formats/xtc/xtc.pyx#L351-L359, where you can tell |_read| just reads everything into the buffer, which is sliced aferwards.

Would you accept a pull request if we were to implement this
directly in mdtraj?

Yeah, absolutely.

— Reply to this email directly or view it on GitHub https://github.com/markovmodel/PyEMMA/issues/480#issuecomment-127683679.


Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

fabian-paul commented 9 years ago

This week I discussed with @clonker about a related change to mdtraj. We were thinking about implementing index files for the xtc format and making xdrlib's ftell/fseek accessible to mdtraj. (Which I think is possible.) So there are some things that we could do together.

gph82 commented 9 years ago

I would love to see random access implemented for xtcs, but AFAIK building the index might be more time consuming than just streaming in certain situations, or?

clonker commented 9 years ago

You don't have to build the whole index at once, just up to the point you need it and then later resume. Like that you won't loose (much) time compared to streaming in the first pass and be faster about it in subsequent passes.

rmcgibbo commented 9 years ago

Different formats have different strengths and weaknesses. It would also be nice to add support for the new gromacs TNG format, which supports seeking as well as good compression, to mdtraj. Might strike a better balance for some use cases.

gph82 commented 8 years ago

Hi, any more thoughts on this? I'm running into the same problems of not being able to choose large strides because of -paradoxically- too large strides. So I am forced to lower strides, effectively slowing down my computation

gph82 commented 8 years ago

Now that we've started several discussion on chunking, efficiency etc: https://github.com/markovmodel/PyEMMA/issues/616 https://github.com/markovmodel/PyEMMA/issues/604 it might be a good moment to revisit this, I think

gph82 commented 8 years ago

Hi @franknoe , here are the "culprit" lines: https://github.com/markovmodel/PyEMMA/blob/devel/pyemma/coordinates/util/patches.py#L156 and https://github.com/markovmodel/PyEMMA/blob/devel/pyemma/coordinates/util/patches.py#L158

 if extension not in _TOPOLOGY_EXTS:
                    traj = f.read_as_traj(topology, n_frames=chunk*stride, stride=stride, atom_indices=atom_indices, **kwargs)
                else:
                    traj = f.read_as_traj(n_frames=chunk*stride, stride=stride, atom_indices=atom_indices, **kwargs)
franknoe commented 8 years ago

This needs to be discussed with @marscher or @fabian-paul, because I don't know that part of the code.

Am 09/11/15 um 19:09 schrieb Guillermo Pérez-Hernández:

Hi @franknoe https://github.com/franknoe , here are the "culprit" lines: https://github.com/markovmodel/PyEMMA/blob/devel/pyemma/coordinates/util/patches.py#L156 and https://github.com/markovmodel/PyEMMA/blob/devel/pyemma/coordinates/util/patches.py#L158

if extensionnot in _TOPOLOGY_EXTS: traj= f.read_as_traj(topology,n_frames=chunk_stride,stride=stride,atom_indices=atom_indices,__kwargs) else: traj= f.read_as_traj(n_frames=chunk_stride,stride=stride,atom_indices=atom_indices,**kwargs)

— Reply to this email directly or view it on GitHub https://github.com/markovmodel/PyEMMA/issues/480#issuecomment-155143188.


Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354 Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany

marscher commented 8 years ago

This is still an issue. We really should use the approach which Robert pointed out. We need to collected all the frames via seek or just pass the desired indices to mdtraj.load (which will handle this efficiently via seeking in the future).

marscher commented 7 years ago

Random access patterns are already in place for the iterload wrapper, so we just need to generate the RA indices to access the file for very large strides to avoid the chunksize*stride > memory limitation. Since I also implemented fast seeking for xtc and most formats are seekable, this should be easy to add.