Closed gph82 closed 7 years ago
Is this an mdtraj thing? Nope, but mdtraj notices it:
Reading "n_frames=chunk * stride" shows that the actual chunk size passed to mdtraj gets bigger with larger strides. Kind of counter-intuitive: since the user is not aware of this, they may drive the stride up trying to alleviate memory problems, while effectively creating them. As a matter of fact,
D = my_SOURCE.get_output(stride=10)
works fine
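To make the scaling concrete, here is a back-of-the-envelope sketch (all numbers are hypothetical, not taken from the actual trajectory): with n_frames=chunk*stride, the raw read grows linearly with the stride, even though only chunk frames are kept.

```python
chunk = 1000                          # output frames wanted per chunk
n_atoms = 10_000                      # hypothetical system size
bytes_per_frame = n_atoms * 3 * 4     # xyz coordinates as float32

def raw_bytes_per_chunk(stride):
    # n_frames=chunk*stride means chunk*stride raw frames are read per
    # chunk, so the buffer grows linearly with the stride
    return chunk * stride * bytes_per_frame

print(raw_bytes_per_chunk(1) // 2**20)    # 114 MiB
print(raw_bytes_per_chunk(100) // 2**20)  # 11444 MiB
```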
iterload is designed to yield the entire trajectory chunk by chunk. If you want only every n-th frame from the trajectory, you're much better off getting the file handle, reading a frame, seeking forward, and then repeating.
handle = md.open('trajectory.xtc')
for _ in range(10):
    yield handle.read_as_traj(topology, n_frames=1)  # get a single frame
    handle.seek(100, whence=1)  # advance 100 frames forward in the trajectory
Thanks @rmcgibbo. I noticed only afterwards it was not mdtraj. It's really the fact that larger strides result in mdtraj trying to load larger chunks (which is a problem of ours, courtesy of chunk*stride). Guess we'll implement your suggestion soon.
np
Hmm, @rmcgibbo, I don't want to nitpick, but shouldn't the stride argument of iterload just work the way Guillermo naively expected (principle of least astonishment)? If I understand you correctly, you are telling us that we should implement our own optimized version of iterload if we want it to be memory-efficient. IMHO this optimization should be part of mdtraj. Would you accept a pull request if we were to implement this directly in mdtraj?
Sorry, I didn't mean to come off too strong. It would clearly be better if mdtraj behaved better in this situation. Extremely long strides (relative to memory) weren't really on my mind when writing the code, which is why it's not optimized for this use case (yet). See e.g. https://github.com/mdtraj/mdtraj/blob/master/mdtraj/formats/xtc/xtc.pyx#L351-L359, where you can tell _read just reads everything into the buffer, which is sliced afterwards.
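As a toy illustration of the difference (a numpy stand-in for the two access patterns, not mdtraj's actual I/O code):

```python
import numpy as np

data = np.arange(1_000_000, dtype=np.float32)  # stand-in for frames on disk

def read_then_slice(stride):
    # what the linked _read does: materialize the full buffer, then slice
    buf = data.copy()                 # peak memory: the whole buffer
    return buf[::stride]

def seek_and_read(stride):
    # seek-based alternative: only touch the frames that are kept
    return data[np.arange(0, data.size, stride)]  # peak memory ~ size/stride

assert np.array_equal(read_then_slice(100), seek_and_read(100))
```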
Would you accept a pull request if we were to implement this directly in mdtraj?
Yeah, absolutely.
I can also help write it.
Just a high-level comment: I would love to see us start contributing things to mdtraj. We use mdtraj extensively, and it would be very useful if one or two people in my lab had the expertise needed to tweak and extend it if we run into any limitations.
Prof. Dr. Frank Noe Head of Computational Molecular Biology group Freie Universitaet Berlin
Phone: (+49) (0)30 838 75354 Web: research.franknoe.de
This week I discussed with @clonker about a related change to mdtraj. We were thinking about implementing index files for the xtc format and making xdrlib's ftell/fseek accessible to mdtraj. (Which I think is possible.) So there are some things that we could do together.
I would love to see random access implemented for xtcs, but AFAIK building the index might be more time-consuming than just streaming in certain situations, no?
You don't have to build the whole index at once, just up to the point you need it, and then resume later. That way you won't lose (much) time compared to streaming in the first pass, and you'll be faster in subsequent passes.
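The lazy-index idea can be sketched with a toy file of variable-size records (stand-ins for compressed xtc frames, whose byte offsets cannot be computed up front); the index grows only as far as each access needs, and all of it is reused on later passes. This is an illustration of the approach, not PyEMMA or mdtraj code.

```python
import io
import struct

# Toy "trajectory" with variable-size frames, each prefixed by its length,
# so frame offsets must be discovered by scanning.
buf = io.BytesIO()
for i in range(500):
    payload = bytes(range(i % 7 + 1))                # variable-length frame
    buf.write(struct.pack("<I", len(payload)) + payload)

offsets = [0]  # lazily grown index: byte offset of each frame header

def read_frame(i):
    # extend the index only up to frame i, then seek straight to it;
    # later calls reuse everything indexed so far
    while len(offsets) <= i:
        buf.seek(offsets[-1])
        (size,) = struct.unpack("<I", buf.read(4))
        offsets.append(offsets[-1] + 4 + size)
    buf.seek(offsets[i])
    (size,) = struct.unpack("<I", buf.read(4))
    return buf.read(size)

assert read_frame(100) == bytes(range(100 % 7 + 1))
```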
Different formats have different strengths and weaknesses. It would also be nice to add support for the new gromacs TNG format, which supports seeking as well as good compression, to mdtraj. Might strike a better balance for some use cases.
Hi, any more thoughts on this? I'm running into the same problem: paradoxically, choosing a large stride blows up the chunks read into memory, so I am forced to lower the stride, effectively slowing down my computation.
Now that we've started several discussions on chunking, efficiency, etc. (https://github.com/markovmodel/PyEMMA/issues/616, https://github.com/markovmodel/PyEMMA/issues/604), it might be a good moment to revisit this, I think.
Hi @franknoe , here are the "culprit" lines: https://github.com/markovmodel/PyEMMA/blob/devel/pyemma/coordinates/util/patches.py#L156 and https://github.com/markovmodel/PyEMMA/blob/devel/pyemma/coordinates/util/patches.py#L158
if extension not in _TOPOLOGY_EXTS:
    traj = f.read_as_traj(topology, n_frames=chunk*stride, stride=stride, atom_indices=atom_indices, **kwargs)
else:
    traj = f.read_as_traj(n_frames=chunk*stride, stride=stride, atom_indices=atom_indices, **kwargs)
This needs to be discussed with @marscher or @fabian-paul, because I don't know that part of the code.
This is still an issue. We really should use the approach which Robert pointed out: either collect all the frames via seek, or just pass the desired indices to mdtraj.load (which will handle this efficiently via seeking in the future).
Random access patterns are already in place for the iterload wrapper, so we just need to generate the RA indices to access the file for very large strides to avoid the chunksize*stride > memory limitation. Since I also implemented fast seeking for xtc and most formats are seekable, this should be easy to add.
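Generating those RA indices is cheap; here is a sketch (the function name and chunking policy are ours, not existing PyEMMA API):

```python
import numpy as np

def ra_index_chunks(n_frames, stride, chunk):
    # indices of the frames to keep, split into memory-bounded batches;
    # each batch can be handed to a seek-capable reader instead of asking
    # for chunk*stride contiguous frames at once
    idx = np.arange(0, n_frames, stride)
    return [idx[i:i + chunk] for i in range(0, idx.size, chunk)]

chunks = ra_index_chunks(n_frames=100, stride=10, chunk=4)
print([c.tolist() for c in chunks])  # [[0, 10, 20, 30], [40, 50, 60, 70], [80, 90]]
```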
I'm aware that the entire transformation is just too huge for memory, but with a (ridiculously) high enough stride I should be able to get 4124 frames x 57 dimensions, right?
However:
Perhaps I'm wrong, but 4124*1./1024 ≈ 4 KB, right?
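The arithmetic above seems to count only the frames; including all 57 dimensions and assuming float32 (4 bytes per value), the output is still well under a megabyte, so the result itself is tiny either way:

```python
n_frames, n_dims = 4124, 57        # counts from the comment above
size_bytes = n_frames * n_dims * 4  # assuming a float32 output array
print(round(size_bytes / 1024))     # ~918 KB
```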