Closed. s-gordon closed this issue 11 years ago.
The .lh5 files are corrupted somehow -- the error is coming from libhdf5, which is a dependency of a dependency of msmbuilder.
HDF5-DIAG: Error detected in HDF5 (1.8.4-patch1) thread 139968538093312:
#000: ../../../src/H5Dio.c line 153 in H5Dread(): selection+offset not within extent
major: Dataspace
minor: Out of range
Do you have the command line program `h5ls`? (If you're using Ubuntu, you can install it with `sudo apt-get install hdf5-tools`. If you have Enthought Python, it should be installed by default.) `h5ls` is basically `ls` for the HDF5 format, so it should show you what data is in the file.
If so, can you try running
$ h5ls Trajectories/trj0.lh5
You can also try running this on one of the trajectories from the tutorial, and compare the output.
Another option is to try to do the conversion again. There could be a problem in your trajconv/catdcd pipeline. You might try using MDTraj's mdconvert: see http://rmcgibbo.github.io/mdtraj/ for details.
-Robert
On Sun, Jun 30, 2013 at 6:50 PM, gordo1 notifications@github.com wrote:
Hi all,

After completing the MSMBuilder2 tutorials, I tried to apply the same steps to my own data set of trajectories. After converting these frames to XTC (I was having issues with the DCD reader) using a combination of catdcd and Gromacs's trjconv tools, I converted my trajectories into .lh5 files using the ConvertDataToHDF.py script, which appeared to complete normally. Next, I tried to cluster the data set using the command:

python2.7 ../scripts/Cluster.py rmsd hybrid -d 0.045 -l 50
...which worked for the Tutorial data set previously. This spits out the following error log:
MSMBuilder version 2.6.0.dev-Unknown
See file AUTHORS for a list of MSMBuilder contributors.
Copyright 2011 Stanford University. MSMBuilder comes with ABSOLUTELY NO WARRANTY. MSMBuilder is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
Please cite the following references: GR Bowman, X Huang, and VS Pande. Methods 2009. Using generalized ensemble simulations and Markov state models to identify conformational states. KA Beauchamp, GR Bowman, TJ Lane, L Maibaum, IS Haque, VS Pande. JCTC 2011. MSMBuilder2: Modeling Conformational Dynamics at the Picosecond to Millisecond Timescale IS Haque, KA Beauchamp, VS Pande. In preparation.
A Fast 3 x N Matrix Multiply Routine for Calculation of Protein RMSD.
{'alg': 'hybrid', 'hybrid_distance_cutoff': 0.045, 'hybrid_global_iters': 0, 'hybrid_ignore_max_objective': False, 'hybrid_local_num_iters': 50, 'hybrid_num_clusters': None, 'hybrid_too_close_cutoff': 0.0001, 'metric': 'rmsd', 'output_dir': 'Data/', 'project': 'ProjectInfo.yaml', 'quiet': False, 'rmsd_atom_indices': 'AtomIndices.dat', 'stride': 1}
11:39:03 - RMSD metric - loading only the atom indices required
HDF5-DIAG: Error detected in HDF5 (1.8.4-patch1) thread 139968538093312:
  #000: ../../../src/H5Dio.c line 153 in H5Dread(): selection+offset not within extent
    major: Dataspace
    minor: Out of range
Traceback (most recent call last):
  File "../scripts/Cluster.py", line 228, in <module>
    main(args, metric)
  File "../scripts/Cluster.py", line 203, in main
    trajs = load_trajectories(args.project, args.stride, atom_indices)
  File "../scripts/Cluster.py", line 121, in load_trajectories
    traj = project.load_traj(i, stride=stride, atom_indices=atom_indices)
  File "/usr/local/lib/python2.7/dist-packages/msmbuilder-2.6.0-py2.7-linux-x86_64.egg/msmbuilder/project/project.py", line 340, in load_traj
    AtomIndices=atom_indices)
  File "/usr/local/lib/python2.7/dist-packages/msmbuilder-2.6.0-py2.7-linux-x86_64.egg/msmbuilder/Trajectory.py", line 722, in load_trajectory_file
    return Trajectory.load_from_lhdf(Filename, JustInspect=JustInspect, Stride=Stride, AtomIndices=AtomIndices)
  File "/usr/local/lib/python2.7/dist-packages/msmbuilder-2.6.0-py2.7-linux-x86_64.egg/msmbuilder/Trajectory.py", line 627, in load_from_lhdf
    A = cls.load_from_hdf(TrajFilename, Stride=Stride, AtomIndices=AtomIndices)
  File "/usr/local/lib/python2.7/dist-packages/msmbuilder-2.6.0-py2.7-linux-x86_64.egg/msmbuilder/Trajectory.py", line 596, in load_from_hdf
    chunk_list = list(cls.enum_chunks_from_hdf(TrajFilename, Stride=Stride, AtomIndices=AtomIndices, ChunkSize=ChunkSize))
  File "/usr/local/lib/python2.7/dist-packages/msmbuilder-2.6.0-py2.7-linux-x86_64.egg/msmbuilder/Trajectory.py", line 507, in enum_chunks_from_hdf
    A['AtomID'] = np.array(F.root.AtomID[AtomIndices], dtype=np.int32)
  File "/usr/lib/python2.7/dist-packages/tables/array.py", line 689, in __getitem__
    arr = self._readCoords(coords)
  File "/usr/lib/python2.7/dist-packages/tables/array.py", line 792, in _readCoords
    self._g_readCoords(coords, nparr)
  File "hdf5Extension.pyx", line 1134, in tables.hdf5Extension.Array._g_readCoords (tables/hdf5Extension.c:9869)
tables.exceptions.HDF5ExtError: Problems reading the array data.
Closing remaining open files: /home/ /Downloads/msmbuilder-2.6.0_3/Tutorial/Trajectories/trj0.lh5... done

At the conclusion of this, no files are created in the ./Data directory.
I've farmed through the relevant scripts to try and diagnose what the issue might be, but nothing screams out. The trajectories involve only a small molecule (9 atoms), and the total data set is equivalent to roughly 50 Mb. I'd greatly appreciate it if anyone can help me figure out what the issue might be. Cheers.
Reply to this email directly or view it on GitHub: https://github.com/SimTk/msmbuilder/issues/217
It's very puzzling that you are able to convert the XTC files that come with MSMBuilder, but you are not able to convert the XTC files that your pipeline generates. Are you sure that your XTC files are properly formatted?
Thanks for the fast responses! rmcgibbo - I've just installed the hdf5-tools package and successfully tested h5ls as you described. The output using the tutorial data set is as follows:
AtomID        Dataset {22/8192}
AtomNames     Dataset {22/16384}
ChainID       Dataset {22/65536}
ResidueID     Dataset {22/8192}
ResidueNames  Dataset {22/16384}
XYZList       Dataset {501, 22, 3}
When applying this to my own data set, I get the following output:
AtomID        Dataset {9/8192}
AtomNames     Dataset {9/32768}
ChainID       Dataset {9/65536}
ResidueID     Dataset {9/8192}
ResidueNames  Dataset {9/16384}
XYZList       Dataset {5051, 9, 3}
...which matches up pretty well with what I got with the tutorial trajectory files.
MDTraj was my next point of reference. I'll give it a go and report back when I've got the results.
Do you have a single trajectory or several? Could it be that one trajectory is somehow corrupted?
If so, it might make sense to try loading the trajectories in an interactive python session, one by one.
R = Trajectory.load_from_lhdf("./Trajectories/trj0.lh5") etc
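A throwaway sketch of that one-by-one check (`find_bad_trajectories` and `stub_load` are made up for illustration; in practice you would pass `Trajectory.load_from_lhdf` as the loader and glob the `Trajectories/` directory for the paths):

```python
def find_bad_trajectories(paths, load):
    """Attempt to load each trajectory; collect the ones that raise."""
    bad = []
    for path in sorted(paths):
        try:
            load(path)
        except Exception as err:
            bad.append((path, str(err)))
    return bad

# Stand-in loader for illustration; with msmbuilder you would pass
# Trajectory.load_from_lhdf and paths = glob("Trajectories/trj*.lh5").
def stub_load(path):
    if path == "trj3.lh5":
        raise IOError("selection+offset not within extent")

paths = ["trj0.lh5", "trj1.lh5", "trj2.lh5", "trj3.lh5"]
print(find_bad_trajectories(paths, stub_load))
# -> [('trj3.lh5', 'selection+offset not within extent')]
```

If one file shows up there and the others load fine, the problem is that trajectory rather than the pipeline as a whole.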
kyleabeauchamp - my thoughts exactly. At the moment I'm using catdcd to go from DCD -> TRR, then trjconv to go from TRR -> XTC.
I haven't had a look at the internals of the XTC files yet.
I'm just following up on what you've suggested in your second comment. Will report back soon.
No, I've figured it out. Wait two seconds...
The problem is that your `AtomIndices.dat` contains too many indices.
rmcgibbo@Roberts-MacBook-Pro-2 ~
$ cat test.py
import tables
import numpy as np
handle = tables.openFile('test.h5', 'w')
# save ten numbers to the file
handle.createArray(where='/', name='x', object=np.arange(10))
# read the numbers back out, but try to overread the buffer
indices_to_grab = np.arange(100)
handle.root.x[indices_to_grab]
rmcgibbo@Roberts-MacBook-Pro-2 ~
$ python test.py
HDF5-DIAG: Error detected in HDF5 (1.8.9) thread 0:
#000: H5Dio.c line 153 in H5Dread(): selection+offset not within extent
major: Dataspace
minor: Out of range
Traceback (most recent call last):
File "test.py", line 11, in <module>
handle.root.x[indices_to_grab]
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/tables/array.py", line 689, in __getitem__
arr = self._readCoords(coords)
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/tables/array.py", line 792, in _readCoords
self._g_readCoords(coords, nparr)
File "hdf5Extension.pyx", line 1134, in tables.hdf5Extension.Array._g_readCoords (tables/hdf5Extension.c:9869)
tables.exceptions.HDF5ExtError: Problems reading the array data.
Closing remaining open files: test.h5... done
I'm sort of surprised that pytables doesn't catch this exception in a nicer way and report it as an IndexError, but that's the same one that you reported. Presumably there are numbers greater than 8 in your `AtomIndices.dat` file?
Looks to be the case.
$ cat AtomIndices.dat
1
4
5
6
8
10
14
15
16
18
So the `AtomIndices.dat` file is supposed to list the (zero-based) indices of the atoms that you want to use in the RMSD computation. So if you have exchangeable atoms like methyl hydrogens that you want to discard, you wouldn't list them in that file. I'm not sure what your system is, but presumably if there are only 9 atoms you probably want to include them all? In that case, the file should just list the integers zero through eight.
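For a 9-atom system that would look like the sketch below. The `validate` helper is purely illustrative (not part of MSMBuilder), but it is the bounds check that would have caught this error before HDF5 did:

```python
n_atoms = 9  # the poster's small molecule

# AtomIndices.dat: one zero-based atom index per line.
with open("AtomIndices.dat", "w") as f:
    for i in range(n_atoms):
        f.write("%d\n" % i)

def validate(indices, n_atoms):
    """Return the indices that fall outside [0, n_atoms)."""
    return [i for i in indices if not 0 <= i < n_atoms]

# The file from the report, checked against a 9-atom trajectory:
print(validate([1, 4, 5, 6, 8, 10, 14, 15, 16, 18], 9))
# -> [10, 14, 15, 16, 18]
```

Any nonempty result from a check like this means pytables will be asked to read past the end of the `AtomID` array, which is exactly the `selection+offset not within extent` failure above.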
Thanks a million everyone. I've been struggling with this for a few days. Hard to believe it was something so simple.
I've amended AtomIndices.dat to reflect relevant atoms in my small molecule, and everything appears to be running smoothly now with clustering.
We will fix this error message in the next release to be more informative.
It's only easy because we live inside this codebase, so we know most of the failure modes. For the record, I checked mdtraj (which msmbuilder is going to use in the near future), and it gives an informative error message here.
Okay. I'm going to close this. Looks like the issue was resolved.
Hi all, I've been facing issues trying to use .h5 file datasets. I keep getting this error for the code:

ds = dataset('traj-0000.h5')
len(ds)
/anaconda3/envs/MDS/lib/python2.7/site-packages/tables/group.pyc in _g_check_has_child(self, name)
    396             raise NoSuchNodeError(
    397                 "group %s does not have a child named %s"
--> 398                 % (self._v_pathname, name))
    399         return node_type
    400

NoSuchNodeError: group / does not have a child named /arr_0
Even though the h5 file is in the same directory as the Jupyter notebook. Any help/fixes will be highly appreciated.
I think MSMBuilder doesn't account for the latest PyTables update. Here's a link to a [similar PyTables issue](https://github.com/PV-Lab/bayesim/issues/1).
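If the problem really is PyTables 3's rename of its camelCase API to PEP 8 names (`openFile` -> `open_file`, `createArray` -> `create_array`, and so on), one workaround is to resolve whichever name exists at runtime instead of pinning a PyTables version. `compat_attr` and the stand-in `legacy` object below are purely illustrative; with real PyTables you would pass the `tables` module and the names `"open_file"`/`"openFile"`:

```python
import types

def compat_attr(module, new_name, old_name):
    """Prefer the new-style attribute name, fall back to the legacy one.

    PyTables 3.x renamed its camelCase API to PEP 8 names, which breaks
    older code written against the 2.x spelling.
    """
    attr = getattr(module, new_name, None)
    if attr is None:
        attr = getattr(module, old_name)
    return attr

# Demo with a stand-in module exposing only the legacy spelling:
legacy = types.SimpleNamespace(openFile=lambda path, mode="r": ("opened", path))
open_file = compat_attr(legacy, "open_file", "openFile")
print(open_file("traj-0000.h5"))
# -> ('opened', 'traj-0000.h5')
```

Note that the `NoSuchNodeError` about `/arr_0` may instead mean the file simply wasn't written with an `arr_0` node; `h5ls traj-0000.h5` will show what nodes it actually contains.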