markovmodel / PyEMMA

🚂 Python API for Emma's Markov Model Algorithms 🚂
http://pyemma.org
GNU Lesser General Public License v3.0

Could not allocate enough memory to map all data #1528

Closed · rubinanoor9 closed this issue 2 years ago

rubinanoor9 commented 2 years ago

Hello, I am trying to compute the VAMP score for different features. When I add features based on the heavy-atom distances in the protein, I get the following error:

```
pyemma.coordinates.data.feature_reader.FeatureReader[60] ERROR Could not allocate enough memory to map all data. Consider using a larger stride.
Traceback (most recent call last):
  File "/home/reaz/miniconda3/envs/pyemma/lib/python3.9/site-packages/pyemma/coordinates/data/_base/datasource.py", line 386, in get_output
    trajs = [np.full((l, ndim), np.nan, dtype=self.output_type()) for l in it.trajectory_lengths()]
  File "/home/reaz/miniconda3/envs/pyemma/lib/python3.9/site-packages/pyemma/coordinates/data/_base/datasource.py", line 386, in <listcomp>
    trajs = [np.full((l, ndim), np.nan, dtype=self.output_type()) for l in it.trajectory_lengths()]
  File "/home/reaz/miniconda3/envs/pyemma/lib/python3.9/site-packages/numpy/core/numeric.py", line 343, in full
    a = empty(shape, dtype, order)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 7.99 GiB for an array with shape (500, 4290985) and data type float32
```

I am using these commands:

```python
feat = pyemma.coordinates.featurizer(pdb)
heavy_atom_distance_pairs = feat.pairs(feat.select_Heavy())
feat.add_distances(heavy_atom_distance_pairs, periodic=False)
reader = pyemma.coordinates.source(files, features=feat)
data_output = reader.get_output(stride=5)
```

Increasing the stride has not resolved the problem. I also get a similar error when I try to add distance features for all Cα (alpha carbon) atoms. I am looking forward to hearing from you. Thanks in advance.
Regards,
Rubina

clonker commented 2 years ago

Hi, based on the features you selected, the resulting number of feature dimensions is 4,290,985. As the error message says, this leads to an allocation of roughly 8 GiB for just 500 frames (500 × 4,290,985 float32 values at 4 bytes each) and makes your computer run out of memory very quickly. I suggest picking different features so that the number of output dimensions is lower (for example not all pairs of heavy atoms but only a subset).
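As a sketch of that suggestion (assuming the same `pdb` as in the snippet above; the particular selection and neighbor exclusion are only illustrative, not a recommendation for this system):

```python
import pyemma

feat = pyemma.coordinates.featurizer(pdb)

# 4,290,985 = n * (n - 1) / 2 for n = 2930 heavy atoms: all-pairs
# distances grow quadratically with the selection size, so a smaller
# atom selection shrinks the feature space dramatically.
ca_atoms = feat.select_Ca()                            # alpha carbons only
ca_pairs = feat.pairs(ca_atoms, excluded_neighbors=2)  # drop trivially close pairs
feat.add_distances(ca_pairs, periodic=False)

print(feat.dimension())  # number of output dimensions, known before loading data
```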

thempel commented 2 years ago

Depending on how large your protein is, taking distances between all heavy atoms yields a very high-dimensional space; however, you may imagine that this space also contains a lot of redundant information. As a possible alternative, you may want to consider distances between C-alpha atoms or minimal distances between residues. I find the latter a good way to include side-chain information without bloating the feature space. You can add them with feat.add_residue_mindist(), cf. the docs.
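A minimal sketch of that alternative, assuming the same `pdb` and `files` as above (`scheme='closest-heavy'` is the featurizer's default contact scheme, spelled out here for clarity):

```python
import pyemma

feat = pyemma.coordinates.featurizer(pdb)

# One feature per residue pair: the minimum distance between the two
# residues' heavy atoms, which captures side-chain contacts in far
# fewer dimensions than all heavy-atom pair distances.
feat.add_residue_mindist(scheme='closest-heavy', periodic=False)

reader = pyemma.coordinates.source(files, features=feat)
```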

You can check the number of feature dimensions (before loading any data) with feat.dimension(). In practice, I've mostly seen this number in the hundreds to thousands, at most around 10,000; but that's just my experience.
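For a rough sanity check before calling get_output, the expected allocation can be estimated from the feature dimension and the strided trajectory lengths; a sketch, assuming the `feat` and `reader` objects from the snippets above:

```python
n_dims = feat.dimension()
n_frames = sum(reader.trajectory_lengths(stride=5))

# get_output fills float32 arrays, i.e. 4 bytes per value
est_gib = n_frames * n_dims * 4 / 1024**3
print(f'{n_dims} dims x {n_frames} frames -> ~{est_gib:.2f} GiB')
```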

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.