prody / ProDy

A Python Package for Protein Dynamics Analysis
http://prody.csb.pitt.edu
Other
434 stars 157 forks source link

[Ensemble Analysis -Heterogeneous X-ray Structures-PCA calculations] ValueError: array must not contain infs or NaNs #1719

Closed Lyueyang2020 closed 1 year ago

Lyueyang2020 commented 1 year ago

Hello, I am new to using ProDy so I apologize if this is a trivial question. When I use this module [Ensemble Analysis -Heterogeneous X-ray Structures-PCA calculations], the following problems occurred:

ValueError: array must not contain infs or NaNs

Detailed error is:

ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_41096\3647891085.py in 
      5 pca.buildCovariance(ensemble)
----> 7 pca.calcModes()

…site-packages\prody\dynamics\pca.py in calcModes(self, n_modes, turbo)
    215 
    216         values, vectors, _ = solveEig(self._cov, n_modes=n_modes, zeros=True, 
--> 217                                       turbo=turbo, reverse=True)
    218         which = values > ZERO
    219         self._eigvals = values[which]

…site-packages\prody\utilities\eigtools.py in solveEig(M, n_modes, zeros, turbo, expct_n_zeros, reverse)
     86         return n_zeros
     87 
---> 88     values, vectors = _eigh(M, eigvals, turbo)
     89     n_zeros = sum(values < ZERO)
     90 

…site-packages\prody\utilities\eigtools.py in _eigh(M, eigvals, turbo)
     39                 turbo = False
     40             if not issparse(M):
---> 41                 values, vectors = linalg.eigh(M, turbo=turbo, eigvals=eigvals)
...
--> 489             "array must not contain infs or NaNs")
    490     return a
    491 

ValueError: array must not contain infs or NaNs

Very sorry I don’t have a programming background, so I don’t have the ability to solve this problem. Thank you very much for your answer.

jamesmkrieger commented 1 year ago

Could you please provide the code you used to help us solve this. Also, which version of prody are you using?

many thanks

Lyueyang2020 commented 1 year ago

ProDy 2.4.0 Python 3.7.9 matplotlib 3.5.3

from prody import *
from pylab import *
ion()
pdbids = ['1AO6','1BJ5','1BKE','1BM0','1E78','1E7A','1E7B','1E7C','1E7E',
          '1E7F','1E7G','1E7H','1E7I','1GNI','1GNJ','1H9Z','1HA2','1HK1'...]
pdbfiles = fetchPDB(*pdbids, compressed=False)

ref_structure = parsePDB('1ao6.pdb')
ref_selection = ref_structure.select('resnum 5 to 582')

ref_chain = ref_selection.getHierView().getChain('A')
repr(ref_chain)

startLogfile('hsa_pca')
ensemble = PDBEnsemble('hsa X-ray')
ensemble.setAtoms(ref_chain)
ensemble.setCoords(ref_chain)
for pdbfile in pdbfiles:

     structure = parsePDB(pdbfile, subset='calpha')

     mappings = mapOntoChain(structure, ref_chain)
     atommap = mappings[0][0]

     ensemble.addCoordset(atommap, weights=atommap.getFlags('mapped'))

repr(ensemble)
len(ensemble) == len(pdbfiles)

ensemble.iterpose()

closeLogfile('hsa_pca')

writePDB('hsa_xray_ensemble.pdb', ensemble)

pca = PCA('hsa xray')
pca.buildCovariance(ensemble)
pca.calcModes()

Thank you very much for your reply. The above is the content learned from the tutorial.

jamesmkrieger commented 1 year ago

Does your ensemble pdb look right?

Lyueyang2020 commented 1 year ago

You are right. The ensemble pdb file is abnormal. I tried the web tutorial and it worked (add chain= 'A', but this has limitations, not all target protein structures are chain A). But I still don't know what the problem is. Thanks for your reply!

#1.1 
from prody import *
from pylab import *
ion()
pdbids = ['1AO6','1BJ5','1BKE','1BM0','1E78','1E7A','1E7B','1E7C','1E7E','1E7F',
          '1E7G','1E7H','1E7I','1GNI','1GNJ','1H9Z','1HA2','1HK1','1HK2','1HK3',...]
structures = parsePDB(pdbids, **chain='A'**,subset='ca', compressed=False)
#1.2 Set reference chain
ref_structure = parsePDB('1ao6.pdb')
ref_selection = ref_structure.select('resnum 5 to 582')
ref_chain = ref_selection.getHierView().getChain('A')
repr(ref_chain)
#1.3 Prepare ensemble
ensemble = buildPDBEnsemble(structures, title='hsa X-ray')
Ensemble
ensemble.iterpose()
writePDB('hsa_xray_ensemble.pdb', ensemble)
#PCA 
pca = PCA('hsa xray') 
pca.buildCovariance(ensemble)
pca.calcModes() 
jamesmkrieger commented 1 year ago

You may want to try changing the mapping function, seqid, overlap or rmsd_reject.

sorry we haven’t had a chance to try running it and see what happens ourselves

jamesmkrieger commented 1 year ago

You can also try using scipion-em-prody to run all these things with a graphical user interface in the scipion workflow engine. It can be a bit of a pain to install but it’s not so bad if you just want it for this.

jamesmkrieger commented 1 year ago

You may also just want to try using the development version of ProDy from GitHub rather than 2.4.0 from pip as we have fixed some things

jamesmkrieger commented 1 year ago

ProDy 2.4.0

Python 3.7.9 matplotlib 3.5.3

from prody import *
from pylab import *
ion()
pdbids = ['1AO6','1BJ5','1BKE','1BM0','1E78','1E7A','1E7B','1E7C','1E7E',
          '1E7F','1E7G','1E7H','1E7I','1GNI','1GNJ','1H9Z','1HA2','1HK1'...]
pdbfiles = fetchPDB(*pdbids, compressed=False)

ref_structure = parsePDB('1ao6.pdb')
ref_selection = ref_structure.select('resnum 5 to 582')

ref_chain = ref_selection.getHierView().getChain('A')
repr(ref_chain)

startLogfile('hsa_pca')
ensemble = PDBEnsemble('hsa X-ray')
ensemble.setAtoms(ref_chain)
ensemble.setCoords(ref_chain)
for pdbfile in pdbfiles:

     structure = parsePDB(pdbfile, subset='calpha')

     mappings = mapOntoChain(structure, ref_chain)
     atommap = mappings[0][0]

     ensemble.addCoordset(atommap, weights=atommap.getFlags('mapped'))

repr(ensemble)
len(ensemble) == len(pdbfiles)

ensemble.iterpose()

closeLogfile('hsa_pca')

writePDB('hsa_xray_ensemble.pdb', ensemble)

pca = PCA('hsa xray')
pca.buildCovariance(ensemble)
pca.calcModes()

Thank you very much for your reply. The above is the content learned from the tutorial.

Removing the ... from the end of the pdb list and the ** around chain A, this worked fine for me

Removing the chain A, it says that the ensemble must have more than 2 coordinate sets for PCA. I'm getting warnings in the ensemble building that the structures have fewer chains than 1AO6_ca. I think one problem may be that you're not including your ref_chain as a reference

jamesmkrieger commented 1 year ago

Using the following code, I get the error about the covariance matrix containing NaN:

In [14]: from prody import *
    ...: from pylab import *
    ...: ion()
    ...: pdbids = ['1AO6','1BJ5','1BKE','1BM0','1E78','1E7A','1E7B','1E7C','1E7E','1E7F',
    ...:           '1E7G','1E7H','1E7I','1GNI','1GNJ','1H9Z','1HA2','1HK1','1HK2','1HK3']
    ...: structures = parsePDB(pdbids, subset='ca', compressed=False)
    ...: #1.2 Set reference chain
    ...: ref_structure = parsePDB('1ao6.pdb')
    ...: ref_selection = ref_structure.select('resnum 5 to 582')
    ...: ref_chain = ref_selection.getHierView().getChain('A')
    ...: repr(ref_chain)
    ...: #1.3 Prepare ensemble
    ...: ensemble = buildPDBEnsemble(structures, title='hsa X-ray', ref=ref_chain)
    ...: Ensemble
    ...: ensemble.iterpose()
    ...: writePDB('hsa_xray_ensemble.pdb', ensemble)
    ...: #PCA 
    ...: pca = PCA('hsa xray')
    ...: pca.buildCovariance(ensemble)
    ...: pca.calcModes()

The ensemble has many atoms besides the CA atoms that most of the structures have and that's what the problem.

jamesmkrieger commented 1 year ago

Using this code with subset ca for the reference structure too fixes it:

In [22]: from prody import *
    ...: from pylab import *
    ...: ion()
    ...: pdbids = ['1AO6','1BJ5','1BKE','1BM0','1E78','1E7A','1E7B','1E7C','1E7E','1E7F',
    ...:           '1E7G','1E7H','1E7I','1GNI','1GNJ','1H9Z','1HA2','1HK1','1HK2','1HK3']
    ...: structures = parsePDB(pdbids, subset='ca', compressed=False)
    ...: #1.2 Set reference chain
    ...: ref_structure = parsePDB('1ao6.pdb', subset='ca')
    ...: ref_selection = ref_structure.select('resnum 5 to 582')
    ...: ref_chain = ref_selection.getHierView().getChain('A')
    ...: repr(ref_chain)
    ...: #1.3 Prepare ensemble
    ...: ensemble = buildPDBEnsemble(structures, title='hsa X-ray', ref=ref_chain)
    ...: Ensemble
    ...: ensemble.iterpose()
    ...: writePDB('hsa_xray_ensemble.pdb', ensemble)
    ...: #PCA 
    ...: pca = PCA('hsa xray')
    ...: pca.buildCovariance(ensemble)
    ...: pca.calcModes()
jamesmkrieger commented 1 year ago

To reduce the risk of mistakes and to avoid having the reference structure included in the ensemble twice, you could also use the reference structure that has already been parsed rather than parsing it again.

You can also use a .copy() on the ref_chain to not include a selection for the reference because that can make problems with using setAtoms() later.

In [39]: from prody import *
    ...: from pylab import *
    ...: ion()
    ...: pdbids = ['1AO6','1BJ5','1BKE','1BM0','1E78','1E7A','1E7B','1E7C','1E7E','1E7F',
    ...:           '1E7G','1E7H','1E7I','1GNI','1GNJ','1H9Z','1HA2','1HK1','1HK2','1HK3']
    ...: structures = parsePDB(pdbids, subset='ca', compressed=False)
    ...: #1.2 Set reference chain
    ...: ref_structure = structures[0]
    ...: ref_selection = ref_structure.select('resnum 5 to 582')
    ...: ref_chain = ref_selection.getHierView().getChain('A').copy()
    ...: repr(ref_chain)
    ...: #1.3 Prepare ensemble
    ...: ensemble = buildPDBEnsemble(structures, title='hsa X-ray', ref=ref_chain)
    ...: Ensemble
    ...: ensemble.iterpose()
    ...: writePDB('hsa_xray_ensemble.pdb', ensemble)
    ...: #PCA 
    ...: pca = PCA('hsa xray')
    ...: pca.buildCovariance(ensemble)
    ...: pca.calcModes()
jamesmkrieger commented 1 year ago

I'd say this issue is solved then too. Please feel free to reopen if you disagree and/or need more help.