madscatt / zazzie

development branch
GNU General Public License v3.0
2 stars 3 forks source link

PDBRX default mode doesn't add some of missing residues #117

Closed cjeong73 closed 6 years ago

cjeong73 commented 6 years ago

PDBRX default mode adds missing residues if only the missing residues are located at the N terminal of first chain and at the C terminal of last chain.

See the attachment for the details. Missed_missing_residue_of_pdbrx_defaultmode.docx

madscatt commented 6 years ago

Isn't this part of #116

cjeong73 commented 6 years ago

The tests for 5E3L were done at sassie-jetstream terminal mode. The others (1012, 1022, 4F87) came from the previous runs on onsager without loop refine option because Sassie-jetsteam failed at structure builder (pyrosetta) stage for 1012, 1022, and 4F87 due to loop refine option.

cjeong73 commented 6 years ago

To be more clear, dev branch of github was installed in onsager.nist.gov and pdbrx_refactor branch is running on sassie-jetstream.

madscatt commented 6 years ago

Okay, I condensed the problem down to this simple test script. The issue is in PDB Scan. The segname sequence is not transferred correctly from chain to segname. ` import sassie.build.pdbscan.pdbscan as pdbscan

pdbfile = 's.pdb'

mol = pdbscan.SasMolScan() mol.read_pdb(pdbfile) mol.run_scan()

print 'chain info' for k,v in mol.chain_info.sequence.iteritems(): chain_sequence = v print 'chain : ' + k + '\n\tsequence = %s\n' % (chain_sequence,)

print 'segname info' for k,v in mol.segname_info.sequence.iteritems(): segname_sequence = v print 'segname : ' + k + '\n\tsequence = %s\n' % (segname_sequence,) chain info chain : A sequence = [(1, 'GLU'), (2, 'GLU'), (3, 'GLU'), (8, 'SER'), (9, 'ASP'), (10, 'VAL'), (11, 'LEU'), (12, 'THR'), (13, 'VAL'), (14, 'GLY'), (15, 'GLY'), (16, 'GLY')]

chain : B sequence = [(2, 'GLU'), (3, 'GLU'), (8, 'SER'), (9, 'ASP'), (10, 'VAL'), (11, 'LEU'), (12, 'THR'), (13, 'VAL'), (14, 'GLY'), (15, 'GLY')]

segname info segname : A sequence = [(1, 'GLU'), (2, 'GLU'), (3, 'GLU'), (8, 'SER'), (9, 'ASP'), (10, 'VAL'), (11, 'LEU'), (12, 'THR'), (13, 'VAL')]

segname : B sequence = [(8, 'SER'), (9, 'ASP'), (10, 'VAL'), (11, 'LEU'), (12, 'THR'), (13, 'VAL'), (14, 'GLY'), (15, 'GLY')]

`

Using this simple PDB file.

s.pdb.zip

The mistake might be in "get_segment_sequence_info" in scanner.py.

dww100 commented 6 years ago

Okay, this appears to be an issue with a subtle change in np.array: seg_chain_map = np.array(zip(segnames, chains)) used to return an array of tuples I think, now we get a 2d array. This then changes the output of np.where which leads to all the trouble.

It also explains why I was saying the logic was the same between the two lines (the difference is in looking at one or both elements of the tuple). What it doesn't explain is the choice of that over turning chains into an array.

I'll think on this an see if I can come up with an elegant solution. I don't see why your supposed solution @madscatt would go wrong but it seems a bit hard to follow.

dww100 commented 6 years ago

That all seems a bit of a non sequitur without the information that the code in question is in the get_segment_sequence_info function in scanner.py

dww100 commented 6 years ago

Actually, scratch my previous lack of concern - I think it will only work for well matched segment/chains - i.e. when no separation of chains into new segments has occurred.

dww100 commented 6 years ago

Okay, this works as far as I can see:

Replace: seg_chain_map = np.array(zip(segnames, chains))

with

seg_chain_map = np.empty(len(chains), dtype='O')
tmp = zip(segnames, chains)
seg_chain_map[:] = tmp

In my original code. I need to check what state my git repo is in before making any commits on this.

cjeong73 commented 6 years ago

Tests were done in jetstream terminal mode using ABCD molecule (5E3L, A,B = protein, C,D = nucleic acid) Let’s say “m” is missing residues in input pdb.

1)All additional missing residues at N and C terminals of A and B were reported well by pdbscan in addition to the case of missing resiudes in C and D. So, pdbscan in refactor branch works well for proteins and nucleic acid. I will see the case of gap in the middle. 2) pdbrx default in terminal run identifies all missing cases ( like mAm B, mAm mB, mAm Bm, mAm mBm) for proteins and generated correct psf and pdb files. Awesome! As a note, however missing residues in C or D were absent in pdbrx output psf and pdb files while pdbscan identified the missing residues.

dww100 commented 6 years ago

C and D are nucleic IIRC? If so I think that is a different issue.

I see that there was an error that @madscatt picked up - I have fixed that in my edit which I will submit as a pull request in a minute. It now fixes s.pdb using PDBRx