kalekundert / macromol_census

GNU General Public License v3.0

Prefer large biological assemblies #1

Closed kalekundert closed 6 months ago

kalekundert commented 7 months ago

Right now, I ingest the subchains that make up each assembly, and prefer assemblies that use more subchains at once (by solving the set cover problem). However, I didn't consider cases where there are several assemblies of different sizes that all use the same subchains.
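To illustrate the limitation, here is a minimal greedy sketch of set cover over subchains. It is not necessarily the solver this project uses; the function name `greedy_assembly_cover` and the input layout (assembly id mapped to a set of subchain ids) are assumptions made for the example.

```python
def greedy_assembly_cover(assemblies):
    """Greedy approximation to set cover (hypothetical helper).

    `assemblies` maps an assembly id to the set of subchain ids it uses.
    Returns the assembly ids chosen, in the order they were picked.
    """
    # Everything that at least one assembly can cover.
    uncovered = set().union(*assemblies.values())
    chosen = []
    while uncovered:
        # Pick the assembly covering the most still-uncovered subchains.
        # Ties go to whichever assembly comes first in the dict.
        best = max(assemblies, key=lambda k: len(assemblies[k] & uncovered))
        chosen.append(best)
        uncovered -= assemblies[best]
    return chosen


# Assemblies "1" and "2" use the same subchains, so coverage alone cannot
# tell them apart, no matter how many copies of each subchain they contain.
example = {"1": {"A", "B"}, "2": {"A", "B"}, "3": {"A"}}
print(greedy_assembly_cover(example))
```

Because the objective only counts distinct subchains, the choice between "1" and "2" falls to dict order rather than to assembly size.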

This case comes up in capsid structures, e.g. 1a34. There aren't that many chains in the asymmetric unit, but they can be combined into variously sized fragments of the whole capsid. For my purposes, I want the full capsid, because that's the most likely to contain meaningful images.

In order to account for this, I'll need to do the following:

kalekundert commented 6 months ago

Actually, it's not as simple as just taking the biggest assembly. 2gtl is a good example:

loop_
_pdbx_struct_assembly.id 
_pdbx_struct_assembly.details 
_pdbx_struct_assembly.method_details 
_pdbx_struct_assembly.oligomeric_details 
_pdbx_struct_assembly.oligomeric_count 
1 'complete point assembly'                ? 'complete point assembly' 180 
2 'point asymmetric unit'                  ? pentadecameric            15  
3 'point asymmetric unit, std point frame' ? pentadecameric            15  
4 'crystal asymmetric unit'                ? 360-meric                 360 
# 
loop_
_pdbx_struct_assembly_gen.assembly_id 
_pdbx_struct_assembly_gen.oper_expression 
_pdbx_struct_assembly_gen.asym_id_list 
1 '(1-12)'        A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,AA,BA,CA,DA,EA,FA,GA,HA,IA,JA,KA,LA,MA,NA,OA,PA,QA,RA 
2 1               A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,AA,BA,CA,DA,EA,FA,GA,HA,IA,JA,KA,LA,MA,NA,OA,PA,QA,RA 
3 P               A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,AA,BA,CA,DA,EA,FA,GA,HA,IA,JA,KA,LA,MA,NA,OA,PA,QA,RA 
4 '(X0,X1)(1-12)' A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,AA,BA,CA,DA,EA,FA,GA,HA,IA,JA,KA,LA,MA,NA,OA,PA,QA,RA 

The biggest assembly is 4, but according to _pdbx_struct_assembly.details, it's a reconstruction of the crystal asymmetric unit. The "real" biological assembly is 1.

It's also worth noting that the _pdbx_struct_assembly table already gives the number of subunits in each assembly (oligomeric_count), so I don't necessarily need to calculate that myself by parsing the operation expressions.
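For comparison, here is a rough sketch of what parsing the operation expressions would involve. It assumes the expression is either a bare list or a sequence of parenthesized groups whose items are single ids or integer ranges, which covers the four expressions in the 2gtl example; `count_operations` is a hypothetical helper, not something from this repository.

```python
import re
from math import prod


def _group_size(group):
    """Count the operators in one comma-separated group, e.g. '1,2,5-8'."""
    n = 0
    for item in group.split(","):
        m = re.fullmatch(r"(\d+)-(\d+)", item.strip())
        # An integer range contributes every id it spans; anything else
        # (e.g. '1', 'P', 'X0') is treated as a single operator.
        n += int(m[2]) - int(m[1]) + 1 if m else 1
    return n


def count_operations(oper_expression):
    """Number of transformations generated by an oper_expression.

    Parenthesized groups are combined as a Cartesian product, so the
    total is the product of the group sizes.
    """
    expr = oper_expression.strip()
    groups = re.findall(r"\(([^()]*)\)", expr) or [expr]
    return prod(_group_size(g) for g in groups)


for expr in ["(1-12)", "1", "P", "(X0,X1)(1-12)"]:
    print(expr, count_operations(expr))
```

Assembly 2 applies a single operator and is listed as pentadecameric, which suggests 15 polymer chains among the listed asym ids. Under that reading, 12 operations give 180 subunits for assembly 1 and 24 operations give 360 for assembly 4, matching the oligomeric_count column.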

kalekundert commented 6 months ago

The documentation for the _pdbx_struct_assembly.details key is informative:

In the PDB, 'representative helical assembly', 'complete point assembly', 'complete icosahedral assembly', 'software_defined_assembly', 'author_defined_assembly', and 'author_and_software_defined_assembly' are considered "biologically relevant assemblies".

So, my best bet is just to look for one of these terms, and prefer it if possible. I'm not sure what to do in cases where none of those terms are used. I'll have to look through the PDB to see if that ever happens. Every structure I can remember looking at has used one of those terms.
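A minimal sketch of that preference, using the terms quoted above. The record layout, the name `pick_assembly`, and the tie-breaking rule (largest oligomeric_count, with a fallback to the largest assembly when no term matches) are all assumptions made for illustration; they are not taken from the implementation.

```python
# Terms quoted from the _pdbx_struct_assembly.details documentation.
BIOLOGICALLY_RELEVANT = {
    "representative helical assembly",
    "complete point assembly",
    "complete icosahedral assembly",
    "software_defined_assembly",
    "author_defined_assembly",
    "author_and_software_defined_assembly",
}


def pick_assembly(assemblies):
    """Return the id of the preferred assembly (hypothetical helper).

    Each assembly is a dict with 'id', 'details', and 'oligomeric_count'.
    Biologically relevant assemblies win; size only breaks ties
    (assumption: bigger is better within the same relevance class).
    """
    def rank(asm):
        relevant = asm["details"].strip().lower() in BIOLOGICALLY_RELEVANT
        return (relevant, asm["oligomeric_count"])

    return max(assemblies, key=rank)["id"]


# The four assemblies from the 2gtl example.
assemblies_2gtl = [
    {"id": "1", "details": "complete point assembly", "oligomeric_count": 180},
    {"id": "2", "details": "point asymmetric unit", "oligomeric_count": 15},
    {"id": "3", "details": "point asymmetric unit, std point frame", "oligomeric_count": 15},
    {"id": "4", "details": "crystal asymmetric unit", "oligomeric_count": 360},
]
print(pick_assembly(assemblies_2gtl))
```

With this ranking, assembly 1 is chosen for 2gtl even though assembly 4 has more subunits.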

kalekundert commented 6 months ago

I've now done a survey of all the assemblies in the PDB; see expt 54. Some results:

Algorithm for filtering/ranking assemblies:

kalekundert commented 6 months ago

Implemented by 1d60b17