Closed kalekundert closed 6 months ago
Actually, it's not as simple as just taking the biggest assembly. 2gtl is a good example:
loop_
_pdbx_struct_assembly.id
_pdbx_struct_assembly.details
_pdbx_struct_assembly.method_details
_pdbx_struct_assembly.oligomeric_details
_pdbx_struct_assembly.oligomeric_count
1 'complete point assembly' ? 'complete point assembly' 180
2 'point asymmetric unit' ? pentadecameric 15
3 'point asymmetric unit, std point frame' ? pentadecameric 15
4 'crystal asymmetric unit' ? 360-meric 360
#
loop_
_pdbx_struct_assembly_gen.assembly_id
_pdbx_struct_assembly_gen.oper_expression
_pdbx_struct_assembly_gen.asym_id_list
1 '(1-12)' A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,AA,BA,CA,DA,EA,FA,GA,HA,IA,JA,KA,LA,MA,NA,OA,PA,QA,RA
2 1 A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,AA,BA,CA,DA,EA,FA,GA,HA,IA,JA,KA,LA,MA,NA,OA,PA,QA,RA
3 P A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,AA,BA,CA,DA,EA,FA,GA,HA,IA,JA,KA,LA,MA,NA,OA,PA,QA,RA
4 '(X0,X1)(1-12)' A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,AA,BA,CA,DA,EA,FA,GA,HA,IA,JA,KA,LA,MA,NA,OA,PA,QA,RA
The biggest assembly is 4. But as specified in _pdbx_struct_assembly.details, it's a reconstruction of the crystal asymmetric unit. The "real" biological assembly is 1.
It's also worth noting that the _pdbx_struct_assembly table already gives the oligomeric count of each assembly (the oligomeric_count column), so I don't necessarily need to calculate that myself by parsing the operation expressions.
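For cross-checking, here is a rough sketch of what parsing the operation expressions would involve. It assumes the usual mmCIF convention that each parenthesized group is a comma-separated list of operator ids or numeric ranges, and that consecutive groups are combined as a product. The function name `count_operations` is hypothetical.

```python
import re

def count_operations(oper_expression):
    """Count the operators generated by an oper_expression string."""
    expr = oper_expression.strip()
    # Parenthesized groups multiply; a bare expression is a single group.
    groups = re.findall(r'\(([^()]*)\)', expr) or [expr]
    total = 1
    for group in groups:
        n = 0
        for item in group.split(','):
            item = item.strip()
            m = re.fullmatch(r'(\d+)-(\d+)', item)
            if m:
                # Numeric range such as "1-12", inclusive at both ends.
                n += int(m.group(2)) - int(m.group(1)) + 1
            else:
                # Single operator id such as "1", "P" or "X0".
                n += 1
        total *= n
    return total
```

For 2gtl, assembly 2 applies a single operation and has an oligomeric count of 15, so multiplying 15 by the operator counts for assemblies 1 and 4 (12 and 24) agrees with the 180 and 360 in the table.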
The documentation for the _pdbx_struct_assembly.details key is informative:
In the PDB, 'representative helical assembly', 'complete point assembly', 'complete icosahedral assembly', 'software_defined_assembly', 'author_defined_assembly', and 'author_and_software_defined_assembly' are considered "biologically relevant assemblies".
So, my best bet is to look for one of these terms and prefer any assembly that uses it. I'm not sure what to do in cases where none of those terms are used. I'll have to look through the PDB to see if that ever happens. Every structure I can remember looking at has used one of those terms.
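A minimal sketch of that preference, assuming each assembly has already been read into a dict with `id` and `details` keys. The name `prefer_relevant` is hypothetical, and the fallback of keeping every assembly when no term matches is only a placeholder assumption, since that case is still an open question.

```python
# Terms the PDB documentation lists as "biologically relevant assemblies".
RELEVANT_DETAILS = {
    'representative helical assembly',
    'complete point assembly',
    'complete icosahedral assembly',
    'software_defined_assembly',
    'author_defined_assembly',
    'author_and_software_defined_assembly',
}

def prefer_relevant(assemblies):
    """Keep assemblies whose details match a relevant term, in file order.

    Assumption: if nothing matches, return all assemblies unchanged.
    """
    relevant = [a for a in assemblies if a['details'] in RELEVANT_DETAILS]
    return relevant or list(assemblies)

# The four assemblies from the 2gtl table.
assemblies_2gtl = [
    {'id': '1', 'details': 'complete point assembly'},
    {'id': '2', 'details': 'point asymmetric unit'},
    {'id': '3', 'details': 'point asymmetric unit, std point frame'},
    {'id': '4', 'details': 'crystal asymmetric unit'},
]
```

Applied to the 2gtl rows, this keeps only assembly 1.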
I've now done a survey of all the assemblies in the PDB (see expt 54). Some results:
- 1jsd has two relevant assemblies. Both have all the same subchains, but one is a monomer and the other is a trimer. The trimer is the "real" assembly.
- 6nax has two assemblies. Each is a different instance of the same monomer in the asymmetric unit.
- 6dwu is notable for having 44 assemblies, more than any other structure. These are all dimers involving two of the subchains in the asymmetric unit. Cases like this are when I need to do the set-cover problem, then break ties by order in the PDB.

Algorithm for filtering/ranking assemblies:
Implemented by 1d60b17
Right now, I ingest the subchains that make up each assembly, and prefer assemblies that use more subchains at once (by solving the set cover problem). However, I didn't consider cases where there are several assemblies of different sizes that all use the same subchains.
This case comes up in capsid structures, e.g. 1a34. There aren't that many chains in the asymmetric unit, but they can be combined into variously sized fragments of the whole capsid. For my purposes, I want the full capsid, because that's the most likely to contain meaningful images.

In order to account for this, I'll need to do the following:
- Update mmc_find_assembly_subchain_cover to make a total ranking of assemblies. Continue to discard those assemblies that aren't part of the minimal set cover, but then rank those that remain on the size of the assembly. Break ties by the order they appear in the PDB file.
- Update mmc_pick_assemblies to account for this ranking.
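The ranking described above could be sketched roughly as follows. This is not the code in mmc_find_assembly_subchain_cover. It is a standalone illustration with hypothetical names (`Assembly`, `rank_assemblies`), and it swaps in a greedy approximation of set cover rather than an exact minimal cover. It also assumes that assemblies sharing the same set of subchains are kept or discarded together, and that "size" means the oligomeric count.

```python
from typing import FrozenSet, List, NamedTuple

class Assembly(NamedTuple):
    id: str
    subchains: FrozenSet[str]
    oligomeric_count: int

def rank_assemblies(assemblies: List[Assembly]) -> List[str]:
    """Rank assembly ids; `assemblies` must be in PDB file order."""
    # Distinct subchain sets, in order of first appearance.
    distinct = list(dict.fromkeys(a.subchains for a in assemblies))

    # Greedy set cover: repeatedly take the set that covers the most
    # still-uncovered subchains. max() returns the first maximum, so
    # ties go to the set that appears earliest in the file.
    uncovered = set().union(*distinct)
    cover = set()
    while uncovered:
        best = max(distinct, key=lambda s: len(s & uncovered))
        cover.add(best)
        uncovered -= best

    # Discard assemblies outside the cover, then rank the rest by
    # size (largest first), breaking ties by file order.
    kept = [(i, a) for i, a in enumerate(assemblies) if a.subchains in cover]
    kept.sort(key=lambda pair: (-pair[1].oligomeric_count, pair[0]))
    return [a.id for _, a in kept]

# Made-up example: three assemblies reuse subchains A and B at
# different sizes, one is redundant, and one covers subchain C.
example = [
    Assembly('1', frozenset({'A', 'B'}), 2),
    Assembly('2', frozenset({'A', 'B'}), 60),
    Assembly('3', frozenset({'A'}), 1),
    Assembly('4', frozenset({'C'}), 1),
    Assembly('5', frozenset({'A', 'B'}), 60),
]
```

In this made-up example, assembly 3 is dropped because its subchains are already covered, the two 60-mers rank first in file order, and the smaller assemblies follow. Size alone would still pick the crystal asymmetric unit for a structure like 2gtl, so a ranking like this presumably has to be combined with the preference for biologically relevant assemblies.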