keitaroyam / servalcat

Structure refinement and validation for crystallography and single particle analysis
Mozilla Public License 2.0

poly.type not recognized by the wwPDB servers #19

Open daniel-s-d-larsson opened 1 month ago

daniel-s-d-larsson commented 1 month ago

I have a problem: the wwPDB deposition server and the wwPDB validation server both misidentify my polymer chains as polyribonucleotide instead of polypeptide(L) when I upload the refined.mmcif file from refine_spa_norefmac. The poly.type is set correctly in the header, but the wwPDB staff say it is wrong on their side.

Example:

loop_
_entity_poly.entity_id
_entity_poly.type
_entity_poly.pdbx_strand_id
_entity_poly.pdbx_seq_one_letter_code
A polyribonucleotide A ?
B polypeptide(L)     B ?
C polypeptide(L)     C ?
D polypeptide(L)     D ?
E polypeptide(L)     E ?
...

This is a ribosome structure with both ribonucleic acids and many proteins. Strangely, only protein chains up to a specific point are identified as polyribonucleotide, which indicates to me that there is some corruption in the file. Could the lack of TER records cause this problem? In an earlier deposition, which used an older version of refine_spa, the polymer type records were not there.

wojdyr commented 1 month ago

If you mean the lack of TER in mmCIF, it's fine, only PDB files have TER records.

The software used on the PDB servers discards and regenerates some information; it may easily happen that a correct annotation is replaced with an incorrect one. To investigate it, we'd need an example file that demonstrates the problem.

daniel-s-d-larsson commented 1 month ago

How can I send you the file if I don't want to post it here? Maybe I could remove the coordinate columns...

wojdyr commented 1 month ago

You could send it by email (wojdyr@gmail.com). Editing the file requires some work on your side, but would also be fine; perhaps it'd suffice to include only a few residues in each chain.

keitaroyam commented 1 month ago

Please also send it to me, or let Marcin share it with me. Did the PDB staff explain what was wrong?

daniel-s-d-larsson commented 1 month ago

Ok, I will send you the problematic file. I will also ask the wwPDB staff to describe the problem in detail.

daniel-s-d-larsson commented 1 month ago

I have now tried uploading different modified versions of the mmcif file to the deposition and validation servers, including running the files through the PDB extract server, and I cannot figure out exactly what is causing the problem. For the time being, I cannot spend more time on this issue, but everything points to the PDB servers either reading the poly.type records incorrectly or mapping them to the chains incorrectly. My workaround is to delete the section entirely before uploading to a fresh deposition session.

wojdyr commented 1 month ago

Wouldn't it be easier to just send us the file? (Never mind, Keitaro reproduced it using 7k00)

wojdyr commented 1 month ago

It seems to be a bug in maxit. I compiled v11.200 and reduced 7k00 to a ~500-line file, input.mmcif.gz, to reproduce it.

The input file has:

loop_
_entity.id
_entity.type
A     polymer
B     polymer

loop_
_entity_poly.entity_id
_entity_poly.type
A polyribonucleotide
B polypeptide(L)

Running:

maxit -input input.mmcif -output output.cif -o 8

produces output.cif with:

loop_
_entity_poly.entity_id 
_entity_poly.type 
_entity_poly.nstd_linkage 
_entity_poly.nstd_monomer 
_entity_poly.pdbx_seq_one_letter_code 
_entity_poly.pdbx_seq_one_letter_code_can 
_entity_poly.pdbx_strand_id 
_entity_poly.pdbx_target_identifier 
1 polyribonucleotide no no AAUUGAAGA   AAUUGAAGA   A ? 
2 polyribonucleotide no no VSMRDMLKAGV VSMRDMLKAGV B ? 
# 
loop_
_entity_poly_seq.entity_id 
_entity_poly_seq.num 
_entity_poly_seq.mon_id 
_entity_poly_seq.hetero 
1 1  A   n 
1 2  A   n 
1 3  U   n 
1 4  U   n 
1 5  G   n 
1 6  A   n 
1 7  A   n 
1 8  G   n 
1 9  A   n 
2 1  VAL n 
2 2  SER n 
2 3  MET n 
2 4  ARG n 
2 5  ASP n 
2 6  MET n 
2 7  LEU n 
2 8  LYS n 
2 9  ALA n 
2 10 GLY n 
2 11 VAL n 
# 
loop_
_entity.id 
_entity.type 
_entity.src_method 
_entity.pdbx_description 
_entity.formula_weight 
_entity.pdbx_number_of_molecules 
_entity.pdbx_ec 
_entity.pdbx_mutation 
_entity.pdbx_fragment 
_entity.details 
1 polymer man 
;RNA (5'-R(P*AP*AP*UP*UP*GP*AP*AP*GP*A)-3')
;
2903.815 1 ? ? ? ? 
2 polymer man VAL-SER-MET-ARG-ASP-MET-LEU-LYS-ALA-GLY-VAL  1208.495 1 ? ? ? ? 
# 

All looks fine apart from:

1 polyribonucleotide no no AAUUGAAGA   AAUUGAAGA   A ? 
2 polyribonucleotide no no VSMRDMLKAGV VSMRDMLKAGV B ?

If I change the order of lines in the input to:

loop_
_entity_poly.entity_id
_entity_poly.type
B polypeptide(L)
A polyribonucleotide

then in the output I get:

1 "polypeptide(L)" no no AAUUGAAGA   AAUUGAAGA   A ? 
2 "polypeptide(L)" no no VSMRDMLKAGV VSMRDMLKAGV B ?

If _entity_poly.type is absent in the input file, it's correct in the output.
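A quick way to spot this symptom in a regenerated file is to compare each declared _entity_poly.type with the monomers listed in _entity_poly_seq. The following is only a rough sketch in plain Python, not part of maxit or servalcat: the function names are made up for illustration, the inputs are assumed to be already extracted into dictionaries, and only standard RNA, DNA and L-amino-acid residues are recognized (anything else is left unjudged).

```python
# Hypothetical helper names; a heuristic check, not an official validator.
RNA = {"A", "C", "G", "U"}
DNA = {"DA", "DC", "DG", "DT"}
AMINO = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def guess_type(mon_ids):
    """Guess the polymer type from monomer names, or None if unsure."""
    mons = set(mon_ids)
    if not mons:
        return None
    if mons <= RNA:
        return "polyribonucleotide"
    if mons <= DNA:
        return "polydeoxyribonucleotide"
    if mons <= AMINO:
        return "polypeptide(L)"
    return None  # modified or mixed residues: do not guess


def find_mismatches(declared, monomers):
    """Return (entity_id, declared_type, guessed_type) for each conflict.

    declared: entity_id -> value of _entity_poly.type
    monomers: entity_id -> list of _entity_poly_seq.mon_id values
    """
    bad = []
    for entity_id, declared_type in declared.items():
        guess = guess_type(monomers.get(entity_id, []))
        if guess is not None and guess != declared_type:
            bad.append((entity_id, declared_type, guess))
    return bad


if __name__ == "__main__":
    # Values copied from the output.cif excerpt above.
    declared = {"1": "polyribonucleotide", "2": "polyribonucleotide"}
    monomers = {
        "1": list("AAUUGAAGA"),
        "2": ["VAL", "SER", "MET", "ARG", "ASP", "MET",
              "LEU", "LYS", "ALA", "GLY", "VAL"],
    }
    for row in find_mismatches(declared, monomers):
        print(row)
```

With the values from the output above, entity 2 is the only one reported, since its monomers are amino acids while its declared type is polyribonucleotide.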

daniel-s-d-larsson commented 1 month ago

Good that you found the culprit. For the time being, I will just delete the poly.type section before I upload files to wwPDB.
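For anyone who wants to script this workaround, here is a minimal sketch in plain Python that drops the _entity_poly loop from the mmCIF text before upload. It is a line-based filter rather than a real CIF parser, and the function names are made up for illustration. It assumes the category is written as a loop_ (as in the examples in this thread), so a non-looped key-value form is left untouched. The _entity_poly_seq loop is kept.

```python
def _ends_loop(line):
    """True if this line cannot be a data row of the current loop."""
    s = line.strip()
    return s == "" or s.startswith(("#", "_", "loop_", "data_"))


def strip_category(cif_text, prefix="_entity_poly."):
    """Return cif_text without any loop_ whose tags all start with prefix."""
    lines = cif_text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        if lines[i].strip() == "loop_":
            # Collect the tag lines that follow loop_.
            j = i + 1
            tags = []
            while j < len(lines) and lines[j].lstrip().startswith("_"):
                tags.append(lines[j].split()[0].lower())
                j += 1
            if tags and all(t.startswith(prefix) for t in tags):
                # Skip the data rows, stepping over ';' text fields.
                in_text = False
                while j < len(lines):
                    if lines[j].startswith(";"):
                        in_text = not in_text
                    elif not in_text and _ends_loop(lines[j]):
                        break
                    j += 1
                i = j
                continue
        out.append(lines[i])
        i += 1
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    import sys

    # Usage: python strip_entity_poly.py refined.mmcif > stripped.mmcif
    with open(sys.argv[1]) as handle:
        sys.stdout.write(strip_category(handle.read()))
```

It is worth diffing the result against the original before uploading, since a text filter like this can be fooled by unusual formatting.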

keitaroyam commented 1 month ago

Just noticed that maxit-v11.300 has been released (https://sw-tools.rcsb.org/apps/MAXIT/source.html), and it worked properly. I also tested https://validate-rcsb-1.wwpdb.org/, which still seemed to use an older maxit version.