project-gemmi / gemmi

macromolecular crystallography library and utilities
https://project-gemmi.github.io/
Mozilla Public License 2.0

REVDAT entry converted to PDB format #20

Open jsoerensen opened 4 years ago

jsoerensen commented 4 years ago

In the write_remarks function in to_pdb.hpp, could you add the REVDAT entry? I've pasted some code below that should work.

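  // Needs <algorithm>, <cctype>, <sstream>, <string> and <vector>.
  // The stored values are assumed to be joined with ';'.
  std::string token;
  std::istringstream revNum(st.get_info("_pdbx_audit_revision_history.ordinal"));
  std::vector<std::string> revNums;
  while (std::getline(revNum, token, ';'))
    revNums.push_back(token);

  // Split the revision dates the same way, stripping whitespace from each.
  std::istringstream revDate(st.get_info("_pdbx_audit_revision_history.revision_date"));
  std::vector<std::string> revDates;
  while (std::getline(revDate, token, ';')) {
    // Cast via unsigned char: passing a negative char to isspace is undefined.
    token.erase(std::remove_if(token.begin(), token.end(),
                               [](unsigned char c) { return std::isspace(c) != 0; }),
                token.end());
    revDates.push_back(token);
  }

  // Write the newest revision first; stop at the shorter list so that
  // mismatched lengths cannot index out of range.
  for (int i = (int) std::min(revNums.size(), revDates.size()) - 1; i >= 0; --i)
    WRITEU("REVDAT %3s   %-9s %-51s",
           revNums[i].c_str(), revDates[i].c_str(), st.get_info("_entry.id").c_str());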

And in mmcif.hpp:

  std::string old_revnum_tag = "_database_PDB_rev.num";
  std::string new_revnum_tag = "_pdbx_audit_revision_history.ordinal";
  add_info(old_revnum_tag);
  add_info(new_revnum_tag);
  if (st.info.count(old_revnum_tag) == 1 && st.info.count(new_revnum_tag) == 0)
    st.info[new_revnum_tag] = st.info[old_revnum_tag];

  std::string old_revdate_tag = "_database_PDB_rev.date";
  std::string new_revdate_tag = "_pdbx_audit_revision_history.revision_date";
  add_info(old_revdate_tag);
  add_info(new_revdate_tag);
  if (st.info.count(old_revdate_tag) == 1 && st.info.count(new_revdate_tag) == 0)
    st.info[new_revdate_tag] = st.info[old_revdate_tag];

wojdyr commented 4 years ago

Looking at it, translating the revision records would be problematic.

For example, in 6LU7:

loop_
_pdbx_audit_revision_history.ordinal 
_pdbx_audit_revision_history.data_content_type 
_pdbx_audit_revision_history.major_revision 
_pdbx_audit_revision_history.minor_revision 
_pdbx_audit_revision_history.revision_date 
1 'Structure model' 1 0 2020-02-05 
2 'Structure model' 2 0 2020-02-12 
3 'Structure model' 2 1 2020-02-19 
4 'Structure model' 2 2 2020-02-26 
5 'Structure model' 2 3 2020-03-11 
# 
loop_
_pdbx_audit_revision_details.ordinal 
_pdbx_audit_revision_details.revision_ordinal 
_pdbx_audit_revision_details.data_content_type 
_pdbx_audit_revision_details.provider 
_pdbx_audit_revision_details.type 
_pdbx_audit_revision_details.description 
_pdbx_audit_revision_details.details 
1 1 'Structure model' repository 'Initial release'        ?                 ? 
2 2 'Structure model' author     'Coordinate replacement' 'Ligand geometry' ? 
# 
loop_
_pdbx_audit_revision_group.ordinal 
_pdbx_audit_revision_group.revision_ordinal 
_pdbx_audit_revision_group.data_content_type 
_pdbx_audit_revision_group.group 
1  2 'Structure model' Advisory                 
2  2 'Structure model' 'Atomic model'           
3  2 'Structure model' 'Data collection'        
4  2 'Structure model' 'Database references'    
5  2 'Structure model' 'Derived calculations'   
6  2 'Structure model' 'Refinement description' 
7  2 'Structure model' 'Structure summary'      
8  3 'Structure model' 'Database references'    
9  3 'Structure model' 'Structure summary'      
10 4 'Structure model' 'Data collection'        
11 5 'Structure model' 'Source and taxonomy'    
12 5 'Structure model' 'Structure summary'      
# 
loop_
_pdbx_audit_revision_category.ordinal 
_pdbx_audit_revision_category.revision_ordinal 
_pdbx_audit_revision_category.data_content_type 
_pdbx_audit_revision_category.category 
1  2 'Structure model' atom_site                    
2  2 'Structure model' citation                     
3  2 'Structure model' entity                       
4  2 'Structure model' pdbx_nonpoly_scheme          
5  2 'Structure model' pdbx_struct_assembly_prop    
6  2 'Structure model' pdbx_struct_sheet_hbond      
7  2 'Structure model' pdbx_struct_special_symmetry 
8  2 'Structure model' pdbx_validate_rmsd_bond      
9  2 'Structure model' pdbx_validate_symm_contact   
10 2 'Structure model' pdbx_validate_torsion        
11 2 'Structure model' refine                       
12 2 'Structure model' refine_hist                  
13 2 'Structure model' refine_ls_shell              
14 2 'Structure model' software                     
15 2 'Structure model' struct                       
16 2 'Structure model' struct_conn                  
17 2 'Structure model' struct_site                  
18 2 'Structure model' struct_site_gen              
19 3 'Structure model' citation                     
20 3 'Structure model' struct                       
21 4 'Structure model' diffrn_detector              
22 5 'Structure model' entity                       
23 5 'Structure model' entity_src_gen               
24 5 'Structure model' struct                       
# 
loop_
_pdbx_audit_revision_item.ordinal 
_pdbx_audit_revision_item.revision_ordinal 
_pdbx_audit_revision_item.data_content_type 
_pdbx_audit_revision_item.item 
1  2 'Structure model' '_citation.title'                                
2  2 'Structure model' '_entity.pdbx_number_of_molecules'               
3  2 'Structure model' '_pdbx_struct_assembly_prop.value'               
4  2 'Structure model' '_pdbx_struct_sheet_hbond.range_1_auth_comp_id'  
...

would need to be translated to:

REVDAT   5   11-MAR-20 6LU7    1       COMPND SOURCE                            
REVDAT   4   26-FEB-20 6LU7    1       REMARK                                   
REVDAT   3   19-FEB-20 6LU7    1       TITLE  JRNL                              
REVDAT   2   12-FEB-20 6LU7    1       TITLE  COMPND JRNL   REMARK              
REVDAT   2 2                   1       SHEET  LINK   SITE   ATOM                
REVDAT   1   05-FEB-20 6LU7    0                                                
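The fixed part of each line (everything before the record-type columns) can be derived mechanically. Below is a minimal sketch, not gemmi code: the helper names iso_to_pdb_date and format_revdat are made up for illustration, and the modification type (the 0 or 1 in column 32) is passed in by the caller because it is not derived here. The record-type columns are left out on purpose.

```cpp
#include <cctype>
#include <cstdio>
#include <initializer_list>
#include <string>

// Hypothetical helper: convert "YYYY-MM-DD" to the PDB form "DD-MMM-YY".
// Returns an empty string if the input does not look like an ISO date.
std::string iso_to_pdb_date(const std::string& iso) {
  static const char* const months[12] = {"JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                                         "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"};
  if (iso.size() != 10 || iso[4] != '-' || iso[7] != '-')
    return "";
  for (int i : {0, 1, 2, 3, 5, 6, 8, 9})
    if (!std::isdigit(static_cast<unsigned char>(iso[i])))
      return "";
  int month = (iso[5] - '0') * 10 + (iso[6] - '0');
  if (month < 1 || month > 12)
    return "";
  return iso.substr(8, 2) + "-" + months[month - 1] + "-" + iso.substr(2, 2);
}

// Hypothetical helper: format the start of a REVDAT line, up to and including
// the modification type in column 32. Record types are not written.
std::string format_revdat(int mod_num, const std::string& iso_date,
                          const std::string& id_code, int mod_type) {
  char buf[96];
  std::snprintf(buf, sizeof(buf), "REVDAT %3d   %-9s %-4s    %1d",
                mod_num, iso_to_pdb_date(iso_date).c_str(),
                id_code.c_str(), mod_type);
  return buf;
}
```

With the 6LU7 values, format_revdat(5, "2020-03-11", "6LU7", 1) reproduces the start of the first line above. The record types that follow are the part with no obvious source in the mmCIF file.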

Out of curiosity, what do you need it for?

jsoerensen commented 4 years ago

I think the code I posted does something close to that, although I've left out the last column with the revision reasons. I could add that in. Mainly, we store the original date and the last revision number and date as metadata when we deposit these structures in a database. Since structures can be revised, it's important for us to know which revision we currently have.
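For illustration, here is a minimal sketch of the metadata described above: the original date plus the last revision number and date. It assumes the ordinals and ISO dates have already been read into a vector; the Revision and RevisionSummary types and the summarize function are made-up names, not part of gemmi.

```cpp
#include <string>
#include <vector>

// One row of the revision history (hypothetical type for this sketch).
struct Revision {
  int ordinal;
  std::string date;  // ISO form, e.g. "2020-02-05"
};

// The three values kept as metadata (hypothetical type for this sketch).
struct RevisionSummary {
  std::string original_date;
  int last_ordinal = 0;
  std::string last_date;
};

// Pick the rows with the lowest and highest ordinal, without assuming that
// the history is sorted. An empty history yields a default summary.
RevisionSummary summarize(const std::vector<Revision>& history) {
  RevisionSummary s;
  if (history.empty())
    return s;
  const Revision* first = &history[0];
  const Revision* last = &history[0];
  for (const Revision& r : history) {
    if (r.ordinal < first->ordinal)
      first = &r;
    if (r.ordinal > last->ordinal)
      last = &r;
  }
  s.original_date = first->date;
  s.last_ordinal = last->ordinal;
  s.last_date = last->date;
  return s;
}
```

For the 6LU7 history quoted earlier, this gives an original date of 2020-02-05 and a last revision of 5 on 2020-03-11. None of it depends on the REVDAT record-type columns.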

jsoerensen commented 4 years ago

I don't mind posting the above as a PR with the extra column for the revision reason added, if that helps.

wojdyr commented 4 years ago

I meant that the last columns of REVDAT would be difficult to generate. How would you do it? From category?
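To make the difficulty concrete, here is a rough sketch of what a category-based lookup might look like. The table is purely a guess read off the 6LU7 example in this thread. It is not an official wwPDB mapping, and both function names are hypothetical.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// GUESSED mapping from mmCIF category to PDB record type, inferred only from
// the 6LU7 example in this thread. Unknown categories fall back to REMARK,
// which is also a guess.
std::string guess_record_type(const std::string& category) {
  static const std::map<std::string, std::string> table = {
    {"atom_site", "ATOM"},
    {"citation", "JRNL"},
    {"entity", "COMPND"},
    {"entity_src_gen", "SOURCE"},
    {"pdbx_struct_sheet_hbond", "SHEET"},
    {"struct", "TITLE"},
    {"struct_conn", "LINK"},
    {"struct_site", "SITE"},
    {"struct_site_gen", "SITE"},
  };
  auto it = table.find(category);
  return it != table.end() ? it->second : std::string("REMARK");
}

// Collect the distinct guessed record types for one revision's categories.
std::set<std::string> guess_record_types(const std::vector<std::string>& categories) {
  std::set<std::string> out;
  for (const std::string& c : categories)
    out.insert(guess_record_type(c));
  return out;
}
```

For revision 3 (citation, struct) the guess gives JRNL and TITLE, which matches the real record. For revision 5 (entity, entity_src_gen, struct) it gives COMPND, SOURCE and TITLE, while the real REVDAT lists only COMPND and SOURCE. So the category alone does not seem to be enough.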

jsoerensen commented 4 years ago

Ah, that is a fair point. I'm not sure if the RCSB has a conversion table. I'll look.

wojdyr commented 4 years ago

If I may ask, why do you switch between mmCIF and PDB?

jsoerensen commented 4 years ago

It's a fair question. We convert the mmCIF header to a PDB-style header for historical reasons. The work involved in switching our current codebase to parse each format natively would be significant. And we only need to do this for those structures that are available only as mmCIF, with no corresponding PDB file.

jsoerensen commented 4 years ago

Sadly, there doesn't seem to be a proper mapping between the PDB and mmCIF notations. http://mmcif.wwpdb.org/docs/pdb_to_pdbx_correspondences.html#REVDAT

wojdyr commented 4 years ago

I'm inclined to leave REVDAT out, at least for now. Maybe you can find a workaround to store the last revision number and date in the database.

jsoerensen commented 4 years ago

Given the limited mappings from the wwPDB, I agree with you. The code above does what I need, so I have it on a fork. I'd be much happier prioritizing the DBREF data instead.