informatics-isi-edu / pdb-ihm

Deriva Protein Database Project

mmCIF file generation #41

Closed brindakv closed 2 years ago

brindakv commented 4 years ago

Requirements for extracting catalog data into mmCIF (to be used as an input to a separate pdb system)

Some examples:

loop_
_entity.id
_entity.type
_entity.src_method
_entity.pdbx_description
_entity.formula_weight
_entity.pdbx_number_of_molecules
1 polymer man "C1q subunits A, C, and B" 45697.594 1
2 non-polymer man N-ACETYL-D-GLUCOSAMINE 221.208 1
3 non-polymer syn 'CALCIUM ION' . 1
#
loop_
_entity_poly.entity_id
_entity_poly.type
_entity_poly.nstd_linkage
_entity_poly.nstd_monomer
_entity_poly.pdbx_seq_one_letter_code
_entity_poly.pdbx_seq_one_letter_code_can
1 'polypeptide(L)' no no
;KDQPRPAFSAIRRNPPMGGNVVIFDTVITNQEEPYQNHSGRFVCTVPGYYYFTFQVLSQWEICLSIVSSSRGQVRRSLGF
CDTTNKGLFQVVSGGMVLQLQQGDQVWVEKDPKKGHIYQGSEADSVFSGFLIFPSAGSGKQKFQSVFTVTRQTHQPPAPN
SLIRFNAVLTNPQGDYDTSTGKFTCKVPGLYYFVYHASHTANLCVLLYRSGVKVVTFCGHTSKTNQVNSGGVLLRLQVGE
EVWLAVNDYYDMVGIQGSDSVFSGFLLFPDGSAKATQKIAFSATRTINVPLRRDQTIRFDHVITNMNNNYEPRSGKFTCK
VPGLYYFTYHASSRGNLCVNLMRGRERAQKVVTFCDYAYNTFQVTTGGMVLKLEQGENVFLQATDKNSLLGMEGANSIFS
GFLLFPDMEA
;
;KDQPRPAFSAIRRNPPMGGNVVIFDTVITNQEEPYQNHSGRFVCTVPGYYYFTFQVLSQWEICLSIVSSSRGQVRRSLGF
CDTTNKGLFQVVSGGMVLQLQQGDQVWVEKDPKKGHIYQGSEADSVFSGFLIFPSAGSGKQKFQSVFTVTRQTHQPPAPN
SLIRFNAVLTNPQGDYDTSTGKFTCKVPGLYYFVYHASHTANLCVLLYRSGVKVVTFCGHTSKTNQVNSGGVLLRLQVGE
EVWLAVNDYYDMVGIQGSDSVFSGFLLFPDGSAKATQKIAFSATRTINVPLRRDQTIRFDHVITNMNNNYEPRSGKFTCK
VPGLYYFTYHASSRGNLCVNLMRGRERAQKVVTFCDYAYNTFQVTTGGMVLKLEQGENVFLQATDKNSLLGMEGANSIFS
GFLLFPDMEA
;
#
loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
_entity_poly_seq.hetero
1 1 LYS n
1 2 ASP n
1 3 GLN n
1 4 PRO n
1 5 ARG n
1 6 PRO n
1 7 ALA n
1 8 PHE n
1 9 SER n
1 10 ALA n
1 11 ILE n
1 12 ARG n
1 13 ARG n
1 14 ASN n
1 15 PRO n
#

hongsudt commented 4 years ago

Regarding the `;` delimiters: in the above example, I assume that the first `;...;` block is for `_entity_poly.pdbx_seq_one_letter_code` and the second one is for `_entity_poly.pdbx_seq_one_letter_code_can`?
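
As a rough illustration of the semicolon-delimited text fields in question, the Python sketch below wraps a long one-letter sequence the way the example appears to: 80 characters per line, the opening `;` directly before the first line, and the closing `;` on a line of its own. The function name `format_text_field` and its `width` parameter are hypothetical and not taken from the export script.

```python
def format_text_field(value: str, width: int = 80) -> str:
    """Render a long value as an mmCIF semicolon-delimited text field."""
    if width < 1:
        raise ValueError("width must be positive")
    # Split the value into fixed-width chunks; the last chunk may be shorter.
    chunks = [value[i:i + width] for i in range(0, len(value), width)] or [""]
    # The opening ';' sits at column 1 right before the first chunk,
    # and the closing ';' goes on a line of its own.
    return ";" + "\n".join(chunks) + "\n;"


if __name__ == "__main__":
    # Short width only to keep the demo output small.
    print(format_text_field("KDQPRPAFSAIRRNPPMGGNVVIF", width=10))
```

Under this reading, each polymer row would carry two such fields in a row, one per sequence column.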

brindakv commented 4 years ago
svoinea commented 4 years ago

@brindakv Please review the file PDBDEV_00000009.txt which represents the output for structure_id=PDBDEV_00000009.

svoinea commented 4 years ago

The new file PDBDEV_00000009.txt sorts the rows by primary key (PK) and includes the PDBDEV_00000009.cif file at the bottom.

brindakv commented 4 years ago

Some tables need to be appended from the mmCIF file uploaded by the user:

hongsudt commented 4 years ago

Are you implying that we threw away data from the mmCIF file when we processed it in step 2?


From: Brinda Vallat notifications@github.com Sent: Thursday, July 16, 2020 11:18 AM To: informatics-isi-edu/protein-database protein-database@noreply.github.com Cc: Hongsuda Tangmunarunkit hongsuda@isi.edu; Comment comment@noreply.github.com Subject: Re: [informatics-isi-edu/protein-database] mmCIF file generation (#41)

Some tables need to be appended from the mmCIF file uploaded by the user:

Not all entries will have all of the above tables.


brindakv commented 4 years ago

That data was loaded into the Deriva database in step 2. If users add or modify that data in steps 3 or 4, then the database will have more up-to-date information than what was in the mmCIF file in step 2. The only information that is not in the database is the xyz coordinates, which need to be taken from the mmCIF file in step 2.

svoinea commented 4 years ago

For the structure_id=PDBDEV_00000009, my script detected the following additional tables present in the PDBDEV_00000009.cif file:

    entry
    pdbx_database_status
    pdbx_audit_revision_history
    pdbx_audit_revision_details
    software
    pdbx_poly_seq_scheme
    ihm_starting_model_coord
    atom_site

The new export output is PDBDEV_00000009.txt. My script first writes the data of the tables from the Deriva database, and then selects from the PDBDEV_00000009.cif file the data of the tables that are NOT present in the Deriva database.
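
The ordering described here can be sketched in Python as follows. This is only an illustration under assumptions: the function name `merge_tables`, the dict-of-text-blocks representation, and the sample table contents are all hypothetical and not taken from the actual script.

```python
def merge_tables(db_tables: dict, cif_tables: dict) -> dict:
    """Return database tables first, then tables found only in the mmCIF file."""
    # Plain dicts preserve insertion order (Python 3.7+), so the database
    # tables keep their original order and come first.
    merged = dict(db_tables)
    for name, block in cif_tables.items():
        # A table already exported from the database wins over the file copy.
        if name not in merged:
            merged[name] = block
    return merged


if __name__ == "__main__":
    db = {"entity": "rows from database", "struct": "rows from database"}
    cif = {"entity": "rows from file", "atom_site": "coordinates from file"}
    for table, source in merge_tables(db, cif).items():
        print(table, "<-", source)
```

The database copy takes precedence because, per the discussion above, it may have been edited after the mmCIF file was uploaded.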

svoinea commented 4 years ago

I have set the entry table from the deriva database as the first table in the export output:

loop_
_entry.id
PDBDEV_00000009
#

I am uploading the zip file that contains the PDBDEV_00000009.cif file: PDBDEV_00000009.cif.zip

brindakv commented 4 years ago

The loop_ directive is not required for the entry table because, by convention, an mmCIF file contains only one entry. There may be other software applications that assume this convention as a rule.
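
A minimal Python sketch of a writer that follows this convention is shown below. The names `format_category` and `NO_LOOP_CATEGORIES` are hypothetical, values are assumed to be already quoted where needed, and special-casing only the `entry` category is an assumption based on this comment.

```python
# Assumption: only `entry` is written without loop_, as requested above.
NO_LOOP_CATEGORIES = {"entry"}


def format_category(name: str, columns: list, rows: list) -> str:
    """Serialize one category as key-value pairs or as a loop_ block."""
    lines = []
    if name in NO_LOOP_CATEGORIES and len(rows) == 1:
        # Single-row form: one "_category.column value" pair per line.
        for column, value in zip(columns, rows[0]):
            lines.append(f"_{name}.{column} {value}")
    else:
        # Tabular form: loop_, then the column names, then one line per row.
        lines.append("loop_")
        lines.extend(f"_{name}.{column}" for column in columns)
        lines.extend(" ".join(str(value) for value in row) for row in rows)
    lines.append("#")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_category("entry", ["id"], [["PDBDEV_00000009"]]))
    print(format_category("entity", ["id", "type"],
                          [[1, "polymer"], [2, "non-polymer"]]))
```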

svoinea commented 4 years ago

The new file PDBDEV_00000009.cif.zip has the entry table without the loop_ directive, and the exported file includes only the rows of the mmCIF file that belong to the tables specified above.

brindakv commented 4 years ago

There are two consecutive loop_ directives on lines 1200/1201 and 3081/3082 in the latest mmCIF file. These lead to syntax errors while validating.
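
Until a full validator is available, this particular error can be caught with a simple line scan. The Python sketch below is a hypothetical helper (`find_consecutive_loops` is not part of any mmCIF tool) and deliberately naive: it does not track semicolon-delimited text fields, so a literal `loop_` line inside one would be reported as well.

```python
def find_consecutive_loops(text: str) -> list:
    """Return (line, next_line) pairs, 1-based, where loop_ follows loop_."""
    lines = text.splitlines()
    hits = []
    for i in range(1, len(lines)):
        if lines[i - 1].strip() == "loop_" and lines[i].strip() == "loop_":
            # lines[i - 1] is line number i and lines[i] is line number i + 1.
            hits.append((i, i + 1))
    return hits


if __name__ == "__main__":
    sample = "loop_\nloop_\n_entity.id\n1\n#\n"
    for first, second in find_consecutive_loops(sample):
        print(f"consecutive loop_ directives on lines {first}/{second}")
```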

svoinea commented 4 years ago

Sorry for the inconvenience. I don't have the validator yet. I am re-uploading the latest result with the correction: PDBDEV_00000009.cif.zip

brindakv commented 4 years ago

Thank you @svoinea. The latest mmCIF file validates.

brindakv commented 4 years ago

I tried validating mmCIF files for different entries. Some of these have data with single / double quotes. They all validate.

brindakv commented 3 years ago

The mmCIF file generated for entry D_1-P1ZE does not validate with the mmCIF dictionary tool.

The error is in handling text which contains quotes in the ihm_modeling_protocol table.

svoinea commented 2 years ago

We detected that different kinds of double-quote characters (", “, ...), specific to different languages, are treated the same way by the mmCIF validation tool. That breaks the rule:

 if the value does not contain `"` , then enquote the value by `"`

As a workaround, my script applies that rule to any kind of double-quote character.
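
One way to picture the workaround is the Python sketch below. Only the double-quote rule comes from this thread; the set of characters in `DOUBLE_QUOTE_CHARS`, the fallback to single quotes, and the final fallback to a semicolon-delimited text field are assumptions added for illustration, and `quote_value` is a hypothetical name.

```python
# Assumption: an illustrative, not exhaustive, set of double-quote characters
# (ASCII quote plus left, right, and low typographic double quotes).
DOUBLE_QUOTE_CHARS = frozenset('"\u201c\u201d\u201e')


def quote_value(value: str) -> str:
    """Quote a single-line value, treating every double-quote variant alike."""
    has_double = any(ch in DOUBLE_QUOTE_CHARS for ch in value)
    if "\n" not in value and not has_double:
        # Rule from the thread: no double quote present, so enquote with ".
        return f'"{value}"'
    if "\n" not in value and "'" not in value:
        # Assumed fallback: single quotes when the value has no apostrophe.
        return f"'{value}'"
    # Assumed last resort: a semicolon-delimited text field.
    return f"\n;{value}\n;"


if __name__ == "__main__":
    print(quote_value("CALCIUM ION"))
    print(quote_value("the \u201crefined\u201d model"))
```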

svoinea commented 2 years ago

mmCIF allows using the `.` character for default values of columns. If the JSON file does not contain a column that:

- has the constraint `NOT NULL`,
- is not a primary or foreign key, and
- is of type `text`,

then the backend script sets the value `.` for it.

Currently, the following columns are candidates for it:

{
    "ihm_starting_model_details": [
        "starting_model_auth_asym_id"
    ],
    "pdbx_protein_info": [
        "name"
    ],
    "pdbx_inhibitor_info": [
        "name"
    ],
    "software": [
        "classification",
        "name"
    ],
    "ihm_dataset_related_db_reference": [
        "accession_code"
    ],
    "ihm_starting_comparative_models": [
        "starting_model_auth_asym_id",
        "template_auth_asym_id"
    ],
    "ihm_chemical_component_descriptor": [
        "auth_name"
    ],
    "audit_author": [
        "name"
    ],
    "struct_ref": [
        "db_code"
    ],
    "ihm_probe_list": [
        "probe_name"
    ],
    "ihm_external_reference_info": [
        "reference"
    ],
    "chem_comp_atom": [
        "type_symbol"
    ],
    "ihm_starting_model_seq_dif": [
        "db_asym_id",
        "db_comp_id"
    ],
    "pdbx_ion_info": [
        "name"
    ]
}
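
A minimal Python sketch of this default-filling step follows. It uses a small subset of the candidate columns listed above; the names `fill_defaults` and `DEFAULT_CANDIDATES` are hypothetical, and treating a JSON `null` (Python `None`) the same as a missing column is an assumption.

```python
# Subset of the candidate columns listed above, keyed by table name.
DEFAULT_CANDIDATES = {
    "software": ["classification", "name"],
    "audit_author": ["name"],
    "struct_ref": ["db_code"],
}


def fill_defaults(table: str, row: dict, candidates: dict = DEFAULT_CANDIDATES) -> dict:
    """Return a copy of row with '.' set for missing candidate columns."""
    filled = dict(row)
    for column in candidates.get(table, []):
        # Assumption: a column that is absent or null counts as missing.
        if filled.get(column) is None:
            filled[column] = "."
    return filled


if __name__ == "__main__":
    print(fill_defaults("software", {"pdbx_ordinal": 1, "name": "IMP"}))
```

Tables without candidate columns pass through unchanged.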

svoinea commented 2 years ago

mmCIF has some columns of type `ucode`, which hold case-insensitive text values. If such a column is also a foreign key, its value needs to be synchronized with the values in the referenced table.

Currently, the backend script handles such cases only for the following vocabulary tables, which contain only Yes/No values:

{
    "ihm_poly_residue_feature_interface_residue_flag": [
        "NO",
        "YES"
    ],
    "ihm_modeling_protocol_details_ensemble_flag": [
        "NO",
        "YES"
    ],
    "pseudo_site_flag": [
        "Yes",
        "No"
    ],
    "ihm_modeling_protocol_details_multi_state_flag": [
        "NO",
        "YES"
    ],
    "ihm_probe_list_reactive_probe_flag": [
        "no",
        "yes"
    ],
    "ihm_3dem_restraint_map_segment_flag": [
        "NO",
        "YES"
    ],
    "ihm_dataset_list_database_hosted": [
        "NO",
        "YES"
    ],
    "ihm_modeling_protocol_details_ordered_flag": [
        "NO",
        "YES"
    ],
    "ihm_2dem_class_average_restraint_image_segment_flag": [
        "NO",
        "YES"
    ],
    "ihm_modeling_protocol_details_multi_scale_flag": [
        "NO",
        "YES"
    ],
    "ihm_poly_probe_position_modification_flag": [
        "no",
        "yes"
    ],
    "ihm_poly_probe_position_mutation_flag": [
        "no",
        "yes"
    ],
    "sub_sample_flag": [
        "Yes",
        "No"
    ],
    "ihm_sas_restraint_profile_segment_flag": [
        "NO",
        "YES"
    ],
    "ihm_poly_probe_conjugate_ambiguous_stoichiometry_flag": [
        "no",
        "yes"
    ]
}
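
The synchronization can be sketched in Python as a case-insensitive lookup that returns the spelling stored in the vocabulary table. The example uses three of the vocabularies listed above; `normalize_ucode` and `VOCABULARIES` are hypothetical names, and raising `ValueError` for an unknown term is an assumption about the desired behavior.

```python
# Three of the vocabulary tables listed above, with their stored spellings.
VOCABULARIES = {
    "ihm_dataset_list_database_hosted": ["NO", "YES"],
    "pseudo_site_flag": ["Yes", "No"],
    "ihm_probe_list_reactive_probe_flag": ["no", "yes"],
}


def normalize_ucode(vocabulary: str, value: str, vocabularies: dict = VOCABULARIES) -> str:
    """Map a case-insensitive value to the spelling used in the vocabulary."""
    # Index the stored terms by their case-folded form.
    lookup = {term.casefold(): term for term in vocabularies[vocabulary]}
    try:
        return lookup[value.casefold()]
    except KeyError:
        raise ValueError(f"{value!r} is not a term of {vocabulary}") from None


if __name__ == "__main__":
    print(normalize_ucode("ihm_dataset_list_database_hosted", "yes"))
    print(normalize_ucode("pseudo_site_flag", "NO"))
```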