mmCIF file generation - GitHub issues

brindakv commented 4 years ago

Requirements for extracting catalog data into mmCIF (to be used as an input to a separate pdb system)

Data needs to be exported entry-wise i.e., only data belonging to a particular entry (as denoted by entry.id) is to be exported
Only tables and columns in the json schema need to be exported
Space or tabs can be used to separate column values in a row
If there is an optional column in a table and some rows in the table have values and some rows don't, then . can be used to denote missing values
Single or double quotes can be used for textual column values that contain spaces
Multi-line texts are enclosed within ; (see example below)
If the text contains single quotes, it can be enclosed within double quotes, and vice versa. If the text contains both single and double quotes, it is enclosed within ; like multi-line text.
- 'This is a "test"'
- "It's a test"
- When both are present, use ; e.g.
```
;This is a "test" and It's a test
; 
<next col>
```
# can be used to add empty lines between tables
When vocab tables are used, the corresponding values should be used to populate the mmCIF tables (see entity.type in the example below).
If a table returns zero rows for a particular structure_id, then the table need not be included in the mmCIF file
The structure_id column in each table need not be included in the mmCIF file
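
The quoting rules in the list above can be sketched as a single helper. This is a minimal sketch, not the export script itself: the function name `quote_mmcif_value` is hypothetical, and it covers only the cases named in the requirements (missing values, spaces, quotes, multi-line text). It does not handle other special cases of the CIF syntax, such as values that begin with `_` or `#`.

```python
def quote_mmcif_value(value):
    """Format one column value as an mmCIF token (hypothetical helper)."""
    if value is None or value == "":
        return "."  # missing optional value
    text = str(value)
    if "\n" in text or ("'" in text and '"' in text):
        # Multi-line text, or text containing both quote types:
        # use a semicolon-delimited field. Each ';' starts its own line.
        return "\n;" + text + "\n;\n"
    if "'" in text:
        return '"' + text + '"'  # single quotes inside: wrap in double quotes
    if '"' in text or any(ch.isspace() for ch in text):
        return "'" + text + "'"  # double quotes or spaces: wrap in single quotes
    return text  # plain token, no quoting needed


print(quote_mmcif_value("CALCIUM ION"))
print(quote_mmcif_value("It's a test"))
```

Returning the semicolon field with its own leading and trailing newline keeps the "`;` at the beginning of the line" rule independent of where the value sits in a row.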

Default format for tables in mmCIF:

 data_structure_id (use value of structure_id)

 loop_
 _table_name.column_name_1
 _table_name.column_name_2
 ...
 ...
 ...
 _table_name.column_name_n
 Row_1_column_value_1      Row_1_column_value_2 .........     Row_1_column_value_n
 ....
 ....
 ....
 Row_m_column_value_1      Row_m_column_value_2 .........     Row_m_column_value_n
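
As an illustration of the default format, the following sketch writes one table as a `loop_` block. The function name `format_loop` and the row representation (a list of dicts) are assumptions made for this example; values are assumed to be already quoted where needed. It also applies two of the requirements: the `structure_id` column is dropped, and a table with zero rows produces no output.

```python
def format_loop(table_name, columns, rows):
    """Render one table as an mmCIF loop_ block (hypothetical helper)."""
    if not rows:
        return ""  # tables with zero rows are omitted from the file
    # structure_id is not exported
    columns = [col for col in columns if col != "structure_id"]
    lines = ["loop_"]
    lines += [f"_{table_name}.{col}" for col in columns]
    for row in rows:
        # '.' marks a missing value; values are assumed to be pre-quoted
        cells = ["." if row.get(col) is None else str(row[col]) for col in columns]
        lines.append(" ".join(cells))
    lines.append("#")  # separator between tables
    return "\n".join(lines) + "\n"


rows = [
    {"structure_id": "PDBDEV_00000009", "id": 1, "type": "polymer"},
    {"structure_id": "PDBDEV_00000009", "id": 2, "type": "non-polymer"},
]
print(format_loop("entity", ["structure_id", "id", "type"], rows))
```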

Some examples:

    loop_
    _entity.id
    _entity.type
    _entity.src_method
    _entity.pdbx_description
    _entity.formula_weight
    _entity.pdbx_number_of_molecules
    1 polymer     man "C1q subunits A, C, and B" 45697.594 1
    2 non-polymer man N-ACETYL-D-GLUCOSAMINE     221.208   1
    3 non-polymer syn 'CALCIUM ION'              .         1
    #
    loop_
    _entity_poly.entity_id
    _entity_poly.type
    _entity_poly.nstd_linkage
    _entity_poly.nstd_monomer
    _entity_poly.pdbx_seq_one_letter_code
    _entity_poly.pdbx_seq_one_letter_code_can
    1 'polypeptide(L)' no no
    ;KDQPRPAFSAIRRNPPMGGNVVIFDTVITNQEEPYQNHSGRFVCTVPGYYYFTFQVLSQWEICLSIVSSSRGQVRRSLGF
    CDTTNKGLFQVVSGGMVLQLQQGDQVWVEKDPKKGHIYQGSEADSVFSGFLIFPSAGSGKQKFQSVFTVTRQTHQPPAPN
    SLIRFNAVLTNPQGDYDTSTGKFTCKVPGLYYFVYHASHTANLCVLLYRSGVKVVTFCGHTSKTNQVNSGGVLLRLQVGE
    EVWLAVNDYYDMVGIQGSDSVFSGFLLFPDGSAKATQKIAFSATRTINVPLRRDQTIRFDHVITNMNNNYEPRSGKFTCK
    VPGLYYFTYHASSRGNLCVNLMRGRERAQKVVTFCDYAYNTFQVTTGGMVLKLEQGENVFLQATDKNSLLGMEGANSIFS
    GFLLFPDMEA
    ;
    ;KDQPRPAFSAIRRNPPMGGNVVIFDTVITNQEEPYQNHSGRFVCTVPGYYYFTFQVLSQWEICLSIVSSSRGQVRRSLGF
    CDTTNKGLFQVVSGGMVLQLQQGDQVWVEKDPKKGHIYQGSEADSVFSGFLIFPSAGSGKQKFQSVFTVTRQTHQPPAPN
    SLIRFNAVLTNPQGDYDTSTGKFTCKVPGLYYFVYHASHTANLCVLLYRSGVKVVTFCGHTSKTNQVNSGGVLLRLQVGE
    EVWLAVNDYYDMVGIQGSDSVFSGFLLFPDGSAKATQKIAFSATRTINVPLRRDQTIRFDHVITNMNNNYEPRSGKFTCK
    VPGLYYFTYHASSRGNLCVNLMRGRERAQKVVTFCDYAYNTFQVTTGGMVLKLEQGENVFLQATDKNSLLGMEGANSIFS
    GFLLFPDMEA
    ;
    #
    loop_
    _entity_poly_seq.entity_id
    _entity_poly_seq.num
    _entity_poly_seq.mon_id
    _entity_poly_seq.hetero
    1 1  LYS n
    1 2  ASP n
    1 3  GLN n
    1 4  PRO n
    1 5  ARG n
    1 6  PRO n
    1 7  ALA n
    1 8  PHE n
    1 9  SER n
    1 10 ALA n
    1 11 ILE n
    1 12 ARG n
    1 13 ARG n
    1 14 ASN n
    1 15 PRO n
    #

hongsudt commented 4 years ago

Regarding ;:

Does the ; have to be at the beginning of the line?
Does the text always have to start right after the first ;?

In the above example, I assume that the first ;*; is for _entity_poly.pdbx_seq_one_letter_code and the second one is for _entity_poly.pdbx_seq_one_letter_code_can?

brindakv commented 4 years ago

Yes, ; has to be at the beginning of the line.
No, the text does not have to start right after the first ;
Yes, the first ;*; is for _entity_poly.pdbx_seq_one_letter_code and the second ;*; is for _entity_poly.pdbx_seq_one_letter_code_can
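
These answers suggest a simple line-oriented way to read a semicolon-delimited field. The sketch below is an illustration only, under the assumption that the file has already been split into lines; the function name `read_semicolon_field` is hypothetical. A field opens at a line that starts with `;` and closes at the next line that starts with `;`, and the text may begin on the opening line or on the line after it.

```python
def read_semicolon_field(lines, start):
    """Read one ';'-delimited field that opens at lines[start].

    Returns (text, index_of_the_line_after_the_closing_semicolon).
    """
    if not lines[start].startswith(";"):
        raise ValueError("a text field must open with ';' at the start of a line")
    parts = [lines[start][1:]]  # text may or may not follow the opening ';'
    i = start + 1
    while i < len(lines) and not lines[i].startswith(";"):
        parts.append(lines[i])
        i += 1
    if i == len(lines):
        raise ValueError("unterminated text field")
    return "\n".join(parts), i + 1


lines = [";KDQPRP", "GFLLFP", ";", ";", "ABC", ";"]
first, nxt = read_semicolon_field(lines, 0)
second, _ = read_semicolon_field(lines, nxt)
```

Reading two fields back to back, as in the `_entity_poly` example, is then just two calls, with the second starting where the first stopped.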

svoinea commented 4 years ago

@brindakv Please review the file PDBDEV_00000009.txt which represents the output for structure_id=PDBDEV_00000009.

svoinea commented 4 years ago

The new file PDBDEV_00000009.txt sorts the rows by primary key and includes the contents of the PDBDEV_00000009.cif file at the bottom.

brindakv commented 4 years ago

Some tables need to be appended from the mmCIF file uploaded by the user:

atom_site
ihm_starting_model_coord
ihm_sphere_obj_site
ihm_gaussian_obj_site
ihm_gaussian_obj_ensemble
pdbx_poly_seq_scheme
pdbx_nonpoly_scheme

Not all entries will have all of the above tables.

hongsudt commented 4 years ago

Are you implying that we throw away data from the mmCIF file when we process it in step 2?

brindakv commented 4 years ago

That data was loaded into the Deriva database in step 2. If users add or modify that data in steps 3 or 4, then the database will have more up-to-date information than the mmCIF file from step 2. The only information that is not in the database is the xyz coordinates, which need to be taken from the mmCIF file from step 2.

svoinea commented 4 years ago

For the structure_id=PDBDEV_00000009, my script detected the following additional tables present in the PDBDEV_00000009.cif file:

    entry
    pdbx_database_status
    pdbx_audit_revision_history
    pdbx_audit_revision_details
    software
    pdbx_poly_seq_scheme
    ihm_starting_model_coord
    atom_site

The new export output is PDBDEV_00000009.txt. My script first writes the data of the tables from the Deriva database, and then selects from the PDBDEV_00000009.cif file the data of the tables that are NOT present in the Deriva database.
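
The selection described here can be sketched as follows. This is not the actual script: the function name `merge_tables` and the representation of each source as a dict mapping table names to row lists are assumptions for illustration.

```python
def merge_tables(db_tables, cif_tables):
    """Database tables first, then tables found only in the uploaded mmCIF file."""
    merged = dict(db_tables)  # the database holds the most up-to-date data
    for name, rows in cif_tables.items():
        if name not in merged:  # append only tables missing from the database
            merged[name] = rows
    return merged


db_tables = {"entry": [{"id": "PDBDEV_00000009"}], "software": [{"name": "from-db"}]}
cif_tables = {"software": [{"name": "from-cif"}], "atom_site": [{"id": 1}]}
merged = merge_tables(db_tables, cif_tables)
print(list(merged))
```

Because Python dicts preserve insertion order, the database tables come first in the merged result and the mmCIF-only tables follow, which matches the order of writing described in the comment.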

svoinea commented 4 years ago

I have set the entry table from the deriva database as the first table in the export output:

loop_
_entry.id
PDBDEV_00000009
#

I am uploading the zip file that contains the PDBDEV_00000009.cif file: PDBDEV_00000009.cif.zip

brindakv commented 4 years ago

The loop_ directive is not required for the entry table because by convention an mmCIF file only contains one entry. There may be other software applications that assume this convention as a rule.
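
Without `loop_`, a single-row table is written as one item name and value per line. A minimal sketch, assuming the row is a dict and the values are already quoted where needed; the function name `format_single_row` is hypothetical.

```python
def format_single_row(table_name, row):
    """Write a one-row table as '_table.column value' lines, without loop_."""
    lines = [f"_{table_name}.{col} {val}" for col, val in row.items()]
    lines.append("#")  # separator before the next table
    return "\n".join(lines) + "\n"


print(format_single_row("entry", {"id": "PDBDEV_00000009"}))
```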

svoinea commented 4 years ago

The new file PDBDEV_00000009.cif.zip has the entry table without the loop_ directive, and the exported file includes only those rows of the uploaded mmCIF file that belong to the tables specified above.

brindakv commented 4 years ago

There are two consecutive loop_ directives on lines 1200/1201 and 3081/3082 in the latest mmCIF file. These lead to syntax errors while validating.
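
Until a validator is available, a quick pre-check can catch this particular error. The sketch below is a rough line-based scan, not a replacement for validation against the mmCIF dictionary; the function name `find_consecutive_loops` is hypothetical, and the scan does not account for a `loop_` line that happens to occur inside a semicolon-delimited text field.

```python
def find_consecutive_loops(text):
    """Return (line, line) pairs where one loop_ directly follows another."""
    hits = []
    pending = None  # line number of a loop_ not yet followed by an item name
    for line_no, line in enumerate(text.splitlines(), start=1):
        token = line.strip()
        if not token or token.startswith("#"):
            continue  # blank lines and comments do not separate the directives
        if token == "loop_":
            if pending is not None:
                hits.append((pending, line_no))
            pending = line_no
        else:
            pending = None
    return hits


sample = "loop_\nloop_\n_a.b\n1\n#\nloop_\n_c.d\n2\n"
print(find_consecutive_loops(sample))
```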

svoinea commented 4 years ago

Sorry for the inconvenience. I don't have the validator yet. I am re-uploading the latest result with the correction: PDBDEV_00000009.cif.zip

brindakv commented 4 years ago

Thank you @svoinea. The latest mmCIF file validates.

brindakv commented 4 years ago

I tried validating mmCIF files for different entries. Some of these have data with single / double quotes. They all validate.

brindakv commented 3 years ago

The mmCIF file generated for entry D_1-P1ZE does not validate with the mmCIF dictionary tool.

The error is in handling text which contains quotes in the ihm_modeling_protocol table.

svoinea commented 2 years ago

We found that the mmCIF validation tool treats the different kinds of double-quote characters (", “, ...) used in different languages as equivalent. That breaks the rule:

 if the value does not contain `"`, then enclose the value in `"`

As a workaround, my script applies that rule to every kind of double-quote character.
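
A minimal sketch of such a workaround follows. The set of characters in `DOUBLE_QUOTE_CHARS` is an assumption chosen for illustration; the thread does not list which characters the script actually checks. The function name `enquote` is hypothetical as well.

```python
# Assumed set of double-quote look-alikes: ASCII ", curly and low quotes,
# guillemets, and the full-width quotation mark.
DOUBLE_QUOTE_CHARS = '"\u201c\u201d\u201e\u201f\u00ab\u00bb\uff02'


def enquote(text):
    """Quote a single-line value, treating every double-quote variant alike."""
    has_double = any(ch in DOUBLE_QUOTE_CHARS for ch in text)
    has_single = "'" in text
    if not has_double:
        return '"' + text + '"'
    if not has_single:
        return "'" + text + "'"
    # Both kinds present: fall back to a semicolon-delimited field.
    return "\n;" + text + "\n;\n"


print(enquote("\u201cfancy\u201d quotes"))
```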

svoinea commented 2 years ago

mmCIF allows using the . character for default values of columns. If the JSON file does not contain a column that:

- has the constraint `NOT NULL`,
- is not a primary or foreign key
- is of type `text`

then the backend script sets its value to `.`.

The following columns currently qualify:

{
    "ihm_starting_model_details": [
        "starting_model_auth_asym_id"
    ],
    "pdbx_protein_info": [
        "name"
    ],
    "pdbx_inhibitor_info": [
        "name"
    ],
    "software": [
        "classification",
        "name"
    ],
    "ihm_dataset_related_db_reference": [
        "accession_code"
    ],
    "ihm_starting_comparative_models": [
        "starting_model_auth_asym_id",
        "template_auth_asym_id"
    ],
    "ihm_chemical_component_descriptor": [
        "auth_name"
    ],
    "audit_author": [
        "name"
    ],
    "struct_ref": [
        "db_code"
    ],
    "ihm_probe_list": [
        "probe_name"
    ],
    "ihm_external_reference_info": [
        "reference"
    ],
    "chem_comp_atom": [
        "type_symbol"
    ],
    "ihm_starting_model_seq_dif": [
        "db_asym_id",
        "db_comp_id"
    ],
    "pdbx_ion_info": [
        "name"
    ]
}
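
The default-value rule can be sketched as below. This is an illustration under stated assumptions, not the backend script: the function name `apply_dot_defaults` is hypothetical, a row is assumed to be a dict parsed from the JSON file, and only two of the tables listed above are included in the example mapping.

```python
# Excerpt of the candidate columns listed above.
DOT_DEFAULT_COLUMNS = {
    "software": ["classification", "name"],
    "audit_author": ["name"],
}


def apply_dot_defaults(table_name, row, candidates=DOT_DEFAULT_COLUMNS):
    """Return a copy of row with '.' filled in for missing candidate columns."""
    filled = dict(row)
    for col in candidates.get(table_name, []):
        if filled.get(col) is None:  # column absent or null in the JSON row
            filled[col] = "."
    return filled


print(apply_dot_defaults("software", {"pdbx_ordinal": 1, "name": "IMP"}))
```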

svoinea commented 2 years ago

mmCIF has some columns of type ucode, which accept case-insensitive text values. If such a column is also a foreign key, its values need to be synchronized with the values in the referenced table.

Currently, the backend script handles this only for the following vocabulary tables, which contain only Yes/No values:

{
    "ihm_poly_residue_feature_interface_residue_flag": [
        "NO",
        "YES"
    ],
    "ihm_modeling_protocol_details_ensemble_flag": [
        "NO",
        "YES"
    ],
    "pseudo_site_flag": [
        "Yes",
        "No"
    ],
    "ihm_modeling_protocol_details_multi_state_flag": [
        "NO",
        "YES"
    ],
    "ihm_probe_list_reactive_probe_flag": [
        "no",
        "yes"
    ],
    "ihm_3dem_restraint_map_segment_flag": [
        "NO",
        "YES"
    ],
    "ihm_dataset_list_database_hosted": [
        "NO",
        "YES"
    ],
    "ihm_modeling_protocol_details_ordered_flag": [
        "NO",
        "YES"
    ],
    "ihm_2dem_class_average_restraint_image_segment_flag": [
        "NO",
        "YES"
    ],
    "ihm_modeling_protocol_details_multi_scale_flag": [
        "NO",
        "YES"
    ],
    "ihm_poly_probe_position_modification_flag": [
        "no",
        "yes"
    ],
    "ihm_poly_probe_position_mutation_flag": [
        "no",
        "yes"
    ],
    "sub_sample_flag": [
        "Yes",
        "No"
    ],
    "ihm_sas_restraint_profile_segment_flag": [
        "NO",
        "YES"
    ],
    "ihm_poly_probe_conjugate_ambiguous_stoichiometry_flag": [
        "no",
        "yes"
    ]
}
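
One way to do this synchronization is a case-insensitive lookup against the vocabulary values, sketched below. The function name `canonical_vocab_value` is hypothetical, and the vocabulary is assumed to be available as a plain list of strings such as the ones shown above.

```python
def canonical_vocab_value(value, vocab_values):
    """Map a case-insensitive value to the spelling stored in the vocabulary table."""
    lookup = {v.lower(): v for v in vocab_values}
    try:
        return lookup[value.lower()]
    except KeyError:
        raise ValueError(f"{value!r} is not one of {vocab_values}") from None


# The same input resolves differently depending on the vocabulary table.
print(canonical_vocab_value("yes", ["NO", "YES"]))
print(canonical_vocab_value("yes", ["Yes", "No"]))
```

Raising an error for unknown values, instead of passing them through, keeps the foreign key constraint from failing later with a less specific message.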

informatics-isi-edu / pdb-ihm

mmCIF file generation #41