For `;`, does `;` have to be at the beginning of the line? In the example, I assume that the first `;*;` is for `_entity_poly.pdbx_seq_one_letter_code` and the second one is for `_entity_poly.pdbx_seq_one_letter_code_can`?
`;` has to be at the beginning of the line. The first `;*;` is for `_entity_poly.pdbx_seq_one_letter_code` and the second `;*;` is for `_entity_poly.pdbx_seq_one_letter_code_can`.
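To make the rule concrete, here is a minimal Python sketch of how a multi-line value could be wrapped so that both `;` delimiters land at the start of a line. The function name `text_field` is hypothetical and not taken from the export script; the sketch also assumes that no line inside the value itself begins with `;`.

```python
def text_field(value: str) -> str:
    """Wrap a value in an mmCIF semicolon-delimited text field.

    Assumption: no line of `value` starts with ';' (that would end the
    field early and is not handled here).
    """
    body = value.rstrip("\n")
    # The leading newline puts the opening ';' in column 1; the closing ';'
    # gets a line of its own, so it is in column 1 as well.
    return "\n;" + body + "\n;\n"


if __name__ == "__main__":
    print("1 'polypeptide(L)' no no" + text_field("KDQPRP\nCDTTNK"), end="")
```

Writing two such fields back to back gives the two `;*;` blocks discussed above, one per sequence column.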
@brindakv Please review the file PDBDEV_00000009.txt, which represents the output for `structure_id=PDBDEV_00000009`.
The new file PDBDEV_00000009.txt sorts the rows by primary key (PK) and includes the PDBDEV_00000009.cif file at the bottom.
Some tables need to be appended from the mmCIF file uploaded by the user:
- `atom_site`
- `ihm_starting_model_coord`
- `ihm_sphere_obj_site`
- `ihm_gaussian_obj_site`
- `ihm_gaussian_obj_ensemble`
- `pdbx_poly_seq_scheme`
- `pdbx_nonpoly_scheme`
Not all entries will have all of the above tables.
Are you implying that we throw away data from the mmCIF file when we processed it in step 2?
From: Brinda Vallat notifications@github.com Sent: Thursday, July 16, 2020 11:18 AM To: informatics-isi-edu/protein-database protein-database@noreply.github.com Cc: Hongsuda Tangmunarunkit hongsuda@isi.edu; Comment comment@noreply.github.com Subject: Re: [informatics-isi-edu/protein-database] mmCIF file generation (#41)
That data was loaded into the Deriva database in step 2. If users add or modify that data in steps 3 or 4, then the database will have more up-to-date information than the mmCIF file from step 2. The only information that is not in the database is the xyz coordinates, which need to be taken from the mmCIF file from step 2.
For `structure_id=PDBDEV_00000009`, my script detected the following additional tables present in the PDBDEV_00000009.cif file:
- `entry`
- `pdbx_database_status`
- `pdbx_audit_revision_history`
- `pdbx_audit_revision_details`
- `software`
- `pdbx_poly_seq_scheme`
- `ihm_starting_model_coord`
- `atom_site`
The new export output is PDBDEV_00000009.txt.
My script first writes the data of the tables from the Deriva database, and then selects from the PDBDEV_00000009.cif file the data of the tables that are NOT present in the Deriva database.
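The selection logic described here can be sketched roughly as follows in Python. This is an illustration under assumptions, not the actual export script: tables are modelled as a plain dict mapping table name to a list of row dicts, and `merge_tables` and the sample rows are made up for the example.

```python
def merge_tables(db_tables: dict, cif_tables: dict) -> dict:
    """Database tables first; then only those mmCIF tables the database lacks."""
    merged = dict(db_tables)  # keeps the database tables and their order
    for name, rows in cif_tables.items():
        if name not in merged:  # the database copy is the most up-to-date one
            merged[name] = rows
    return merged


if __name__ == "__main__":
    # Illustrative rows only; real tables have many more columns.
    db = {"entity": [{"id": "1", "type": "polymer"}]}
    cif = {
        "entity": [{"id": "1", "type": "stale value"}],
        "atom_site": [{"id": "1", "Cartn_x": "1.0"}],
    }
    print(list(merge_tables(db, cif)))
```

With this approach a table such as `atom_site`, which holds coordinates and exists only in the uploaded file, is carried over, while any table also present in the database is taken from the database.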
I have set the `entry` table from the Deriva database as the first table in the export output:
loop_
_entry.id
PDBDEV_00000009
#
I am uploading the zip file that contains the PDBDEV_00000009.cif file: PDBDEV_00000009.cif.zip
The `loop_` directive is not required for the `entry` table because, by convention, an mmCIF file contains only one entry. Other software applications may assume this convention as a rule.
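For a single-row table such as `entry`, the non-looped form is one `_table.column value` pair per line. A minimal Python sketch, assuming the values are simple tokens that need no quoting; `format_single_row` is a hypothetical helper name:

```python
def format_single_row(table: str, row: dict) -> str:
    """Write a one-row table as key-value pairs, without a loop_ directive."""
    lines = [f"_{table}.{column} {value}" for column, value in row.items()]
    lines.append("#")  # separator line between tables
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_single_row("entry", {"id": "PDBDEV_00000009"}), end="")
```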
The new file PDBDEV_00000009.cif.zip has the `entry` table without the `loop_` directive and includes in the exported file only the rows of the mmCIF file that belong to the tables specified above.
There are two consecutive `loop_` directives on lines 1200/1201 and 3081/3082 in the latest mmCIF file. These lead to syntax errors while validating.
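Until a validator is available, a quick pre-check for this specific problem could look like the Python sketch below. It is only a line-based heuristic under assumptions: it ignores blank lines and `#` comment lines between the two directives, and it does not know about semicolon text fields, so a line reading `loop_` inside such a field would be misreported. The name `consecutive_loops` is hypothetical.

```python
def consecutive_loops(lines):
    """Return (first, second) 1-based line numbers of back-to-back loop_ directives."""
    hits = []
    previous = None  # line number of a loop_ not yet followed by any content
    for number, line in enumerate(lines, start=1):
        token = line.strip()
        if not token or token.startswith("#"):
            continue  # blank and comment lines do not separate directives
        if token == "loop_":
            if previous is not None:
                hits.append((previous, number))
            previous = number
        else:
            previous = None
    return hits


if __name__ == "__main__":
    sample = ["loop_", "loop_", "_entry.id", "PDBDEV_00000009", "#"]
    print(consecutive_loops(sample))
```

To check an exported file, the same function can be called on `open(path).read().splitlines()`.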
Sorry for the inconvenience. I don't have the validator yet. I am re-uploading the latest result with the correction: PDBDEV_00000009.cif.zip
Thank you @svoinea. The latest mmCIF file validates.
I tried validating mmCIF files for different entries. Some of these have data with single / double quotes. They all validate.
The mmCIF file generated for entry `D_1-P1ZE` does not validate with the mmCIF dictionary tool.
The error is in the handling of text that contains quotes in the `ihm_modeling_protocol` table.
We detected that different kinds of double-quote characters (`"`, `“`, ...), specific to different languages, are treated alike by the mmCIF validation tool.
That breaks the rule:
if the value does not contain `"`, then enclose the value in `"`
As a workaround, my script applies that rule to any double-quote character.
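A rough Python sketch of this workaround follows. The exact set of characters that the validation tool treats like `"` is not listed in this thread, so `DOUBLE_QUOTE_LIKE` is an assumed, probably incomplete set, and `quote_value` is a hypothetical name. The fallback to single quotes and then to a `;` text field follows the quoting requirements stated in the issue description.

```python
# Assumption: characters handled like a plain double quote. The real set
# used by the validation tool is not given in this thread.
DOUBLE_QUOTE_LIKE = '"\u201c\u201d'


def quote_value(value: str) -> str:
    """Pick a delimiter for a text value, treating look-alike quotes as '"'."""
    has_double = any(ch in DOUBLE_QUOTE_LIKE for ch in value)
    has_single = "'" in value
    if "\n" in value or (has_double and has_single):
        # Multi-line text, or text with both quote kinds: ';' text field.
        return "\n;" + value + "\n;"
    if not has_double:
        return '"' + value + '"'
    return "'" + value + "'"


if __name__ == "__main__":
    for sample in ["CALCIUM ION", "a \u201cquoted\u201d word"]:
        print(quote_value(sample))
```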
mmCIF allows using the `.` character for default values of columns.
If the JSON file does not contain a column that:
- has the constraint `NOT NULL`,
- is not a primary or foreign key,
- is of type `text`,
then the backend script will set the value `.` for it.
Currently, the following columns are candidates for it:
{
"ihm_starting_model_details": [
"starting_model_auth_asym_id"
],
"pdbx_protein_info": [
"name"
],
"pdbx_inhibitor_info": [
"name"
],
"software": [
"classification",
"name"
],
"ihm_dataset_related_db_reference": [
"accession_code"
],
"ihm_starting_comparative_models": [
"starting_model_auth_asym_id",
"template_auth_asym_id"
],
"ihm_chemical_component_descriptor": [
"auth_name"
],
"audit_author": [
"name"
],
"struct_ref": [
"db_code"
],
"ihm_probe_list": [
"probe_name"
],
"ihm_external_reference_info": [
"reference"
],
"chem_comp_atom": [
"type_symbol"
],
"ihm_starting_model_seq_dif": [
"db_asym_id",
"db_comp_id"
],
"pdbx_ion_info": [
"name"
]
}
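As an illustration of the defaulting rule, here is a minimal Python sketch. It assumes each record arrives as a dict and that the candidate columns are available as a mapping like the one listed above (only two entries are copied here); `fill_defaults` is a hypothetical name, not the backend script's function.

```python
# Two entries copied from the candidate list above, for illustration.
DEFAULT_DOT_COLUMNS = {
    "software": ["classification", "name"],
    "audit_author": ["name"],
}


def fill_defaults(table: str, row: dict, candidates: dict = DEFAULT_DOT_COLUMNS) -> dict:
    """Return a copy of `row` with '.' set for candidate columns that are absent."""
    filled = dict(row)
    for column in candidates.get(table, []):
        # Only columns missing from the JSON record are defaulted.
        filled.setdefault(column, ".")
    return filled


if __name__ == "__main__":
    print(fill_defaults("software", {"name": "IMP"}))
```

Tables that are not in the mapping are returned unchanged, and existing values are never overwritten.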
mmCIF has some columns of type `ucode`, which allows you to specify a case-insensitive text value.
If the column is also a foreign key, then it needs to be synchronized with the values from the referenced table.
Currently, the backend script handles such cases only for the following vocabulary tables, which have only Yes/No values:
{
"ihm_poly_residue_feature_interface_residue_flag": [
"NO",
"YES"
],
"ihm_modeling_protocol_details_ensemble_flag": [
"NO",
"YES"
],
"pseudo_site_flag": [
"Yes",
"No"
],
"ihm_modeling_protocol_details_multi_state_flag": [
"NO",
"YES"
],
"ihm_probe_list_reactive_probe_flag": [
"no",
"yes"
],
"ihm_3dem_restraint_map_segment_flag": [
"NO",
"YES"
],
"ihm_dataset_list_database_hosted": [
"NO",
"YES"
],
"ihm_modeling_protocol_details_ordered_flag": [
"NO",
"YES"
],
"ihm_2dem_class_average_restraint_image_segment_flag": [
"NO",
"YES"
],
"ihm_modeling_protocol_details_multi_scale_flag": [
"NO",
"YES"
],
"ihm_poly_probe_position_modification_flag": [
"no",
"yes"
],
"ihm_poly_probe_position_mutation_flag": [
"no",
"yes"
],
"sub_sample_flag": [
"Yes",
"No"
],
"ihm_sas_restraint_profile_segment_flag": [
"NO",
"YES"
],
"ihm_poly_probe_conjugate_ambiguous_stoichiometry_flag": [
"no",
"yes"
]
}
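One way to synchronize a case-insensitive value with the spelling stored in its vocabulary table is a case-folded lookup, sketched below in Python. The function name `canonical_flag` is hypothetical, and raising an error for unknown values is an assumption about the desired behaviour, not something stated in this thread.

```python
def canonical_flag(value: str, allowed: list) -> str:
    """Map a case-insensitive value to the spelling used by the vocabulary table."""
    by_folded = {term.casefold(): term for term in allowed}
    try:
        return by_folded[value.casefold()]
    except KeyError:
        raise ValueError(f"{value!r} is not one of {allowed}") from None


if __name__ == "__main__":
    # Vocabulary spellings taken from the list above.
    print(canonical_flag("yes", ["NO", "YES"]))
    print(canonical_flag("NO", ["Yes", "No"]))
```

Because the three spellings in use (`YES`/`NO`, `Yes`/`No`, `yes`/`no`) differ per table, the allowed list has to be looked up for the specific vocabulary table the foreign key refers to.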
Requirements for extracting catalog data into mmCIF (to be used as an input to a separate pdb system)
- Data needs to be exported entry-wise, i.e., only data belonging to a particular entry (as denoted by `entry.id`) is to be exported
- Only tables and columns in the JSON schema need to be exported
- Space or tabs can be used to separate column values in a row
- If there is an optional column in a table and some rows in the table have values and some rows don't, then `.` can be used to denote missing values
- Single or double quotes can be used for textual column values that contain spaces
- Multi-line texts are enclosed within `;` (see example below)
- If the text contains single quotes, then they can be enclosed within double quotes and vice versa. If the text contains both single and double quotes, then they are enclosed within `;` like multi-line texts.
- `#` can be used to add empty lines between tables
- When vocab tables are used, the corresponding values should be used to populate the mmCIF tables (see `entity.type` in the example below).
- If a table returns zero rows for a particular `structure_id`, then the table need not be included in the mmCIF file
- The `structure_id` column in each table need not be included in the mmCIF file
- Default format for tables in mmCIF:
Some examples:
loop_
_entity.id
_entity.type
_entity.src_method
_entity.pdbx_description
_entity.formula_weight
_entity.pdbx_number_of_molecules
1 polymer man "C1q subunits A, C, and B" 45697.594 1
2 non-polymer man N-ACETYL-D-GLUCOSAMINE 221.208 1
3 non-polymer syn 'CALCIUM ION' . 1
#
loop_
_entity_poly.entity_id
_entity_poly.type
_entity_poly.nstd_linkage
_entity_poly.nstd_monomer
_entity_poly.pdbx_seq_one_letter_code
_entity_poly.pdbx_seq_one_letter_code_can
1 'polypeptide(L)' no no
;KDQPRPAFSAIRRNPPMGGNVVIFDTVITNQEEPYQNHSGRFVCTVPGYYYFTFQVLSQWEICLSIVSSSRGQVRRSLGF
CDTTNKGLFQVVSGGMVLQLQQGDQVWVEKDPKKGHIYQGSEADSVFSGFLIFPSAGSGKQKFQSVFTVTRQTHQPPAPN
SLIRFNAVLTNPQGDYDTSTGKFTCKVPGLYYFVYHASHTANLCVLLYRSGVKVVTFCGHTSKTNQVNSGGVLLRLQVGE
EVWLAVNDYYDMVGIQGSDSVFSGFLLFPDGSAKATQKIAFSATRTINVPLRRDQTIRFDHVITNMNNNYEPRSGKFTCK
VPGLYYFTYHASSRGNLCVNLMRGRERAQKVVTFCDYAYNTFQVTTGGMVLKLEQGENVFLQATDKNSLLGMEGANSIFS
GFLLFPDMEA
;
;KDQPRPAFSAIRRNPPMGGNVVIFDTVITNQEEPYQNHSGRFVCTVPGYYYFTFQVLSQWEICLSIVSSSRGQVRRSLGF
CDTTNKGLFQVVSGGMVLQLQQGDQVWVEKDPKKGHIYQGSEADSVFSGFLIFPSAGSGKQKFQSVFTVTRQTHQPPAPN
SLIRFNAVLTNPQGDYDTSTGKFTCKVPGLYYFVYHASHTANLCVLLYRSGVKVVTFCGHTSKTNQVNSGGVLLRLQVGE
EVWLAVNDYYDMVGIQGSDSVFSGFLLFPDGSAKATQKIAFSATRTINVPLRRDQTIRFDHVITNMNNNYEPRSGKFTCK
VPGLYYFTYHASSRGNLCVNLMRGRERAQKVVTFCDYAYNTFQVTTGGMVLKLEQGENVFLQATDKNSLLGMEGANSIFS
GFLLFPDMEA
;
#
loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
_entity_poly_seq.hetero
1 1 LYS n
1 2 ASP n
1 3 GLN n
1 4 PRO n
1 5 ARG n
1 6 PRO n
1 7 ALA n
1 8 PHE n
1 9 SER n
1 10 ALA n
1 11 ILE n
1 12 ARG n
1 13 ARG n
1 14 ASN n
1 15 PRO n
#