rtviii / ribosome.xyz-backend

Database Template

Refer to neo4j-steps.md for database creation.

The data that constitute the backbone of the database are found in /ribetl/resources/cumulativeData. These files should be present in neo4j's import folder (typically /var/lib/neo4j/import) when the database is created.

Create the base types and constraints of the database with the scripts contained in cypher.

ETL

When performing operations inside the neo4j directories (e.g. import), act as the neo4j user to avoid permission issues: sudo su neo4j

Process the download in bulk by applying driver.ts in parallel to all the PDB IDs in the download file. First replace the commas with newlines: sed -i "s/\,/\\n/g" rcsb_pdb_ids_20210926175604.txt
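
The comma-to-newline conversion can also be done in Python. This is a minimal sketch, assuming the download file holds PDB IDs separated by commas; the function name `split_pdb_ids` is hypothetical and not part of the repository.

```python
def split_pdb_ids(raw: str) -> list[str]:
    """Split a comma-separated string of PDB IDs into a clean list."""
    # Strip surrounding whitespace and drop empty entries
    # (for example, one left behind by a trailing comma).
    return [pdb_id.strip() for pdb_id in raw.split(",") if pdb_id.strip()]


# Example: one ID per line, as the parallel invocation expects.
print("\n".join(split_pdb_ids("3J7Z,4UG0, 5AFI,\n")))
```

Unlike `sed -i`, this leaves the original file untouched; the result can be written wherever is convenient.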

Graph Profiles

Environment variables are loaded in src/driver.ts:

 dotenv.config({ path: '/home/rxz/dev/ribetl/.env' });

The driver scripts transform the API's response into the form appropriate for database ingestion. For a given $RCSBID, the result is stored in /static/$RCSBID/$RCSBID.json.

Propagate any changes to the types to the induct scripts and to the front-end interface.

  1. We really value well-defined interfaces and the correspondence of types. Hence, they are specified in RibosomeTypes.ts. This is the basis for the Neo4j ontology as well as a guiding structure for the front-end's datatypes. If changes or additions are to be made to the application, they ought to begin here.

  2. The RCSB GraphQL endpoint is at https://data.rcsb.org/graphql. It is queried with the desired shape for each molecule. The template query is in template_query. The response shape should conform to the types (see 1.).

  3. The resultant .json profile is used to initiate nodes and links in the database. For a given structure, the script creates individual components in sequence. Refer to cypher.

    Currently and roughly:

    1. Create (MERGE) the structure node if one doesn't exist
    2. Find this structure; for each contained protein, create its node and connect it to the structure
    3. Connect proteins to PFAM families
    4. Connect proteins to nomenclature classes
    5. Find this structure; for each contained RNA, create its node and connect it to the structure
    6. Connect RNAs to nomenclature classes
    7. Connect ligands
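
The seven steps above can be sketched as an ordered list of Cypher statements paired with their parameters. This only illustrates the ordering and the use of MERGE to keep each step repeatable. Every label, relationship type, property name, and the shape of the profile dictionary are assumptions made for the example; the project's actual statements are in its cypher scripts.

```python
# Hypothetical labels, relationship types, and property names throughout;
# the project's real statements live in its cypher scripts.
INGESTION_STEPS = [
    # 1. Create the structure node only if it does not exist yet.
    "MERGE (s:Structure {rcsb_id: $rcsb_id})",
    # 2. Create each protein node and connect it to the structure.
    "MATCH (s:Structure {rcsb_id: $rcsb_id}) "
    "UNWIND $proteins AS p "
    "MERGE (n:Protein {parent: $rcsb_id, strand_id: p.strand_id}) "
    "SET n.pfam = p.pfam, n.nomenclature = p.nomenclature "
    "MERGE (n)-[:PART_OF]->(s)",
    # 3. Connect proteins to PFAM families.
    "MATCH (n:Protein {parent: $rcsb_id}) "
    "UNWIND n.pfam AS accession "
    "MATCH (f:PfamFamily {accession: accession}) "
    "MERGE (n)-[:IN_FAMILY]->(f)",
    # 4. Connect proteins to nomenclature classes.
    "MATCH (n:Protein {parent: $rcsb_id}) "
    "UNWIND n.nomenclature AS class_id "
    "MATCH (c:ProteinClass {class_id: class_id}) "
    "MERGE (n)-[:MEMBER_OF]->(c)",
    # 5. Create each RNA node and connect it to the structure.
    "MATCH (s:Structure {rcsb_id: $rcsb_id}) "
    "UNWIND $rnas AS r "
    "MERGE (n:RNA {parent: $rcsb_id, strand_id: r.strand_id}) "
    "SET n.nomenclature = r.nomenclature "
    "MERGE (n)-[:PART_OF]->(s)",
    # 6. Connect RNAs to nomenclature classes.
    "MATCH (n:RNA {parent: $rcsb_id}) "
    "UNWIND n.nomenclature AS class_id "
    "MATCH (c:RNAClass {class_id: class_id}) "
    "MERGE (n)-[:MEMBER_OF]->(c)",
    # 7. Connect ligands to the structure.
    "MATCH (s:Structure {rcsb_id: $rcsb_id}) "
    "UNWIND $ligands AS lig "
    "MERGE (n:Ligand {chemical_id: lig.chemical_id}) "
    "MERGE (n)-[:BOUND_IN]->(s)",
]


def ingestion_plan(profile: dict) -> list[tuple[str, dict]]:
    """Pair every statement, in order, with the parameters of one profile."""
    params = {
        "rcsb_id": profile["rcsb_id"],
        "proteins": profile.get("proteins", []),
        "rnas": profile.get("rnas", []),
        "ligands": profile.get("ligands", []),
    }
    return [(statement, params) for statement in INGESTION_STEPS]
```

Because every write uses MERGE, running the plan twice for the same structure should not create duplicate nodes or relationships.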

There is some ambiguity right now as to what to consider a Ligand. Some elongation factors land in the RP category because of their classification as a polypeptide.
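
Querying the RCSB GraphQL endpoint mentioned above might look like the following sketch. The query here is a small illustrative one and is not the project's template query; the function names are hypothetical, and the field selection (`rcsb_id`, `struct { title }`) is an assumption about what a caller might want.

```python
import json
import urllib.request

RCSB_GRAPHQL_URL = "https://data.rcsb.org/graphql"

# Illustrative query only; the project's real query lives in template_query.
EXAMPLE_QUERY = """
query ($id: String!) {
  entry(entry_id: $id) {
    rcsb_id
    struct { title }
  }
}
"""


def build_request(rcsb_id: str, query: str = EXAMPLE_QUERY) -> urllib.request.Request:
    """Build (but do not send) a POST request for one structure."""
    # Assumption: IDs are normalized to upper case before querying.
    payload = json.dumps({"query": query, "variables": {"id": rcsb_id.upper()}})
    return urllib.request.Request(
        RCSB_GRAPHQL_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_entry(rcsb_id: str) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(rcsb_id), timeout=30) as response:
        return json.load(response)
```

Keeping request construction separate from the network call makes it possible to check the payload shape without contacting the server.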


driver.ts can be applied to all structures in static like so:

parallel 'ts-node driver.ts -s {1}' ::: $(find ~/dev/riboxyzbackend/ribetl/static/ -type d    | awk -F '\/' '{print $8}' )
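
The structure IDs can also be collected in Python. Note that `find ... -type d` lists the static directory itself and any nested directories as well, so this sketch restricts itself to immediate subdirectories. The function names are hypothetical; only the `ts-node driver.ts -s` invocation is taken from the command above.

```python
from pathlib import Path


def structure_ids(static_dir: str) -> list[str]:
    """Return the names of the immediate subdirectories of static_dir, sorted."""
    return sorted(p.name for p in Path(static_dir).iterdir() if p.is_dir())


def driver_commands(static_dir: str) -> list[list[str]]:
    """Build one driver invocation per structure, without running anything."""
    return [["ts-node", "driver.ts", "-s", rcsb_id] for rcsb_id in structure_ids(static_dir)]
```

Each returned argument list could then be handed to `subprocess.run`, or written out one per line for `parallel`.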

Structural Files

Actual RCSB structures are stored in the batch_download folder (updated on each cycle). The query is of the form: all ribosome structures of resolution smaller than 4 angstrom deposited 2014.

  1. Each one has to have its chains renamed according to the new Ban nomenclature (assigned during graph profile generation). Scripts here. The result is to be deposited at /static/$RCSBID/$RCSBID.cif

rename.py can be applied to all structures in static like so:

parallel 'python3 rename.py {1}' :::  $(find ~/dev/riboxyzbackend/ribetl/static/ -type d   | awk -F '\/' '{print $8}')

  2. Process the binding sites of ligands, elongation factors etc. (including whatever else gets included). Scripts. The results are to be deposited at /static/$RCSBID/LIGAND_$LIGANDID*.json

  3. Split the structure into individual protein and RNA chains, to be deposited at /static/$RCSBID/CHAINS/$RCSBID_STRAND_$STRANID*.cif.
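
The deposit locations named in this section can be gathered into small path helpers so that every script builds them the same way. This is a sketch under assumptions: the helper names are hypothetical, and the `*` in the patterns above is treated as a wildcard that matches nothing.

```python
from pathlib import PurePosixPath

STATIC_ROOT = PurePosixPath("/static")


def structure_path(rcsb_id: str) -> PurePosixPath:
    """Renamed structure file: /static/$RCSBID/$RCSBID.cif"""
    return STATIC_ROOT / rcsb_id / f"{rcsb_id}.cif"


def ligand_path(rcsb_id: str, ligand_id: str) -> PurePosixPath:
    """Binding-site report: /static/$RCSBID/LIGAND_$LIGANDID.json"""
    return STATIC_ROOT / rcsb_id / f"LIGAND_{ligand_id}.json"


def chain_path(rcsb_id: str, strand_id: str) -> PurePosixPath:
    """Single chain: /static/$RCSBID/CHAINS/$RCSBID_STRAND_$STRANID.cif"""
    return STATIC_ROOT / rcsb_id / "CHAINS" / f"{rcsb_id}_STRAND_{strand_id}.cif"
```

For example, `chain_path("4UG0", "A")` yields `/static/4UG0/CHAINS/4UG0_STRAND_A.cif`.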

Todos

find . -name "*.json" | xargs grep 'rcsb_pdbx_description'  | awk -F  ':' ' $3 !~ /protein |RNA|rRNA|PROTEIN|Protein|mS|uL|UL|eL|bL|bS|BS|uS|eS|bS|mL|EL|rna|protein|ul|ml|RACK1/  {print $3}'
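
The command above appears to list every `rcsb_pdbx_description` value that matches none of the known protein, RNA, or nomenclature tokens. A rough Python equivalent is sketched below; the function names are hypothetical and the pattern is copied from the awk expression. It is not an exact port: it parses the JSON instead of splitting lines on `:`, so descriptions containing colons are not truncated, and non-string values are skipped.

```python
import json
import re
from pathlib import Path

# Alternation copied from the awk expression in the command above.
KNOWN = re.compile(
    r"protein |RNA|rRNA|PROTEIN|Protein|mS|uL|UL|eL|bL|bS|BS|uS|eS|bS|mL|EL"
    r"|rna|protein|ul|ml|RACK1"
)


def descriptions(node):
    """Yield every rcsb_pdbx_description string nested anywhere in parsed JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "rcsb_pdbx_description" and isinstance(value, str):
                yield value
            else:
                yield from descriptions(value)
    elif isinstance(node, list):
        for item in node:
            yield from descriptions(item)


def unclassified(root: str) -> list[str]:
    """Collect descriptions under root that match none of the known tokens."""
    hits = []
    for path in sorted(Path(root).rglob("*.json")):
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
        hits.extend(d for d in descriptions(data) if not KNOWN.search(d))
    return hits
```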