Find Annotation and Knowledge Graphs to integrate

josiahseaman commented 4 years ago

Assignee: Ali Haider Bangash

The first step is to identify what data could be integrated through a knowledge graph and what is already available. What did the other Hackathon teams accomplish? Findings go in this issue. We're looking for information that relates to genetic variants of the virus:

- Structural annotations channel: Protein structure => codon table => sequence position. We could mark up pangenome positions related to known protein variants.
- Gene annotations: Possibly we only need the reference gene annotation GFF, but it would be nice to have these positions in the graph genome context. @subwaystation has ensured we have coordinate transforms that go both ways (pangenome <-> reference genome coordinates), using FALDO in RDF.
- Clinical data: Possibly the most important. If we have any knowledge of patient outcomes and what region the patients are from, we could connect a strain of the virus (which will contain variants) to a patient outcome: how long in hospital, how long on a ventilator, etc. We don't necessarily need a viral sequence from that specific individual, but at minimum a probable association with a variant.
  - Human DNA variation data could also be used, as in the UK Biobank article.
  - Technically, annotating a complete human pangenome is beyond our current scope, in that gigabase genomes will put strain on our pipeline. It may be possible, however, to make local graphs of key regions like HLA or MHC inside the human genome.
- Phylogenetics: We're going to have a phylogenetic tree eventually: https://github.com/graph-genome/Schematize/issues/58. It'd be nice to link this with the "country" and "town" concepts in the knowledge graph. What geographic or transmission data could we bring in?
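To make the two-way coordinate transform mentioned under gene annotations concrete, here is a minimal sketch in plain Python. It is not the FALDO/RDF implementation referred to above. It assumes a linear path (such as the reference genome) is an ordered list of graph nodes with known lengths, that each node is visited at most once by that path, and that positions are 0-based. The class name `PathCoordinates` and the node ids are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


class PathCoordinates:
    """Translate between positions on one linear path and (node_id, offset)
    pairs in a graph. Hypothetical helper for illustration only."""

    def __init__(self, steps):
        # steps: ordered list of (node_id, node_length) along the path.
        # Assumption: the path visits each node at most once.
        self.node_ids = [node_id for node_id, _ in steps]
        lengths = [length for _, length in steps]
        # starts[i] is the path position of the first base of step i.
        self.starts = [0] + list(accumulate(lengths))[:-1]
        self.total = sum(lengths)
        self.index = {node_id: i for i, node_id in enumerate(self.node_ids)}

    def to_graph(self, position):
        """Path position -> (node_id, offset within that node)."""
        if not 0 <= position < self.total:
            raise ValueError("position outside path")
        # Find the last step whose start is <= position.
        i = bisect_right(self.starts, position) - 1
        return self.node_ids[i], position - self.starts[i]

    def to_path(self, node_id, offset):
        """(node_id, offset within that node) -> path position."""
        i = self.index[node_id]
        return self.starts[i] + offset


# A reference path that walks three nodes of lengths 5, 3 and 10.
reference = PathCoordinates([("n1", 5), ("n2", 3), ("n4", 10)])
node, offset = reference.to_graph(7)
round_trip = reference.to_path(node, offset)
```

A gene annotation given in reference coordinates (for example from a GFF) could be placed in the graph by converting its start and end positions this way. Paths that revisit a node would need a richer index than the single dictionary used here.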

hhaider15 commented 4 years ago

Clinical data: South Korea's COVID-19 patients, 5-year patient history.

The government of the Republic of Korea decided to share the world’s first de-identified COVID-19 nationwide patient data with domestic and international researchers. The data sets are collected and processed promptly, thanks to the Korean National Health Insurance System, covering the entire population across the nation.

hhaider15 commented 4 years ago

Structural annotations: Very well done by the Machine Learning working group. Complete genomes of the strains, labelled with the respective source and its metadata.

hhaider15 commented 4 years ago

Gene annotations: whole genome nucleotide data pulled from RVDB release 14 as labels. Metadata for human & non-human pathogen phenotypes

hhaider15 commented 4 years ago

Structure annotations: Amino acid sequence data for common cold CoV and SARS-CoV-2 for the M, E & S proteins, with metadata

hhaider15 commented 4 years ago

Genes & structural annotations: Proteomics data & MassIVE/CCMS Maestro+MSstats reanalysis of MSV000085096 / PXD017710: Proteome and Translatome of SARS-CoV-2 infected cells

subwaystation commented 4 years ago

Hi @hhaider15 ! Thanks for all the links. We could work with e.g. .csv or .fasta.

But what we had in mind are SPARQL endpoints, which we could query using SPARQL.

I think a good start would be http://yummydata.org/. And maybe you will find some endpoints which are not listed there ;) Please get back to me if you have more questions.
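As a rough sketch of what querying such an endpoint involves, the following uses only the Python standard library. It builds a GET request following the SPARQL 1.1 Protocol (`query` parameter, JSON results requested via the `Accept` header) and flattens a SPARQL JSON results document into plain dictionaries. The endpoint URL is a placeholder, not a real service, the function names are hypothetical, and the example query assumes the endpoint exposes FALDO regions with begin and end positions.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint_url, query):
    """Build a GET request for a SPARQL endpoint (SPARQL 1.1 Protocol).
    Assumes endpoint_url has no query string of its own."""
    url = endpoint_url + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def rows_from_results(results_json):
    """Flatten a SPARQL JSON results document into a list of plain dicts.
    Variables that are unbound in a row are simply left out of that row."""
    doc = json.loads(results_json)
    variables = doc["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in doc["results"]["bindings"]
    ]


# Assumption: the endpoint holds FALDO regions with begin/end positions.
EXAMPLE_QUERY = """
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
SELECT ?region ?begin ?end WHERE {
  ?region faldo:begin/faldo:position ?begin ;
          faldo:end/faldo:position ?end .
} LIMIT 10
"""

# Placeholder endpoint; replace with one found via e.g. yummydata.org.
request = build_sparql_request("https://example.org/sparql", EXAMPLE_QUERY)
```

The request can be sent with `urllib.request.urlopen(request)` and the response body passed to `rows_from_results`. Not every endpoint supports JSON results or GET requests for long queries, so a real client would need to check what the chosen endpoint accepts.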

subwaystation commented 4 years ago

@josiahseaman and Phylogenetics: As far as I understood from the #public_sequence_resource group, they will also pack the metadata into a SPARQL endpoint. Part of the metadata will be a mandatory field for collection_location. For the list of the required metadata, please visit https://github.com/arvados/bh20-seq-resource/blob/master/example/minimal_example.yaml.

innamoratika commented 4 years ago

Ali, just wanted to introduce myself post-convo with @josiahseaman: I'll be working on the phylo side of things, and we should touch base at some point regarding using universal IDs for genomes. We should have enough in the phylo tree that we can track provenance and pass that on to you!

hhaider15 commented 4 years ago

Agreed. Apologies, I was busy earlier. I'll be working on this now.

hhaider15 commented 4 years ago

Good to see you @innamoratika
