svm-ai / svm-hackathon


Dataset Card #14

Open merdivane opened 1 year ago

merdivane commented 1 year ago

Let's put datasets here. To avoid confusion: "dataset" refers to curated data, derived from a database, that we can use in ML and AI models. "Database" refers to a source such as ClinVar, which hosts raw data online; we still need to do the work of retrieving that data and converting it into a dataset.

Dataset Card

ecuracosta commented 1 year ago

Hey team,

I have some exciting news to share regarding building our datasets. I came across an API that I believe will be extremely valuable for our data requirements. It's called the Proteins API, provided by the European Bioinformatics Institute (EBI).

The Proteins service provides an interface for accessing UniProtKB entries and UniProtKB isoform entries. The features service provides protein functional annotations from UniProt Knowledgebase (UniProtKB) protein entries. The variation, proteomics and antigen services provide annotations imported and mapped from large scale data sources, such as 1000 Genomes, ExAC (Exome Aggregation Consortium), ClinVar (Clinical significance of Variants), TCGA (The Cancer Genome Atlas), COSMIC (Catalogue Of Somatic Mutations In Cancer), TOPMed (Trans-Omics for Precision Medicine), gnomAD (Genome Aggregation Database), PeptideAtlas, MaxQB (MaxQuant DataBase), EPD (Encyclopedia of Proteome Dynamics), ProteomicsDB and HPA, along with UniProtKB annotations for these feature types (if applicable). And there is more.

You can find the documentation for the API here: Proteins API Documentation

Using this API, I've already started experimenting with Python requests and have successfully retrieved responses. Here's an example code snippet I've been working on:

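import requests

# Keep the query string out of the URL and let requests build it from params.
base_url = "https://www.ebi.ac.uk/proteins/api/proteins"
params = {
    "offset": 0,
    "size": 100,
    "accession": "A0A1B0GTW7",  # replace with the accession of the protein you are interested in
    "format": "json"
}
# Also ask for JSON explicitly through the Accept header.
headers = {"Accept": "application/json"}

response = requests.get(base_url, params=params, headers=headers, timeout=30)

data = None  # stays None if the request fails, so later code can check for it
if response.status_code == 200:
    data = response.json()
else:
    print(f"Request failed with status code {response.status_code}")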

Some examples of what you can get:

  1. Inside "comments" you have "FUNCTION" where you can read: 'Putative metalloproteinase that plays a role in left-right patterning process'
  2. Also inside "comments" you have "DISEASE" where you can read: 'Heterotaxy, visceral, 12, autosomal' or a longer form 'A form of visceral heterotaxy, a complex disorder due to disruption of the normal left-right asymmetry of the thoracoabdominal organs. Visceral heterotaxy or situs ambiguus results in randomization of the placement of visceral organs, including the heart, lungs, liver, spleen, and stomach. The organs are oriented randomly with respect to the left-right axis and with respect to one another. It can be associated with a variety of congenital defects including cardiac malformations. Early death may occur. HTX12 inheritance is autosomal recessive.'
  3. You can also find sequence alterations leading to disease inside "features"
  4. References
  5. WT sequence
  6. And a HUGE amount of data from all the databases listed above...
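
To make the list above concrete, here is a minimal sketch of pulling those fields out of one entry. The key names it reads ("type", "text", "value", "diseaseId", "description", "sequence") and the "VARIANT" feature type are my assumptions about the response layout, and `summarize_entry` and `sample_entry` are hypothetical names; the sample is hand-written, not real API output, so check everything against an actual response first.

```python
def summarize_entry(entry):
    """Pull a few fields out of one protein entry (a dict).

    All key names used here are assumptions about the response layout.
    Missing keys are tolerated and simply produce empty results.
    """
    functions = []
    diseases = []
    for comment in entry.get("comments", []):
        if comment.get("type") == "FUNCTION":
            # Assumed shape: {"type": "FUNCTION", "text": [{"value": "..."}]}
            functions.extend(t.get("value", "") for t in comment.get("text", []))
        elif comment.get("type") == "DISEASE":
            # Assumed shape: {"type": "DISEASE", "diseaseId": "...",
            #                 "description": {"value": "..."}}
            diseases.append({
                "name": comment.get("diseaseId"),
                "description": comment.get("description", {}).get("value"),
            })

    # Assumed: sequence alterations are features whose "type" is "VARIANT".
    variants = [f for f in entry.get("features", []) if f.get("type") == "VARIANT"]

    return {
        "accession": entry.get("accession"),
        "functions": functions,
        "diseases": diseases,
        "variants": variants,
        "sequence": entry.get("sequence", {}).get("sequence"),
    }


# Hand-written stand-in for one entry. The FUNCTION and DISEASE texts are the
# ones quoted above; the features and the sequence are made up for illustration.
sample_entry = {
    "accession": "A0A1B0GTW7",
    "comments": [
        {"type": "FUNCTION", "text": [{"value": "Putative metalloproteinase that plays a role in left-right patterning process"}]},
        {"type": "DISEASE", "diseaseId": "Heterotaxy, visceral, 12, autosomal",
         "description": {"value": "A form of visceral heterotaxy"}},
    ],
    "features": [
        {"type": "CHAIN", "begin": "1", "end": "3"},
        {"type": "VARIANT", "begin": "2", "end": "2", "description": "made-up example"},
    ],
    "sequence": {"sequence": "MKT"},
}

summary = summarize_entry(sample_entry)
print(summary["functions"])
print([d["name"] for d in summary["diseases"]])
```

If the response turns out to be a list of entries, something like `[summarize_entry(e) for e in data]` should cover a whole page of results.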

We can use this API to build any dataset we want from (almost?) any source we want!
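
As a rough idea of the "database to dataset" step, here is a small sketch that flattens entries into one row per protein and disease pair and writes them as CSV. It uses only the standard library; `disease_rows`, `write_csv`, and the column names are hypothetical, and the entry layout ("comments", "diseaseId", "description") is again an assumption to verify against a real response.

```python
import csv
import io

FIELDNAMES = ["accession", "disease", "description"]


def disease_rows(entries):
    """Yield one flat row per (protein, disease) pair.

    Assumes each entry is a dict with an "accession" and a list of
    "comments", where disease comments have type "DISEASE".
    """
    for entry in entries:
        for comment in entry.get("comments", []):
            if comment.get("type") != "DISEASE":
                continue
            yield {
                "accession": entry.get("accession"),
                "disease": comment.get("diseaseId"),
                "description": comment.get("description", {}).get("value", ""),
            }


def write_csv(rows, handle):
    """Write the rows to an open text handle with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Hand-written stand-in for a list of entries, not real API output.
entries = [
    {
        "accession": "A0A1B0GTW7",
        "comments": [
            {"type": "FUNCTION", "text": [{"value": "not a disease comment"}]},
            {"type": "DISEASE", "diseaseId": "Heterotaxy, visceral, 12, autosomal",
             "description": {"value": "A form of visceral heterotaxy"}},
        ],
    }
]

buffer = io.StringIO()
write_csv(disease_rows(entries), buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to a `StringIO` buffer keeps the sketch free of side effects; for a real run, a file opened with `open("diseases.csv", "w", newline="")` could be passed instead. The same rows could also be loaded into pandas if we prefer working with DataFrames.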