Open merdivane opened 1 year ago
Hey team,
I have some exciting news to share regarding building our datasets. I came across an API that I believe will be extremely valuable for our data requirements. It's called the Proteins API, provided by the European Bioinformatics Institute (EBI).
The Proteins service provides an interface for accessing UniProtKB entries and UniProtKB isoform entries. The features service provides protein functional annotations from UniProt Knowledgebase (UniProtKB) protein entries. The variation, proteomics and antigen services provide annotations imported and mapped from large scale data sources, such as 1000 Genomes, ExAC (Exome Aggregation Consortium), ClinVar (Clinical significance of Variants), TCGA (The Cancer Genome Atlas), COSMIC (Catalogue Of Somatic Mutations In Cancer), TOPMed (Trans-Omics for Precision Medicine), gnomAD (Genome Aggregation Database), PeptideAtlas, MaxQB (MaxQuant DataBase), EPD (Encyclopedia of Proteome Dynamics), ProteomicsDB and HPA, along with UniProtKB annotations for these feature types (if applicable). And there is more.
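To make the list of services a bit more concrete, here is a minimal sketch of how they might map onto URLs. It assumes each service is a path segment under the same base URL used in this issue, followed by an accession. The segment names and the `service_url` helper are my assumptions for illustration, so please check them against the API documentation before relying on them.

```python
from urllib.parse import quote

BASE_URL = "https://www.ebi.ac.uk/proteins/api"

# Assumed path segments, one per service mentioned above; verify against the docs.
SERVICES = ("proteins", "features", "variation", "proteomics", "antigen")


def service_url(service: str, accession: str) -> str:
    """Build the URL for one service and one UniProtKB accession (hypothetical helper)."""
    if service not in SERVICES:
        raise ValueError(f"unknown service: {service!r}")
    # Percent-encode the accession so unexpected characters cannot break the path.
    return f"{BASE_URL}/{service}/{quote(accession, safe='')}"


if __name__ == "__main__":
    for name in SERVICES:
        print(service_url(name, "A0A1B0GTW7"))
```

Keeping the service names in one tuple means a typo fails loudly instead of producing a 404 later.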
You can find the documentation for the API here: Proteins API Documentation
Using this API, I've already started experimenting with Python requests and have successfully retrieved responses. Here's an example code snippet I've been working on:
import requests

base_url = "https://www.ebi.ac.uk/proteins/api/proteins"
params = {
    "offset": 0,   # index of the first result to return
    "size": 100,   # maximum number of results per request
    "accession": "A0A1B0GTW7",  # replace with the accession of the protein you are interested in
}
headers = {"Accept": "application/json"}  # ask for a JSON response

response = requests.get(base_url, params=params, headers=headers, timeout=30)
if response.status_code == 200:
    data = response.json()
else:
    print(f"Request failed with status code {response.status_code}")
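Since the request uses `offset` and `size`, larger result sets have to be fetched page by page. Below is a rough sketch of that loop. It assumes the endpoint returns a JSON list and that a page shorter than `size` means there is nothing left. `fetch_page` and `iter_entries` are names I made up for illustration, not part of the API.

```python
from typing import Callable, Iterator

PROTEINS_URL = "https://www.ebi.ac.uk/proteins/api/proteins"


def fetch_page(offset: int, size: int, **filters) -> list:
    """Fetch one page of entries; raises on HTTP errors."""
    import requests  # local import so the paging logic below has no hard dependency

    response = requests.get(
        PROTEINS_URL,
        params={"offset": offset, "size": size, **filters},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def iter_entries(
    fetch: Callable[[int, int], list],
    size: int = 100,
    max_pages: int = 10,
) -> Iterator[dict]:
    """Yield entries page by page until a short or empty page comes back."""
    for page in range(max_pages):
        entries = fetch(page * size, size)
        yield from entries
        if len(entries) < size:
            break  # assumed end-of-results signal


if __name__ == "__main__":
    def fetch(offset: int, size: int) -> list:
        return fetch_page(offset, size, accession="A0A1B0GTW7")

    for entry in iter_entries(fetch):
        print(entry.get("accession"))
```

Passing the fetcher in as a callable keeps the paging logic testable without touching the network, and `max_pages` is a safety cap so a broad query cannot loop indefinitely.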
Some examples of what you can get:
We can use this API to build any dataset we want from (almost?) any source we want!
Let's collect datasets here. To avoid confusion: a dataset is curated data derived from a database that we can use directly in ML and AI models. A database is a source such as ClinVar, which publishes raw data online that we still need to retrieve and convert into a dataset.
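To illustrate the database-to-dataset step, here is a small sketch that flattens raw API entries into flat rows and writes them as CSV. The field names (`accession`, `id`, `sequence.length`, `sequence.sequence`) are my assumption about the JSON layout, and the sample entry contains made-up placeholder values, so treat it as a starting point and not a finished pipeline.

```python
import csv
import io

COLUMNS = ["accession", "entry_name", "sequence_length", "sequence"]


def entry_to_row(entry: dict) -> dict:
    """Pick a few fields from one raw entry (field names are assumed)."""
    sequence = entry.get("sequence") or {}
    return {
        "accession": entry.get("accession"),
        "entry_name": entry.get("id"),
        "sequence_length": sequence.get("length"),
        "sequence": sequence.get("sequence"),
    }


def rows_to_csv(rows) -> str:
    """Serialize dataset rows to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder entry for illustration only; not real API output.
    raw = [{"accession": "A0A1B0GTW7", "id": "EXAMPLE_ID",
            "sequence": {"length": 4, "sequence": "MKTA"}}]
    print(rows_to_csv(entry_to_row(e) for e in raw))
```

Using `.get` throughout means a missing field becomes an empty cell instead of an exception, which makes gaps in the raw data visible in the curated output.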
Dataset Card