svm-ai / svm-hackathon

ClinVar Card #20

Open SSU02 opened 1 year ago

SSU02 commented 1 year ago

Description of database

ClinVar is a public archive of reports of the relationships among human variations and phenotypes, with supporting evidence.

Access (API or download)

VCF download: https://ftp.ncbi.nih.gov/pub/clinvar/vcf_GRCh38/archive_2.0/2018/clinvar_20180701.vcf.gz
HTTP API documentation: https://www.ncbi.nlm.nih.gov/clinvar/docs/api_http/
GCP BigQuery public table: bigquery-public-data.human_variant_annotation.ncbi_clinvar_hg38_20180701
Tab-delimited variant_summary, as used by the authors of the ESM-1b Hugging Face space: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz
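
For the tab-delimited download, a minimal reading sketch using only the Python standard library might look like the following. It assumes the file is gzip-compressed, tab-separated, and has a header line starting with `#`. The function name `iter_variant_summary` and the sample rows are hypothetical, and the three column names in the sample are placeholders rather than a confirmed schema.

```python
import csv
import gzip
import io


def iter_variant_summary(binary_stream):
    """Yield one dict per data row from a gzip-compressed, tab-delimited stream."""
    with gzip.open(binary_stream, mode="rt", encoding="utf-8", newline="") as text:
        reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        # Assumption: the header line begins with '#', which we strip.
        header[0] = header[0].lstrip("#")
        for row in reader:
            yield dict(zip(header, row))


# Tiny in-memory stand-in for the real download (hypothetical rows and columns).
sample_gz = gzip.compress(
    b"#AlleleID\tType\tClinicalSignificance\n"
    b"1\tsingle nucleotide variant\tPathogenic\n"
    b"2\tDeletion\tBenign\n"
)
summary_rows = list(iter_variant_summary(io.BytesIO(sample_gz)))
print(summary_rows[0]["ClinicalSignificance"])
```

For the real file, one option is to save variant_summary.txt.gz locally and pass `open(path, "rb")` to the same function.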

Data Type

Variant pathogenicity (annotation)

CLNDISDBINCL: MedGen, OMIM and Orphanet codes
CLNHGVS: HGVS mutation code
CLNVC: type of mutation (e.g. Del, Insertion, SNP)
CLNSIGINCL: clinical interpretation (i.e. label); options are Pathogenic, Likely Pathogenic, Benign, Likely benign, Unknown, Not Accessed
CLNSIGCONF: conflicting clinical significance
CLNVI: the variant's clinical sources, reported as tag-value pairs of database and variant identifier (e.g. UNIPROT)
GENEINFO: gene(s) for the variant, reported as gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)
MC: comma-separated list of molecular consequences in the form Sequence Ontology ID|molecular_consequence (e.g. missense)
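
The GENEINFO and MC delimiters described above can be unpacked with plain string splitting. The sketch below is illustrative only: the helper names (`parse_info`, `parse_geneinfo`, `parse_mc`) and the example INFO string with its gene symbols and ids are hypothetical, not taken from the ClinVar file.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; keys without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_geneinfo(value):
    """'SYMBOL:ID|SYMBOL:ID' -> [(symbol, gene_id), ...]"""
    pairs = []
    for pair in value.split("|"):
        symbol, _, gene_id = pair.partition(":")
        pairs.append((symbol, gene_id))
    return pairs


def parse_mc(value):
    """'SO_ID|consequence,SO_ID|consequence' -> [(so_id, consequence), ...]"""
    return [tuple(entry.split("|", 1)) for entry in value.split(",")]


# Hypothetical INFO column value, made up for illustration.
example_info = (
    "CLNVC=single_nucleotide_variant;"
    "GENEINFO=GENE1:111|GENE2:222;"
    "MC=SO:0001583|missense_variant,SO:0001627|intron_variant"
)
info_fields = parse_info(example_info)
genes = parse_geneinfo(info_fields["GENEINFO"])
consequences = parse_mc(info_fields["MC"])
is_missense = any(name == "missense_variant" for _, name in consequences)
print(genes, consequences, is_missense)
```

The MC entries are split on the comma first and the vertical bar second, because the Sequence Ontology id itself contains a colon.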

Target metric

CLNSIGINCL: label of pathogenicity

Tried extracting some data

1) Queried the BigQuery table:
SELECT * FROM bigquery-public-data.human_variant_annotation.ncbi_clinvar_hg38_20180701 WHERE contains_substr(MC, 'missense_variant') AND contains_substr(CLNSIGINCL, 'athogenic')
Retrieved 510 rows.

2) Went through the data in clinvar.csv from the Hugging Face ESM-1b project. The data contains file_ID, variant, Clinical significance and allele ID, with shape (988571, 5), as shown in the attached screenshot.

Image

andreiarog commented 1 year ago

Can you check how many benign and pathogenic entries are in clinvar.csv, to see if they match the raw downloads?

SSU02 commented 1 year ago

Hi, I listed the categories in the clinvar.csv file and counted the entries in each. I have the count for each category, but most of the significance categories seem to overlap. Please check the attached screenshot.

Screen Shot 2023-05-30 at 2 01 38 PM
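
One way to deal with the overlapping categories when counting is to collapse each raw label into a coarse bucket before tallying. The sketch below assumes the column is named "Clinical significance", as listed earlier in this thread. The sample rows, the `bucket` helper and its bucketing rule are hypothetical and would need checking against the actual labels in clinvar.csv.

```python
import csv
import io
from collections import Counter


def bucket(label):
    """Map a raw significance string to a coarse bucket (illustrative rule)."""
    text = label.lower()
    has_pathogenic = "pathogenic" in text
    has_benign = "benign" in text
    # Labels mentioning both sides, or flagged as conflicting, are kept apart.
    if "conflicting" in text or (has_pathogenic and has_benign):
        return "conflicting/mixed"
    if has_pathogenic:
        return "pathogenic"
    if has_benign:
        return "benign"
    return "other"


# Hypothetical stand-in for clinvar.csv; real labels may differ.
sample_csv = """variant,Clinical significance
V1,Pathogenic
V2,Pathogenic/Likely pathogenic
V3,Likely benign
V4,Benign/Likely benign
V5,Uncertain significance
V6,Conflicting interpretations of pathogenicity
"""

csv_rows = list(csv.DictReader(io.StringIO(sample_csv)))
raw_counts = Counter(r["Clinical significance"] for r in csv_rows)
bucket_counts = Counter(bucket(r["Clinical significance"]) for r in csv_rows)
print(raw_counts)
print(bucket_counts)
```

Comparing `raw_counts` with `bucket_counts` shows how many distinct raw labels fold into each bucket, which may make the comparison against the raw downloads easier.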