Open SSU02 opened 1 year ago
Can you check how many benign and pathogenic variants are in clinvar.csv, to see if they match the raw downloads?
Hi, I listed the clinical significance categories in the clinvar.csv file and counted the entries in each. I have the count for each category, but many of the categories seem to overlap. Please check the attached screenshot.
Description of database
ClinVar is a public archive of reports of the relationships among human variations and phenotypes, with supporting evidence.
Access (API or download)
VCF download: https://ftp.ncbi.nih.gov/pub/clinvar/vcf_GRCh38/archive_2.0/2018/clinvar_20180701.vcf.gz
API: https://www.ncbi.nlm.nih.gov/clinvar/docs/api_http/
Also from GCP: bigquery-public-data.human_variant_annotation.ncbi_clinvar_hg38_20180701
Also, as the authors of the Hugging Face ESM-1b space did, variant_summary: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz
Data Type
Variant pathogenicity (annotation)
CLNDISDBINCL: MedGen, OMIM and Orphanet codes
CLNHGVS: HGVS mutation code
CLNVC: type of mutation (e.g. Del, Insertion, SNP)
CLNSIGINCL: clinical interpretation (i.e. label); options are Pathogenic, Likely pathogenic, Benign, Likely benign, Unknown, Not Accessed
CLNSIGCONF: conflicting clinical significance
CLNVI: the variant's clinical sources, reported as tag-value pairs of database and variant identifier (e.g. UNIPROT)
GENEINFO: gene(s) for the variant, reported as gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)
MC: comma-separated list of molecular consequences in the form Sequence Ontology ID|molecular_consequence (e.g. missense)
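To make the GENEINFO and MC delimiters described above concrete, here is a minimal Python sketch that splits the two fields. The function names parse_geneinfo and parse_mc are hypothetical helpers, not part of any ClinVar tooling, and the sample values are illustrative only.

```python
def parse_geneinfo(value):
    """Split 'SYMBOL:ID|SYMBOL:ID' into a list of (symbol, gene_id) tuples."""
    pairs = []
    for item in value.split("|"):      # pairs are delimited by a vertical bar
        if not item:
            continue
        symbol, _, gene_id = item.partition(":")  # symbol and id are delimited by a colon
        pairs.append((symbol, gene_id))
    return pairs


def parse_mc(value):
    """Split 'SO_ID|consequence,SO_ID|consequence' into (so_id, consequence) tuples."""
    terms = []
    for item in value.split(","):      # entries are comma separated
        if not item:
            continue
        so_id, _, consequence = item.partition("|")
        terms.append((so_id, consequence))
    return terms


def is_missense(mc_value):
    """True if any molecular consequence in the MC value is missense_variant."""
    return any(cons == "missense_variant" for _, cons in parse_mc(mc_value))
```

For example, parse_geneinfo("BRCA1:672|NBR2:10230") yields two (symbol, id) pairs, and is_missense can be used as an exact-match alternative to a substring search on MC.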
Target metric
CLNSIGINCL: label of pathogenicity
Tried extracting some data
1) SELECT * FROM
bigquery-public-data.human_variant_annotation.ncbi_clinvar_hg38_20180701
WHERE contains_substr(MC, 'missense_variant') AND contains_substr(CLNSIGINCL, 'athogenic')
Retrieved 510 rows.
2) Went through the data in clinvar.csv from the Hugging Face ESM-1b project. The data contains file_ID, variant, clinical significance and allele ID, with shape (988571, 5), as shown in the attached screenshot.
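One possible way to compare benign and pathogenic counts despite overlapping category names is to collapse each clinical significance string into a coarse bucket before counting. The sketch below is only an assumption about how that could be done: the column name "Clinical significance", the bucket names, and the rule that "likely" labels fold into their parent class are all choices made for illustration and may not match the actual clinvar.csv header or the intended grouping.

```python
import csv
from collections import Counter


def bucket(label):
    """Collapse a raw clinical significance string into a coarse bucket."""
    text = label.lower()
    # Check "conflicting" first: "pathogenicity" also contains "pathogenic".
    if "conflicting" in text:
        return "conflicting"
    has_pathogenic = "pathogenic" in text
    has_benign = "benign" in text
    if has_pathogenic and has_benign:
        return "conflicting"
    if has_pathogenic:
        return "pathogenic"   # includes "Likely pathogenic" (assumption)
    if has_benign:
        return "benign"       # includes "Likely benign" (assumption)
    return "other"


def count_buckets(rows, column="Clinical significance"):
    """Count buckets over an iterable of dict-like rows."""
    return Counter(bucket(row[column]) for row in rows)


def count_csv(path, column="Clinical significance"):
    """Count buckets in a CSV file; the column name is an assumed header."""
    with open(path, newline="") as handle:
        return count_buckets(csv.DictReader(handle), column)
```

Calling count_csv("clinvar.csv") would then give one total per bucket, which can be set against the same bucketing applied to the raw downloads.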