gosling-lang / gosling.js

Grammar of Scalable Linked Interactive Nucleotide Graphics
https://gosling.js.org
MIT License

Take a closer look at Clinvar VCF file #709

Open sehilyi opened 2 years ago

sehilyi commented 2 years ago

It looks like ClinVar VCF files do not work with the current VCF loader in Gosling. This might be related to the size of the data.

At a minimum, we need to let users specify the chromosome naming convention. The data we tested uses the "chr" prefix, while ClinVar omits it (e.g., Y). This does not seem to be the main issue, though: even if I change the convention manually, Gosling still does not load any data.

The same data does not seem to work in JBrowse2 either.

[Screenshot, 2022-05-27]

sehilyi commented 2 years ago

Another dataset that does not work on Gosling and JBrowse2:

[Screenshot, 2022-06-24]

Update: Found two issues: (1) the chromosomes are not sorted, and (2) the chromosome names do not use a "chr" prefix.

sehilyi commented 2 years ago

Turns out the ClinVar VCF file lacks the chr prefix, and Gosling was not handling this case well. If I set a custom assembly that excludes the prefix, Gosling correctly loads the data:

{
  "layout": "linear",
  "arrangement": "vertical",
  "centerRadius": 0.8,
  "assembly": [
    ["1", 248956422],
    ["2", 242193529],
    ["3", 198295559],
    ["4", 190214555],
    ["5", 181538259],
    ["6", 170805979],
    ["7", 159345973],
    ["8", 145138636],
    ["9", 138394717],
    ["10", 133797422],
    ["11", 135086622],
    ["12", 133275309],
    ["13", 114364328],
    ["14", 107043718],
    ["15", 101991189],
    ["16", 90338345],
    ["17", 83257441],
    ["18", 80373285],
    ["19", 58617616],
    ["20", 64444167],
    ["21", 46709983],
    ["22", 50818468],
    ["X", 156040895],
    ["Y", 57227415]
  ],
  "xDomain": { "interval": [0, 10000]},
  "views": [
    {
      "tracks": [
        {
          "data": {
            "url": "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz",
            "type": "vcf",
            "indexUrl": "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi"
          },
          "mark": "point",
          "x": {"field": "POS", "type": "genomic"},
          "opacity": {"value": 0.9},
          "width": 600,
          "height": 130
        }
      ]
    }
  ]
}

[Screenshot, 2022-08-04]
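
The prefix-less assembly in the spec above does not have to be typed by hand. A minimal sketch of deriving it, assuming the chromosome sizes are available as `[name, length]` pairs with "chr"-prefixed names; `stripChrPrefix` and `ChromSizes` are hypothetical names for illustration, not part of Gosling's API:

```typescript
// Hypothetical helper (not part of Gosling): converts "chr"-prefixed
// chromosome names ("chr1") to the prefix-less style ClinVar uses ("1").
type ChromSizes = [string, number][];

function stripChrPrefix(sizes: ChromSizes): ChromSizes {
  return sizes.map(
    ([chrom, length]): [string, number] => [chrom.replace(/^chr/, ""), length]
  );
}

// A few entries with the lengths used in the spec above.
const hg38Subset: ChromSizes = [
  ["chr1", 248956422],
  ["chrX", 156040895],
  ["chrY", 57227415],
];

const assembly = stripChrPrefix(hg38Subset);
console.log(JSON.stringify(assembly));
```

The result can be pasted into the `"assembly"` property of the spec. Names that already lack the prefix pass through unchanged.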

I wonder if we can infer the chromosome naming convention automatically (chr1 vs. 1), perhaps by looking at the header of the VCF file.
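
One way the inference could look, as a rough sketch only. It assumes the decompressed header text is available and declares contigs via `##contig=<ID=...>` lines, which the VCF format allows but does not require (whether the ClinVar header includes them has not been checked here). When data lines are present in the text, their CHROM column is used as well. `inferChrConvention` is a hypothetical name, not an existing Gosling function:

```typescript
// Hypothetical helper: guesses whether a VCF uses "chr"-prefixed names.
type ChrConvention = "prefixed" | "unprefixed" | "unknown";

function inferChrConvention(vcfText: string): ChrConvention {
  const names: string[] = [];
  for (const line of vcfText.split(/\r?\n/)) {
    const contig = line.match(/^##contig=<ID=([^,>]+)/);
    if (contig) {
      // Header line that declares a contig name.
      names.push(contig[1]);
    } else if (line !== "" && line.charAt(0) !== "#") {
      // Data line: the first tab-separated column is CHROM.
      names.push(line.split("\t")[0]);
    }
  }
  if (names.length === 0) return "unknown";
  const prefixed = names.filter((n) => /^chr/.test(n)).length;
  if (prefixed === names.length) return "prefixed";
  if (prefixed === 0) return "unprefixed";
  return "unknown"; // mixed conventions: do not guess
}

const header = [
  "##fileformat=VCFv4.1",
  "##contig=<ID=1,length=248956422>",
  "##contig=<ID=Y,length=57227415>",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
].join("\n");

console.log(inferChrConvention(header)); // "unprefixed"
```

Returning "unknown" for empty or mixed input leaves room to fall back to a user-specified convention instead of guessing.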

Also, to be able to visualize lollipop plots using this VCF file directly, we will need to enable parsing the INFO column.

// INFO value example
{"ALLELEID":[1493605],"CLNDISDB":["Human_Phenotype_Ontology:HP:0000090","Human_Phenotype_Ontology:HP:0004748","MONDO:MONDO:0019005","MedGen:C0687120","OMIM:PS256100","Orphanet:ORPHA655","SNOMED_CT:204958008"],"CLNDN":["Nephronophthisis"],"CLNHGVS":["NC_000001.11:g.5904754del"],"CLNREVSTAT":["criteria_provided","_single_submitter"],"CLNSIG":["Pathogenic"],"CLNVC":["Deletion"],"CLNVCSO":["SO:0000159"],"GENEINFO":["NPHP4:261734"],"MC":["SO:0001589|frameshift_variant","SO:0001619|non-coding_transcript_variant"],"ORIGIN":["1"]}
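
To make such a value usable in encodings, the arrays would probably need to be flattened into scalar fields. A minimal sketch, assuming the parsed INFO value has the shape shown above (each key maps to an array of values); `flattenInfo` and the type names are hypothetical, and how Gosling would actually expose these fields is left open:

```typescript
// Hypothetical helper: flattens a parsed INFO object (arrays of values)
// into scalar fields that a track's encodings could reference.
type ParsedInfo = Record<string, (string | number)[]>;
type FlatInfo = Record<string, string | number>;

function flattenInfo(info: ParsedInfo): FlatInfo {
  const flat: FlatInfo = {};
  for (const key of Object.keys(info)) {
    const values = info[key];
    // Single values stay as-is; multi-valued fields are rejoined with ",",
    // which also undoes the comma split seen in CLNREVSTAT.
    flat[key] = values.length === 1 ? values[0] : values.join(",");
  }
  return flat;
}

// A subset of the INFO example above.
const example: ParsedInfo = {
  ALLELEID: [1493605],
  CLNSIG: ["Pathogenic"],
  CLNREVSTAT: ["criteria_provided", "_single_submitter"],
  MC: ["SO:0001589|frameshift_variant", "SO:0001619|non-coding_transcript_variant"],
};

const flat = flattenInfo(example);
console.log(flat.CLNSIG); // "Pathogenic"
console.log(flat.CLNREVSTAT); // "criteria_provided,_single_submitter"
```

With a flat field such as `CLNSIG`, a lollipop plot could map clinical significance to color or vertical position. Joining multi-valued fields with "," keeps the sketch simple; fields like `MC` may be better kept as arrays or split further on "|".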

cc @manzt