hammerlab / biokepi

Bioinformatics Ketrew Pipelines
Apache License 2.0

COSMIC VCF for b37 is suspiciously small #223

Open ihodes opened 8 years ago

ihodes commented 8 years ago

Should have 2M variants /cc @iskandr

smondet commented 8 years ago

It comes from The Broad: https://github.com/hammerlab/biokepi/blob/master/src/lib/reference_genome.ml#L81

arahuja commented 8 years ago

It does, but it is an older version. From their help page:

For COSMIC however, it's more problematic as COSMIC doesn't release a VCF version (that I'm aware of, but please correct me if that's not right). Maintaining a converter for an external data source is something that we can't support right now so it doesn't get upgraded that frequently. However, the main purpose of the COSMIC VCF is rather slight.

Sanger now maintains a VCF for release 76, with 3,222,429 coding mutations, at http://cancer.sanger.ac.uk/cosmic/download. However, you need to register to download it.
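
To compare the Broad copy against the Sanger release, it may help to count the data records in each file. The following is a minimal sketch, not part of biokepi: `count_vcf_records` is a hypothetical helper name, and it assumes a VCF where header lines start with `#` and every other non-empty line is one variant record.

```python
import gzip


def count_vcf_records(path):
    """Count non-header, non-empty lines in a plain or gzip-compressed VCF."""
    # Assumption: a ".gz" suffix means the file is gzip (or bgzip) compressed.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(
            1 for line in handle
            if line.strip() and not line.startswith("#")
        )
```

Running it on both files would show whether the b37 copy is anywhere near the expected 2M variants.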

smondet commented 8 years ago

Where is the source, and what is the format? We could do the transformation ourselves.

iskandr commented 8 years ago

Download instructions from http://cancer.sanger.ac.uk/cosmic/download:

(I'm getting a TSV now so I can tell you what the columns are)


SFTP Download: /files/grch38/cosmic/v76/CosmicCompleteExport.tsv.gz

If you haven't done so already, you will need to register before you can download this file.

You will then need to select one of the two download methods listed below.

  1. GUI client: The most user-friendly method is to use a GUI client such as WinSCP, FileZilla or CyberDuck to connect to our SFTP server. You will need to download the software, install it and consult the documentation before trying to download the file.

    The following credentials will be required to log in.

   Host name:     sftp-cancer.sanger.ac.uk
   Protocol:      sftp
   Port:          22
   Username:      Your email address
   Password:      Your password   

Once logged in you will need to download the file from this location /files/grch38/cosmic/v76/CosmicCompleteExport.tsv.gz (depending on your web browser settings, clicking the link above should open your GUI client)

  2. SFTP from the Command Line: This method is only recommended for those familiar with using the command line.

    To log in, you will need to open a terminal window and use the following command (and enter your password when prompted). Note that the email address must be quoted.
    sftp "your_email_address"@sftp-cancer.sanger.ac.uk

    To download the file, use the following command
    sftp> get /files/grch38/cosmic/v76/CosmicCompleteExport.tsv.gz

    For more help using SFTP on the command line, type either the word 'help' or '?' in your terminal.

iskandr commented 8 years ago

The mutation TSV files look like this:

Gene name   Accession Number    Gene CDS length HGNC ID Sample name ID_sample   ID_tumour   Primary site    Site subtype 1  Site subtype 2  Site subtype 3  Primary histology   Histology subtype 1 Histology subtype 2 Histology subtype 3 Genome-wide screen  Mutation ID Mutation CDS    Mutation AA Mutation Description    Mutation zygosity   LOH GRCh    Mutation genome position    Mutation strand SNP FATHMM prediction   FATHMM score    Mutation somatic status Pubmed_PMID ID_STUDY    Sample source   Tumour origin   Age
PTPN11  ENST00000351677 1782    9644    910428  910428  827913  haematopoietic_and_lymphoid_tissue  NS  NS  NS  haematopoietic_neoplasm acute_myeloid_leukaemia NS  NS  n   COSM13101   c.? p.R289G Substitution - Missense     u                           Variant of unknown origin   15604238        blood-bone marrow   NS
JAK2    ENST00000381652 3399    6192    1104054 1104054 1018290 haematopoietic_and_lymphoid_tissue  NS  NS  NS  other   splanchnic_vein_thrombosis  NS  NS  n   COSM12600   c.1849G>T   p.V617F Substitution - Missense     u   38  9:5073770-5073770   +   n   PATHOGENIC  .94485  Reported in another cancer sample as somatic    18250227        blood-bone marrow   NS
PIK3CA  NM_006218.1 3207    8975    1124990 1124990 1038033 large_intestine NS  NS  NS  carcinoma   adenocarcinoma  NS  NS  n   COSM774 c.3139C>T   p.H1047Y    Substitution - Missense     u   38  3:179234296-179234296   +   n   PATHOGENIC  .94824  Reported in another cancer sample as somatic    18516290        surgery-fixed   primary
JAK2    ENST00000381652 3399    6192    1251736 1251736 1163188 haematopoietic_and_lymphoid_tissue  NS  NS  NS  haematopoietic_neoplasm myeloproliferative_neoplasm NS  NS  n   COSM12600   c.1849G>T   p.V617F Substitution - Missense     u   38  9:5073770-5073770   +   n   PATHOGENIC  .94485  Reported in another cancer sample as somatic    19074595        blood-bone marrow   NS
JAK2    ENST00000381652 3399    6192    1275775 1275775 1187073 haematopoietic_and_lymphoid_tissue  NS  NS  NS  haematopoietic_neoplasm myelodysplastic-myeloproliferative_neoplasm-unclassifiable  NS  NS  n   COSM12600   c.1849G>T   p.V617F Substitution - Missense hom u   38  9:5073770-5073770   +   n   PATHOGENIC  .94485  Reported in another cancer sample as somatic    17443220        blood-bone marrow   NS
IDH1    ENST00000345146 1245    5382    1333161 1333161 1243635 haematopoietic_and_lymphoid_tissue  NS  NS  NS  haematopoietic_neoplasm acute_myeloid_leukaemia NS  NS  n   COSM28746   c.395G>A    p.R132H Substitution - Missense het u   38  2:208248388-208248388   -   n   PATHOGENIC  .94085  Reported in another cancer sample as somatic    20368538        blood-bone marrow   NS
TP53    ENST00000269305 1182    11998   1362583 1362583 1272624 biliary_tract   gallbladder NS  NS  carcinoma   adenocarcinoma  NS  NS  n   COSM43617   c.? p.? Unknown     u                           Reported in another cancer sample as somatic    16177659        NS  NS
JAK2    ENST00000381652 3399    6192    1443351 1443351 1367596 haematopoietic_and_lymphoid_tissue  NS  NS  NS  haematopoietic_neoplasm polycythaemia_vera  NS  NS  n   COSM12600   c.1849G>T   p.V617F Substitution - Missense hom u   38  9:5073770-5073770   +   n   PATHOGENIC  .94485  Reported in another cancer sample as somatic    20422415        blood-bone marrow   NS
KRAS    ENST00000311936 567 6407    1504861 1504861 1427739 large_intestine NS  NS  NS  carcinoma   adenocarcinoma  NS  NS  n   COSM520 c.35G>T p.G12V  Substitution - Missense     u   38  12:25245350-25245350    -   n   PATHOGENIC  .98367  Reported in another cancer sample as somatic    21305640        surgery-fixed   NS

So, a total mess.
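
For the "do the transformation ourselves" option, here is a minimal sketch of turning rows like the ones above into VCF-style fields. It is not an existing biokepi or COSMIC tool, and the function name `cosmic_rows_to_vcf_records` is hypothetical. It assumes the real file is tab-separated with the column names shown in the header above, that `Mutation CDS` alleles are written on the transcript strand (so they are complemented when `Mutation strand` is `-`), and that only simple single-base substitutions with a genome position are wanted. Rows with `c.?`, indels, or missing coordinates are skipped, and since the same mutation is listed once per sample, each `Mutation ID` is emitted only once. Coordinates are passed through as given (GRCh38 in this export), so a b37 file would need a GRCh37 export or a liftover step.

```python
import csv
import io
import re

# Matches simple single-base substitutions such as "c.1849G>T".
SUBSTITUTION = re.compile(r"^c\.\d+([ACGT])>([ACGT])$")
# Matches genome positions such as "9:5073770-5073770".
POSITION = re.compile(r"^(\w+):(\d+)-(\d+)$")
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def cosmic_rows_to_vcf_records(tsv_handle):
    """Yield (chrom, pos, id, ref, alt) for simple substitutions, once per Mutation ID."""
    seen = set()
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        mutation_id = row["Mutation ID"]
        cds = SUBSTITUTION.match(row["Mutation CDS"] or "")
        position = POSITION.match(row["Mutation genome position"] or "")
        if mutation_id in seen or not cds or not position:
            continue  # duplicate sample row, "c.?", indel, or no coordinates
        chrom, start, end = position.groups()
        if start != end:
            continue  # not a single-base event
        ref, alt = cds.groups()
        # Assumption: CDS alleles are on the transcript strand.
        if row["Mutation strand"] == "-":
            ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
        seen.add(mutation_id)
        yield chrom, int(start), mutation_id, ref, alt


# Usage on a cut-down sample that keeps only the columns the sketch reads.
columns = ["Gene name", "Mutation ID", "Mutation CDS",
           "Mutation genome position", "Mutation strand"]
sample = "\n".join("\t".join(fields) for fields in [
    columns,
    ["PTPN11", "COSM13101", "c.?", "", ""],
    ["JAK2", "COSM12600", "c.1849G>T", "9:5073770-5073770", "+"],
    ["JAK2", "COSM12600", "c.1849G>T", "9:5073770-5073770", "+"],
    ["KRAS", "COSM520", "c.35G>T", "12:25245350-25245350", "-"],
])
records = list(cosmic_rows_to_vcf_records(io.StringIO(sample)))
for record in records:
    print("\t".join(str(field) for field in record))
```

A real converter would still need a VCF header, sorting by contig and position, and validation of REF alleles against the reference FASTA, none of which this sketch attempts.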