下载TCGA所有癌症的maf文件做signature分析

下载TCGA所有癌症的maf文件做signature分析 by 生信技能树

才sanger研究所已经做好了这个分析，但是值得我们重复一下，效果如下：

TCGA所有癌症的mutation signature

首先TCGA所有癌症的maf文件

maf格式的mutation记录文件在TCGA里面已经是level4的数据啦，所以是完全open的，可以随意下载，只需要去其GDC官网简单点击，选择即可。

主要步骤就是在https://portal.gdc.cancer.gov/repository里面点击过滤文件类型，选择maf格式，再过滤access权限，选择open即可，最后得到的132个文件就是我们需要的。

总共是2.19GB的文件，每个癌症种类都有4种maf文件，分别是用mutect,muse,vanscan,somaticsniper这4款软件call 到的somatic mutation文件。

下载方式这里我选择下载它们132个文件的manifest文件，然后用GDC提供的官方工具来下载！关于这个工具，我在生信技能树论坛写过教程，就不多说了，自己去看哈，现在下载TCGA数据也是非常方便，首先是GDC网站及客户端就是安装成功后，运行 ./gdc-client download -m manifest_xxx.txt j即可。这个manifest文件就是自己刚才创造并且下载的。

cd ~/institute/TCGA/GDC_NCBI/all 
~/biosoft/GDC/gdc-client download -m gdc_manifest.2017-08-25T02-57-11.281090.txt

但是这个工具，提供的电脑操作系统版本有限哦If you are a user of CentOS 6 or RedHat Enterprise Release 6 and wish to use the Data Transfer Tool, contact the GDC Help Desk for assistance.

所以我是在MAC里面下载好了，再上传到我的服务器去的！

然后根据MAF文件制作signature

我是根据这篇文章Mutational Profile of Metastatic Breast Cancers: A Retrospective Analysis里面的方法来做的，他们的方法描述如下：

De novo mutational signature analysis was done using the Matlab Welcome Trust Sanger Institute’s signature framework. We used the deconstructSigs R package to determine the contribution of the known signatures that explain each sample mutational profile with more than 50 somatic mutations. We considered the 13 signatures (Signatures 1, 2, 3, 5, 6, 8, 10, 13, 17, 18, 20, 26, and 30) operative in breast cancer as defined in COSMIC (http://cancer.sanger.ac.uk/signatures/matrix.png). A signature was defined as operative or predominant if its contribution to the mutational pattern was respectively >25% (or >100 mutations) or >50%.

虽然我以前写过一个类似的教程，用SomaticSignatures包来解析maf突变数据获得mutation signature 这里还是再学习学习这个新的工具deconstructSigs R package吧。

这是个R包，所以直接在Rstudio里面安装即可，这里选取BRCA的somatic mutation的MAF文件做一下分析，看看四个软件找出的变异，是否在signature上面有差异。

install.packages('deconstructSigs')
# dependencies 'BSgenome', 'BSgenome.Hsapiens.UCSC.hg19' 
BiocInstaller::biocLite('BSgenome')
BiocInstaller::biocLite('BSgenome.Hsapiens.UCSC.hg19')
## https://github.com/raerose01/deconstructSigs
file1='TCGA.STAD.muse..somatic.maf.gz'
TCGA.STAD.muse=read.table(file1,sep = '\t',quote="",header = T)
TCGA.STAD.muse[1:5,1:15]
## data frame including 5 columns: sample.ID,chr,pos,ref,alt 
sample.mut.ref <- data.fram(Sample='TCGA.STAD.muse', 
                                chr = TCGA.STAD.muse[,5], 
                                pos = TCGA.STAD.muse[,6], 
                                ref = TCGA.STAD.muse[,11], 
                                alt = TCGA.STAD.muse[,13])
sigs.input <- mut.to.sigs.input(mut.ref = sample.mut.ref, 
                                sample.id = "Sample", 
                                chr = "chr", 
                                pos = "pos", 
                                ref = "ref", 
                                alt = "alt")
class(sigs.input)
sample_1 = whichSignatures(tumor.ref = sigs.input, 
                           signatures.ref = signatures.nature2013, 
                           sample.id = 'TCGA.STAD.muse', 
                           contexts.needed = TRUE,
                           tri.counts.method = 'exome')
# Plot example                  
plot_example <- whichSignatures(tumor.ref = sigs.input  ,  
                       signatures.ref = signatures.nature2013, 
                       sample.id = 'TCGA.STAD.muse' )
# Plot output
plotSignatures(plot_example, sub = 'example')

这个里面有一个问题，就是deconstructSigs R package似乎只支持hg19版本的基因组，而我下载的TCGA的MAF是hg38版本的，所以代码虽然是对的，但实际上做出的结果是不对的，需要把下载的TCGA的maf文件进行坐标转换。

而所谓批量，无非就是在上面的R脚本里面加入一个循环咯。

（点击阅读原文有这个包的详细说明书哈！）

注意事项，下载的MAF文件可能有两种格式，可能是47列，或者120列，第一行一般都是头文件，注释着每一列的信息，的确，信息量有点略大。如下：

     1    Hugo_Symbol
     2    Entrez_Gene_Id
     3    Center
     4    NCBI_Build
     5    Chromosome
     6    Start_Position
     7    End_Position
     8    Strand
     9    Consequence
    10    Variant_Classification
    11    Variant_Type
    12    Reference_Allele
    13    Tumor_Seq_Allele1
    14    Tumor_Seq_Allele2
    15    dbSNP_RS
    16    dbSNP_Val_Status
    17    Tumor_Sample_Barcode
    18    Matched_Norm_Sample_Barcode
    19    Match_Norm_Seq_Allele1
    20    Match_Norm_Seq_Allele2
    21    Tumor_Validation_Allele1
    22    Tumor_Validation_Allele2
    23    Match_Norm_Validation_Allele1
    24    Match_Norm_Validation_Allele2
    25    Verification_Status
    26    Validation_Status
    27    Mutation_Status
    28    Sequencing_Phase
    29    Sequence_Source
    30    Validation_Method
    31    Score
    32    BAM_File
    33    Sequencer
    34    t_ref_count
    35    t_alt_count
    36    n_ref_count
    37    n_alt_count
    38    HGVSc
    39    HGVSp
    40    HGVSp_Short
    41    Transcript_ID
    42    RefSeq
    43    Protein_position
    44    Codons
    45    Hotspot
    46    cDNA_change
    47    Amino_Acid_Change

     1    Hugo_Symbol
     2    Entrez_Gene_Id
     3    Center
     4    NCBI_Build
     5    Chromosome
     6    Start_Position
     7    End_Position
     8    Strand
     9    Variant_Classification
    10    Variant_Type
    11    Reference_Allele
    12    Tumor_Seq_Allele1
    13    Tumor_Seq_Allele2
    14    dbSNP_RS
    15    dbSNP_Val_Status
    16    Tumor_Sample_Barcode
    17    Matched_Norm_Sample_Barcode
    18    Match_Norm_Seq_Allele1
    19    Match_Norm_Seq_Allele2
    20    Tumor_Validation_Allele1
    21    Tumor_Validation_Allele2
    22    Match_Norm_Validation_Allele1
    23    Match_Norm_Validation_Allele2
    24    Verification_Status
    25    Validation_Status
    26    Mutation_Status
    27    Sequencing_Phase
    28    Sequence_Source
    29    Validation_Method
    30    Score
    31    BAM_File
    32    Sequencer
    33    Tumor_Sample_UUID
    34    Matched_Norm_Sample_UUID
    35    HGVSc
    36    HGVSp
    37    HGVSp_Short
    38    Transcript_ID
    39    Exon_Number
    40    t_depth
    41    t_ref_count
    42    t_alt_count
    43    n_depth
    44    n_ref_count
    45    n_alt_count
    46    all_effects
    47    Allele
    48    Gene
    49    Feature
    50    Feature_type
    51    One_Consequence
    52    Consequence
    53    cDNA_position
    54    CDS_position
    55    Protein_position
    56    Amino_acids
    57    Codons
    58    Existing_variation
    59    ALLELE_NUM
    60    DISTANCE
    61    TRANSCRIPT_STRAND
    62    SYMBOL
    63    SYMBOL_SOURCE
    64    HGNC_ID
    65    BIOTYPE
    66    CANONICAL
    67    CCDS
    68    ENSP
    69    SWISSPROT
    70    TREMBL
    71    UNIPARC
    72    RefSeq
    73    SIFT
    74    PolyPhen
    75    EXON
    76    INTRON
    77    DOMAINS
    78    GMAF
    79    AFR_MAF
    80    AMR_MAF
    81    ASN_MAF
    82    EAS_MAF
    83    EUR_MAF
    84    SAS_MAF
    85    AA_MAF
    86    EA_MAF
    87    CLIN_SIG
    88    SOMATIC
    89    PUBMED
    90    MOTIF_NAME
    91    MOTIF_POS
    92    HIGH_INF_POS
    93    MOTIF_SCORE_CHANGE
    94    IMPACT
    95    PICK
    96    VARIANT_CLASS
    97    TSL
    98    HGVS_OFFSET
    99    PHENO
   100    MINIMISED
   101    ExAC_AF
   102    ExAC_AF_Adj
   103    ExAC_AF_AFR
   104    ExAC_AF_AMR
   105    ExAC_AF_EAS
   106    ExAC_AF_FIN
   107    ExAC_AF_NFE
   108    ExAC_AF_OTH
   109    ExAC_AF_SAS
   110    GENE_PHENO
   111    FILTER
   112    CONTEXT
   113    src_vcf_id
   114    tumor_bam_uuid
   115    normal_bam_uuid
   116    case_id
   117    GDC_FILTER
   118    COSMIC
   119    MC3_Overlap
   120    GDC_Validation_Status

ixxmu / mp_duty