tgen / CovGen

Creates a target-specific exome_full192.coverage.txt file required by MutSig
MIT License

Confusion about covgen input files #9

Closed llrhys closed 4 years ago

llrhys commented 4 years ago
I have whole genome sequencing data for 46 cancers, and I want to find significantly mutated genes using MutSigCV. When I use MutSigCV's "exome_full192.coverage.txt", many genes cannot be mapped to coverage information and are excluded ("NOTE:  5178/16113 gene names could not be mapped to coverage information.  Excluding them"). So I want to create my own coverage file.
However, I don't know what input file format CovGen expects. I converted a BAM file to BED format using bedtools, but I get many errors when I run CovGen. I have 46 BAM files. Which BAM files should I select, or should I merge them? I am really confused, please help me.
My command and BED file format are attached below.

command: /home/ll/CovGen/CovGen -o 504Coverage -f /home/ll/CovGen/my_own_files/Homo_sapiens.GRCh37.dna.primary_assembly.fa -g /home/ll/CovGen/my_own_files/Homo_sapiens.GRCh37.75.gtf -t ./input.bed -s /home/ll/snpEff/ -v 37.75 -p 7

bed file format

MT 0 88 CL100066573L2C007R049_258585/1 60 -
MT 0 39 CL100066573L2C011R095_44911/1 17 +
MT 0 86 CL100066573L2C001R037_96514/1 60 -
MT 0 65 CL100066573L2C005R038_4405/2 51 +
MT 0 82 CL100066573L2C009R068_67527/1 60 +
MT 0 70 CL100066573L2C017R017_82594/1 3 +
MT 0 71 CL100066573L2C017R017_82594/2 3 -
MT 0 65 CL100066573L2C013R042_477368/1 48 +
MT 0 86 CL100066573L2C011R003_329896/1 60 -
MT 0 86 CL100066573L2C011R003_329896/2 60 +
MT 0 56 CL100066573L2C016R003_447441/1 34 +
MT 0 94 CL100066573L2C017R009_525239/2 60 +
MT 0 82 CL100066573L2C008R092_68382/2 60 +

awchrist commented 4 years ago

Hi @llrhys, aside from using CovGen to generate the custom coverage files, I see two roadblocks in your post that are more specific to MutSig. I think there is a workaround for the first one, but not for the second.

MutSig was designed to work on exome/capture space and has hardcoded cutoffs that, among other things, remove all noncoding variants, which defeats the entire purpose of MutSig.

To get around this, you could filter your WGS variants to a capture space, or perhaps to genes and their introns.
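One way to do that filtering can be sketched as follows. This is only a minimal illustration, not part of CovGen or MutSig: the function names `build_index` and `in_targets` and the toy coordinates are made up for the example, and it assumes 0-based half-open BED intervals and 1-based variant positions.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(intervals):
    """Group (chrom, start, end) BED intervals by contig, then sort and merge overlaps."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, spans in by_chrom.items():
        spans.sort()
        merged = []
        for start, end in spans:
            if merged and start <= merged[-1][1]:
                # Overlapping or touching interval: extend the previous one.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = ([s for s, _ in merged], merged)
    return index


def in_targets(index, chrom, pos):
    """Return True if a 1-based position falls inside any target interval."""
    if chrom not in index:
        return False
    starts, merged = index[chrom]
    zero_based = pos - 1
    i = bisect_right(starts, zero_based) - 1
    return i >= 0 and merged[i][0] <= zero_based < merged[i][1]


# Toy target space and variants (hypothetical coordinates).
targets = [("1", 100, 200), ("1", 150, 300), ("2", 50, 60)]
variants = [("1", 101), ("1", 100), ("1", 300), ("1", 301), ("2", 61), ("MT", 5)]

index = build_index(targets)
kept = [v for v in variants if in_targets(index, *v)]
print(kept)
```

For real data an established interval tool would likely be a better choice; the sketch only shows the coordinate logic involved.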

The second problem for your data set and MutSig is that the algorithms implemented in MutSig require many hundreds to thousands of samples.

When you say that you have WGS of 46 cancers, does that mean you have variant calls from 46 samples, or do you have many more samples from 46 different cancer types?

If it is the former, then I would do a quick count of the samples with a mutation in each gene. If no gene has more than 2 samples with a mutation, then I can't imagine MutSig will call anything as significant.
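A quick count like that could look roughly like this. It is a sketch under assumptions: the input is taken to be a tab-delimited MAF-style table, the column names `Hugo_Symbol` and `Tumor_Sample_Barcode` are assumed and may differ in your file, and the function name `samples_per_gene` and the toy rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def samples_per_gene(handle, gene_col="Hugo_Symbol", sample_col="Tumor_Sample_Barcode"):
    """Count distinct samples with at least one mutation in each gene."""
    samples = defaultdict(set)
    # Skip comment lines such as a leading "#version" header.
    rows = (line for line in handle if not line.startswith("#"))
    for row in csv.DictReader(rows, delimiter="\t"):
        samples[row[gene_col]].add(row[sample_col])
    # Most frequently mutated genes first; ties broken by gene name.
    return sorted(((g, len(s)) for g, s in samples.items()),
                  key=lambda item: (-item[1], item[0]))


# Toy input (hypothetical samples S1..S3).
toy_maf = io.StringIO(
    "Hugo_Symbol\tTumor_Sample_Barcode\n"
    "TP53\tS1\n"
    "TP53\tS1\n"
    "TP53\tS2\n"
    "NOTCH1\tS3\n"
)
counts = samples_per_gene(toy_maf)
print(counts)
```

Two different mutations in the same gene and sample count once, since the point is the number of samples per gene, not the number of mutations.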

llrhys commented 4 years ago

Thank you for your reply!

First, my samples are whole genome sequencing, and I have already removed non-exonic variants. As for ncRNA, should I keep or filter those variants?

Second, my samples include 36 keratoacanthomas (KA), 13 cutaneous squamous cell carcinomas (cSCC), and 17 normals (my mistake: the total is 49, not 46). I divided the samples into KA and cSCC groups to run MutSigCV. Based on your suggestion, should I first count how many samples carry each mutation and keep only the mutations that occur in more than 2 samples for MutSigCV? Or are there other methods for finding driver mutations in my data?

awchrist commented 4 years ago

I would keep to protein-coding exonic space only, but I could be wrong about this, as I have never attempted to start from variants called from WGS. I still think you do not have enough samples to make MutSig worthwhile. I suggested getting counts just as an exploratory exercise. I wouldn't filter using the counts at all.

Out of curiosity, which gene has the largest number of samples with a mutation in it? They don't have to be the same mutation.

Let's get back to how to use CovGen. Why did you convert the BAM files into BED format? The -t should be the target/exonic space you are planning to use in your analysis. Does the BED file use tab or space as the delimiter? I would also exclude MT contigs.
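A small check along those lines might look like the sketch below. It is only an illustration: the thread does not confirm that CovGen needs tabs, so rewriting to tab-delimited three-column lines is a precaution rather than a documented requirement, and the function name `clean_bed_lines` and the contig names in `drop_contigs` are assumptions.

```python
def clean_bed_lines(lines, drop_contigs=("MT", "chrM", "M")):
    """Normalise BED lines to tab-delimited chrom/start/end and drop unwanted contigs.

    Returns the cleaned lines and the number of input records that had no tab
    in them (i.e. were space-delimited).
    """
    cleaned = []
    space_delimited = 0
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and common header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        if "\t" not in line:
            space_delimited += 1
        fields = line.split()  # splits on tabs or spaces
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if chrom in drop_contigs:
            continue
        cleaned.append(f"{chrom}\t{start}\t{end}")
    return cleaned, space_delimited


# Toy input (hypothetical records).
raw = ["MT 0 88 read1 60 -\n", "1\t100\t200\n", "2 50 60\n"]
cleaned, space_delimited = clean_bed_lines(raw)
print(cleaned, space_delimited)
```

Note that this only tidies the format. A BED file made from BAM reads still describes read alignments rather than the target space that -t is meant to receive.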

The optional -b is used to filter out target space in an exome capture that is performing poorly across multiple samples. This option may not be needed with WGS but may help with poor mapping regions.

llrhys commented 4 years ago

Thanks for your suggestion. I will seriously consider not using MutSigCV anymore.