nghiavtr / SCmut

GNU General Public License v3.0
30 stars 9 forks source link

where to get cosmic file to call somatic mutations from hg38 #8

Open limin321 opened 1 year ago

limin321 commented 1 year ago

Hey,

Thanks for developing the SCmut tool. I am trying to use it on my own data, but my tumor- and normal- bulk bam file are all generated based on hg38. I try to use gatk mutect2 to call somatic mutation based on step 3 in the tutorial. However, I don't see mutect2 has the --cosmic argument on the webpage: https://gatk.broadinstitute.org/hc/en-us/articles/360037593851-Mutect2 ; Do I have to include cosmic file for calling somatic mutation? if Yes, Could you please navigate me where I can download the cosmic file for hg38 reference?

Would you mind providing an example code of how to generate the somatic mutations vcf using gatk Mutect2 to meet the requirements of SCmut input?

Thank you so much.

Best, LC

nghiavtr commented 1 year ago

Dear @limin321

Cosmic database provided you the list of curated known somatic mutations, which was one of parameters of the mutect version that we tested Scmut several years ago. This option aims to improve the performance of mutect. However, it is independent from the core implementation (2dFDR) of our tool Scmut. If the new version of mutect removed the option, we dont have to use Cosmic anymore. You also can use the output of varscan or any alternatives instead.

Best, Nghia

limin321 commented 1 year ago

Dear @limin321

Cosmic database provided you the list of curated known somatic mutations, which was one of parameters of the mutect version that we tested Scmut several years ago. This option aims to improve the performance of mutect. However, it is independent from the core implementation (2dFDR) of our tool Scmut. If the new version of mutect removed the option, we dont have to use Cosmic anymore. You also can use the output of varscan or any alternatives instead.

Best, Nghia

Hi Nghia, Thanks for the reply. I ran Mutect2 to call somatic mutation. I have somatic mutations from both chr1-chrY. Most somatic mutations are from unknown contigs. I saw your test data not including somatic mutations from contigs. Do you filter out those somatic mutations?

Another question is, I ran your example data. After running through the whole scripts. I manually add plot.new() in front of the points() to initialize the function. However, there is no figure generated. I don't know what outputs should I expect to see. Which are the final results? I need to know which cells have which somatic mutations.

Screenshot 2023-06-26 at 6 48 45 PM

Really appreciate your help.

Best,

LC

limin321 commented 1 year ago

After running step4 (https://github.com/nghiavtr/SCmut#4-variant-calling-of-multiple-files-from-both-rna-seq-and-dna-seq-data) with my own data, the raFull and rrFull only have 3 columns. Their data structure is different from the example data. Here is how my raFull and rrFull looks like:

Screenshot 2023-06-26 at 7 08 39 PM

Could you also explain what colnames mean in my data?

How can create the raFull and rrFull as shown in example data?

Thanks,

Best, LC

nghiavtr commented 1 year ago

Hi @limin321

"Do you filter out those somatic mutations?" ---> Yes, we usually keep the mutations in chr1-22, X, Y and M

"I manually add plot.new() in front of the points() to initialize the function. However, there is no figure generated. I don't know what outputs should I expect to see. Which are the final results? I need to know which cells have which somatic mutations." ---> you dont need to add plot.new() because the points() needs to add the points on the contour plotted in the previous steps. The codes for the example data can be run in a copy-paste style and at the end you can get the plot like the one at the bottom-right of Figure 1 in the SCmut paper (https://doi.org/10.1093/bioinformatics/btz288)

Best, Nghia

nghiavtr commented 1 year ago

Hi @limin321,

Each column should contain the counts of reference/variant alleles from a sample (a single cell or bulk-WES sample). Thus there must be something wrong, I guess it might be in the output format of the new version of mpileup tool. Can you put here few first lines of the output.snp.vcf file?

Nghia

After running step4 (https://github.com/nghiavtr/SCmut#4-variant-calling-of-multiple-files-from-both-rna-seq-and-dna-seq-data) with my own data, the raFull and rrFull only have 3 columns. Their data structure is different from the example data. Here is how my raFull and rrFull looks like: Screenshot 2023-06-26 at 7 08 39 PM

Could you also explain what colnames mean in my data?

How can create the raFull and rrFull as shown in example data?

Thanks,

Best, LC

limin321 commented 1 year ago

Hi @limin321,

Each column should contain the counts of reference/variant alleles from a sample (a single cell or bulk-WES sample). Thus there must be something wrong, I guess it might be in the output format of the new version of mpileup tool. Can you put here few first lines of the output.snp.vcf file?

Nghia

After running step4 (https://github.com/nghiavtr/SCmut#4-variant-calling-of-multiple-files-from-both-rna-seq-and-dna-seq-data) with my own data, the raFull and rrFull only have 3 columns. Their data structure is different from the example data. Here is how my raFull and rrFull looks like: Screenshot 2023-06-26 at 7 08 39 PM Could you also explain what colnames mean in my data? How can create the raFull and rrFull as shown in example data? Thanks, Best, LC

Hi, Here is my code to call SNPs from 3 bams:

fileList=("RNA4.bam" "tumor.bam" "normal.bam")
genomehg38Fasta="./hg38/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
snv_fn="output.snp.vcf"

samtools mpileup -f ${genomehg38Fasta} ${fileList[@]} | varscan mpileup2snp --min-coverage 5  --min-avg-qual 15 --min-var-freq 0.01 --p-value 1 > ${snv_fn}

Here is the first few lines of the output file: output.snp.vcf

Screenshot 2023-06-28 at 9 12 31 AM

I can also email you the output.snp.vcf if needed. Thank you so much for your help.

Best, LC

nghiavtr commented 1 year ago

hi @limin321 ,

Your input contains only three bam files ("RNA4.bam" "tumor.bam" "normal.bam"), so it makes sense that you have only three columns in the output. We expect one bam file for one single cell for the input of mpileup. Scmut was tested on single cell data generated by smartseq2 protocol, so sequencing data of each single cell will be stored in one sample (bam). In your case, if RNA4.bam contains the data of a pool of single cells, you need to divide the files into smaller bam files, each contains the data of one cell.

Nghia

Nghia

limin321 commented 1 year ago

hi @limin321 ,

Your input contains only three bam files ("RNA4.bam" "tumor.bam" "normal.bam"), so it makes sense that you have only three columns in the output. We expect one bam file for one single cell for the input of mpileup. Scmut was tested on single cell data generated by smartseq2 protocol, so sequencing data of each single cell will be stored in one sample (bam). In your case, if RNA4.bam contains the data of a pool of single cells, you need to divide the files into smaller bam files, each contains the data of one cell.

Nghia

Nghia

Hi, Thanks for replying.

Would you mind sharing the codes for how you get bam file for each cell? and how many bam files or cells do you have? I tried my data which has around 800 cells, and when running this code samtools mpileup -f ${genomehg38Fasta} ${fileList[@]} | varscan mpileup2snp --min-coverage 5 --min-avg-qual 15 --min-var-freq 0.01 --p-value 1 > ${snv_fn}, it ran 100 hours, and still failed because of Timeout. How long normally do you run, and how much ram you will suggest to use for ~800 bam files?

Thank you so much.

Best, LC

nghiavtr commented 1 year ago

Hi @limin321,

Sorry for the late reply due to my long vacation.

I just run SCmut with smartseq2 single-cell RNA-seq data. This protocol usually produces a small sample size (~100 cells), and each sample contains the data of one single cell, so I did not need to separate cells from a bam file. But you can google for a solution (I get you already got the solution)

I have never tried to run SCmut with 800 cells, so I am not sure about the Ram and CPUs requirements. However, I think to reduce the computational burden, you might do some preprocessing steps to remove noisy SNVs or cells, for example with very low counts.

If you send me your RData file for the scMut input, I will have a look to see if we can revise the codes to speed up SCmut.

Best, Nghia

Hi, Thanks for replying.

Would you mind sharing the codes for how you get bam file for each cell? and how many bam files or cells do you have? I tried my data which has around 800 cells, and when running this code samtools mpileup -f ${genomehg38Fasta} ${fileList[@]} | varscan mpileup2snp --min-coverage 5 --min-avg-qual 15 --min-var-freq 0.01 --p-value 1 > ${snv_fn}, it ran 100 hours, and still failed because of Timeout. How long normally do you run, and how much ram you will suggest to use for ~800 bam files?

Thank you so much.

Best, LC