nghiavtr / SCmut

GNU General Public License v3.0
31 stars 9 forks source link

Regarding "Variant calling of multiple files". Incredibly slow. And too many variants? #1

Closed ahwanpandey closed 5 years ago

ahwanpandey commented 5 years ago

Hello,

I am using the following command to call the variants. My intention here is to speed up this process while retaining all the variant information that is needed.

samtools mpileup 
   --min-MQ 10 \
   --min-BQ 13 \
   --region $CHR \
   -f $GENOME_FASTA \
   --bam-list $BAM_FILES_LIST | \
java -jar VarScan.v2.4.3.jar mpileup2snp \
   --min-coverage 5 \
   --min-avg-qual 15 \
   --min-var-freq 0.01 \
   --p-value 1 \
   --vcf-sample-list $SAMPLE_NAMES \
   --output-vcf 1 > mpileup2snp.$CHR.vcf

A couple of changes I made was to include the "--min-MQ 10" parameter to mpileup because it was outputting every region even if there were reads with a MAPQ of 0.

Another change was to call variants per chromosome.

The data that I am generating this is on:

tumor.wgs -> N=1 normal.wgs -> N=1 tumor.rnaseq -> N=1 tumor.scrnaseq -> N=437

So you see there are 440 BAM files and I can understand the variant calling would be slow. It has been almost 2 days since the variant calling has been running and the output files already have 2,225,406 variants, which will only increase as the data hasn't even been processed half way. The example input data for SCmut has 1,253,869 variants. So I am wondering if I am calling too many variants and there is more filtering to be had?

I would love some feedback regarding this!

Thanks.

nghiavtr commented 5 years ago

Hello,

Thank you for interesting in SCmut.

I understand that you are worried about too many variants when calling from the data. It is easy to get too many variants from a big dataset with 440 samples. We were successful to test SCmut with a data consisting of 2,605,510 variants from 96 single cells, however, your data can have a bigger variant set when the calling is completed. In my opinion, the big number of variants is not the problem, as long as, your computer has enough RAM and storage to process data.

My suggestion is that you would consider increasing the thresholds in mpileup, for example, min-var-freq (Minimum variant allele frequency threshold). Thus I guess the number of variants can substantially decrease. This, of course, depends on your research question if you do not care much about the very low VAF variants.

I hope my response would help and good luck with your research.

Best, Nghia

ahwanpandey commented 5 years ago

Hi @nghiavtr

Thanks fro you reply and for sharing this tool!

So far, using my parameters above there are about 7 million SNPs reported in the mpileup2snp file. I will keep your suggestion in mind and try to filter these before running SCmut if necessary.

Thanks!

nghiavtr commented 5 years ago

You are welcome! Do not hesitate to post your question here if there are any issues with using SCmut. Good luck with your research.

Best, Nghia

ahwanpandey commented 5 years ago

Hi again.

I have already generated the mpileup2snps file and it would take me a long time to re-do with a different "min-var-freq" threshold.

How would I go about filtering this from the raFull and rrFull counts table that I have? I couldn't filter using just the tumor sample column can I? Or does this threshold have to be set for all samples? Like calculate allele frequency across all tumor wgs, normal wgs, and scrnaseq data and exclude rows that are less than say 0.1?

I guess as a main classification I wanted is, what does the raFull and rrFull actually have to consist of?

Really appreciate any help!

nghiavtr commented 5 years ago

Hi ahwanpandey,

It is likely that you get some issues with the data of raFull and rrFull objects. So I put a few lines of R codes at the end of Section 4 to extract these objects from output.snp.vcf in the previous step. Since your data are too big, so it might take time. Sorry for that, I would improve the codes for speed in the future.

About filtering the variants, if I understand correctly, you want to do the filter before inputting to the cell-level mutation detection of SCmut. If so I think SCmut can run with 7M variants from your data, so it is not necessary to do that for the input of SCmut. You might do the filter with a certain VAF for the output of SCmut.

Hope my comments would help.

Best, Nghia