t-neumann / slamdunk

Streamlining SLAM-seq analysis with ultra-high sensitivity
GNU Affero General Public License v3.0
37 stars 22 forks source link

Are alleyoop rates and multiqc plots snp corrected? #120

Closed seb-mueller closed 5 months ago

seb-mueller commented 1 year ago

I'm currently struggling to work out true conversion rates of an experiment and some of the mutiqc reports (i.e. rates and UTR rates) show much higher than expected conversion rates even for the control samples.

Having investigated the cause, one likely reason is a high number of SNPs, which I also tried to call using alternative base caller such as freebayes, which seems to work based on inspection in a genome browser. However I noticed that most reports created by Alleyoop don't take a vcf file as input, specifically only the count module, but none of the others such as rates and utrrates.

I was therefore wondering if the the output of alleyoop rates/utrrates are SNP corrected at all, and if so, how to provide a custom vcf file to correct for custom SNPs?

Further, multiqc seems to be based on the rates/utrrates output above, but not on the (snp corrected) tsv files produced by count. I was therefore wondering if any mutiqc results are SNP corrected?

Apologies if I missed something obvious, but I couldn't find this info in the documentation and was wondering how to interpret those basic plots and results.

t-neumann commented 1 year ago

Hi @seb-mueller sorry for the long delay.

I dont have it now at the back of my had, but if there is no option to supply a SNP directory with -s or a vcf file, then for sure the output is not SNP corrected. Initially we had all these QC plots to check unbiased all conversion rates also due to sequencing errors etc, so we did this in a very raw way without applying any filters.

The best way to get SNP masked estimates would be to use the count files, but then you only get feature based numbers and not stratified by read.

seb-mueller commented 1 year ago

Hi @t-neumann , thanks for getting back on this one and confirming my suspicion. Just to provide some explicit confirmation for future readers, below are the calls for 3 of the alleyoop modules. None of them have a SNP directory of vcf file as input.

alleyoop rates [-h] -o <output directory> -r <reference fasta> [-mq <MQ cutoff>]
               [-t <threads>] bam [bam ...]

alleyoop tccontext [-h] -o <output directory> -r <reference fasta> [-mq <MQ cutoff>]
                   [-t <threads>] bam [bam ...]

alleyoop utrrates [-h] -o <output directory> [-r <reference fasta>] [-mq <MQ cutoff>] [-m]
                  [-t <threads>] -b <bed file> -l <maximum read length> bam [bam ...]

This also implicates that none of the multiqc results correct for snps since it seemingly relies on the 3 modules above as far as I can tell (I couldn't find any tcount files mentioned as input for multiqc):

https://github.com/ewels/MultiQC/blob/cdb412e0d2c447eee01ba705eaa9c179b3645f22/multiqc/modules/slamdunk/slamdunk.py#L74 (as well as line 86 and 102)

I suppose the multiqc report could benefit from mentioning this fact in the report since this is an important factor. I'd happy to submit a PR do so. In fact the reason I was wondering about this is our observation of quite a few SNPs biasing the conversion rate, so comparing the rates with/without SNP correction would be helpful to assess the impact (especially on a read level).

Anyway, good to know that working with tcount files is the only way for our purposes (or deriving rates in a costum way using samtools mpileup magic etc.)

t-neumann commented 1 year ago

Hi @seb-mueller

please by all means submit a PR, that would be very welcome. I guess why this was so far not an important factor for us is that we only compared samples within the same cell line and compared also then genes within the same genetic context, so any SNP induced biases would cancel themselves out.

But yeah requirements have changed over the years indeed, haven't worked too much on the code I admit....