markrobinsonuzh / CrispRVariants

23 stars 4 forks source link

plotVariants top.n using INDELs only - missing rows #13

Closed axgraf closed 5 years ago

axgraf commented 5 years ago

Hi, I recently used the CrispRVariants package and everything worked as expected. I used the plotVariant function and limit the variants only to INDELs (include.nonindel = F) and left the default for top.n=50 for plotAlignment and plotFreqHeatmap.

plotVariants(crispr_set, 
        plotAlignments.args = list(include.nonindel = FALSE, top.n=50),
        plotFreqHeatmap.args = list( include.nonindel = FALSE, top.n=50),
        left.plot.margin = ggplot2::unit(c(0.1,0,11,0.2), "lines")
    )

I would now expect to get my first top 50 INDELs, but I only receive the reference sequence:

crispRvariant_top50

When I increase the top.n to let's say 500:

plotVariants(crispr_set, 
        plotAlignments.args = list(include.nonindel = FALSE, top.n=500),
        plotFreqHeatmap.args = list( include.nonindel = FALSE, top.n=500),
        left.plot.margin = ggplot2::unit(c(0.1,0,11,0.2), "lines")
    )

I will end up with all INDELs I would like to see.

crispRvariant_top1000

Looking at the variants the table shows, that the INDELs are sorted to the bottom of the matrix.

variants <- variantCounts(crispr_set)
tail(variants, n=20) # skipped sample 6-9 for better readablility in the screenshot

crispRvariant_tail

nrow(variants)  # 293

It seems that limiting to only INDELs will not (re-)sort the table accordingly, or is there anything I'm doing wrong?

Best Alex

HLindsay commented 5 years ago

Hi Alex,

This happens because when I wrote this code I decided that top.n = x, include.nonindel = FALSE means the indel variants from amongst the top x variant alleles, rather than the top x of the non-indel variants. That is, the filtering by top.n happens before the filtering by variant type.

It's possible to do what you want with a couple of extra steps. First, get the table of all indel variants and get the names of the variants you want to keep. Then, use the alleles argument of plotVariants:


vc <- variantCounts(crispr_set, include.nonindel = FALSE)
n_keep = min(50, nrow(vc))
allele_names <- rownames(vc)[1:n_keep]

plotVariants(crispr_set
    plotAlignments.args = list(alleles = allele_names),
    plotFreqHeatmap.args = list(alleles = allele_names),
    left.plot.margin = ggplot2::unit(c(0.1,0,11,0.2), "lines"))

The variantCounts table is sorted by row sums, so this will give you the top 50 indel alignments by total reads, possibly excluding the chimeras which are always reported in the last row as they can be a mixture of different types. If you prefer to order by variant proportion, you would need to sort the variantCounts table, e.g.

vc <- variantCounts(crispr_set, include.nonindel = FALSE, result = "proportions")
allele_names <- rownames(vc[order(rowSums(vc), decreasing = TRUE),])[1:n_keep]

Best, Helen

axgraf commented 5 years ago

Hi Helen, great explanation, thanks for clarifying!!! By the way, a really great tool to work with!

Best Alex