markrobinsonuzh / CrispRVariants

23 stars 4 forks source link

Internal insertion sequence function? #9

Open rleenay opened 6 years ago

rleenay commented 6 years ago

Hello,

I am looking through some NHEJ data, and the crispr_set_list variable only lists the location of the insertion, but not the inserted sequence itself. plotVariants() is able to demonstrate what the insertion is with a key at the bottom of the plot. I was looking through the plotVariants() source code trying to pull out said function, but I'm having a bit of trouble. Is there a function that can display what the inserted nucleotide(s) is(are) from a crispr_set_list variable?

Thank you!

HLindsay commented 6 years ago

Hi,

There's a couple of ways of doing this, because I'm refactoring the variant counting code to allow more flexible counting methods. I'm actually trying to get an NHEJ pipeline finished for the May Bioconductor release, so I'm happy to discuss any specific requests you have here.

When you call the plotAlignments,CrisprSet-method, it calls CrisprSet$plotVariants to format the input, which then calls theplotAlignments,character-method. If you want to get all of the information that gets passed to plotAlignments,character-method, you can use plotAlignments(crispr_set, create.plot = FALSE) (where crispr_set is your CrisprSet object). This gives you a named list, including a data.frame of the insertion locations named ins.sites.

If you just want the table of insertions, it's one of the attributes of a CrisprSet object, you can access it with crispr_set$insertion_sites.

Finally, there's a fairly new not yet exported method called getInsertionsTable which works on any GAlignments object. I added this in January, I think it's only in the devel branch. This can be run like:

library(CrispRVarants)
data("gol_clutch1")
gol_alns <- alns(gol)
CrispRVariants:::getInsertionsTable(gol_alns[[2]])

In fairly new versions, the allele label that is used in plotting is stored as a metadata column in the alignments, which you can access using:

library(GenomicAlignments)
mcols(gol_alns[[2]])$allele

The difference between these options is that getInsertionsTable returns a data.frame of every insertion, whereas crispr_set$insertion_sites aggregates these by sequence. You can see the getInsertionsTable code in the active Github repo: https://github.com/HLindsay/CrispRVariants/blob/master/R/getInsertionsTable.R

Best wishes,

Helen

rleenay commented 6 years ago

So if I am understanding correctly, there is no current function that operates on existing crispr_set variables?

I ran crispr_set$insertion_sites but as you mentioned, this is just aggregated insertion sites which are difficult to separate into individual repair outcomes (such as those in plotVariants()).

I also ran getInsertionsTable successfully, but this only runs on the GAlignments class of variables. Does this not work on a CrisprSet? How are you calculating the numbers displayed on the right hand panel of the plotVariants() plot from the insertion_sites table? Is there an internal function that can convert the insertion_sites table into individual repair outcomes, a.k.a. generate the plotVariants() data but in table form?

Thank you again!

HLindsay commented 6 years ago

Hi,

Is there an internal function that can convert the insertion_sites table into individual repair outcomes, a.k.a. generate the plotVariants() data but in table form?

plotVariants is a layout function that arranges the left plot created by plotAlignments, the right plot created by plotFreqHeatmap, and optionally the top gene transcript plot. Both plotAlignments and plotFreqHeatmap have a create.plot argument. If you call these with a CrisprSet object and create.plot = FALSE you will get a list of the information that goes into the plots. You can also get just the table of variant counts / proportions that is used in the heatmap using the function variantCounts, which accepts the same filtering arguments as plotFreqHeatmap.

How are you calculating the numbers displayed on the right hand panel of the plotVariants() plot from the insertion_sites table?

The right hand panel doesn't use the insertion_sites table. The (current) counting procedure is to give all read sequences a tag based on the location and size of insertions or deletions, or mismatches when there are no indels. A CrisprSet object stores the aligned reads as a GAlignments object for each sample. You can access the alignments themselves with e.g. cs_alns <- alns(crispr_set). Then the allele tag each alignment receives is stored in the metadata of the GAlignments, e.g.

library(GenomicAlignments)
mcols(cs_alns[[1]])$allele

The data that goes into the right hand panel is created by counting the tags for each sample. At present sequences with the same indel size and location get the same tag.

I also ran getInsertionsTable successfully, but this only runs on the GAlignments class of variables. Does this not work on a CrisprSet?

No, but you could make a function to work on a CrisprSet, e.g. if you want a list of one table per sample:

getInsertions <- function(cset){
   cset_alns <- alns(cset)
   lapply(cset_alns, CrispRVariants:::getInsertionsTable)
}

Best wishes, Helen

rleenay commented 6 years ago

Helen,

Thank you for that detailed explanation, it really helped me understand how everything is being counted. Would it at all be possible to generate a create.plot = FALSE for the plotVariants() function?

Since these numbers are classified differently than the crispr_set$insertion_sites it would be extremely helpful to export these values in a table similar to plotAlignments. Creating another function to calculate these tags may end up creating a number of mis-counts. If not - that's definitely understandable!

Thank you!

HLindsay commented 6 years ago

Hi,

Adding a create.plot argument to plotVariants() isn't straightforward as some of the arguments for the allele plot (left) and heatmap (right) have the same names but different meanings. However, assuming you don't also need the arguments that are used to create the transcript plot at the top, the following function can be called in the same way as plotVariants and will give you the information needed to make the left and right plots.

 plotVariantsArgs <- function(cset, plotAlignments.args = list(), plotFreqHeatmap.args = list()){
    # Get the information used in the allele plot
    hm_args <- do.call(plotFreqHeatmap, c(obj = cset, create.plot = FALSE, plotFreqHeatmap.args))

    # Get the information used in the heatmap
    pa_args <- do.call(plotAlignments, c(obj = cset, create.plot = FALSE, plotAlignments.args))

    # Change names of list elements as some names are shared
    names(pa_args) <- paste0("left_", names(pa_args))
    names(hm_args) <- paste0("right_", names(hm_args))
    c(pa_args, hm_args)
+ }

This returns a list of all of the parameter values that are used in creating the plots. I've added "left" and "right" to the names of the list items belonging to the allele plot and heatmap so that they can be distinguished.

By the way, to create the plot starting from the argument list instead of a CrisprSet object, you can do for instance:

pa_args <- plotAlignments(cset, create.plot = FALSE)
# Modify pa_args as desired here
allele_plot <- do.call(plotAlignments, pa_args)
# Display plot
allele_plot

Then to put the plots together again, you could use the package "cowplot", but here's what CrispRVariants does internally:

pa_args <- plotAlignments(cset, create.plot = FALSE)
allele_plot <- do.call(plotAlignments, pa_args)

hm_args <- plotFreqHeatmap(cset, create.plot = FALSE)
heatmap_plot <- do.call(plotFreqHeatmap, hm_args)

# A blank plot is created if a transcript plot is not required
blank_plot <- grid::grid.rect(gp=grid::gpar(col="white"), draw = FALSE)
CrispRVariants:::arrangePlots(blank_plot, allele_plot, heatmap_plot)

Best wishes, Helen

amirmohan commented 6 years ago

Hi Helen,

I am also running into the same problem of finding the insertion sequences and their corresponding number of reads. I am looking at the output of the getInsertionsTable I believe. The first line of the file reads as: "start","seq","genomic_start","cigar_label","idxs","count","sample". I wanted to know what is the variable "idxs" and can it potentially help to find out the portion of reads that correspond to that particular insertion type? Any insight would be appreciated, thanks!

HLindsay commented 6 years ago

Hi,

I'm sorry for replying so late, I've been away and missed your question. The variable "idxs" is the index within the input alignments of the insertion. Using the CrispRVariants example data:

library("CrispRVariants")
data("gol_clutch1")

# Get the alignments for the fifth sample
gol_s5 <- alns(gol)[[5]]

# These are the labels that CrispRVariants would use in plotting
mcols(gol_s5)$allele
(Output)
[1] "-1:3I,5:21I" "-6:10D"      "-6:10D"      "-1:3I,5:21I" "2:24I"      
[6] "2:24I"       "2:24I"       "-6:10D"      "-1:3I,5:21I"

The first aligned read has two insertions - a 3 base pair and a 21 base pair insertion. You see that the first two rows of getInsertionsTable() match these insertions:

CrispRVariants:::getInsertionsTable(gol_s5)
(Output)
    start                      seq genomic_start idx
1 4647393                      GTC       4647392   1
2 4647398    CTCTCTTGGATCTCGCAGGAN       4647397   1
3 4647393                      GTC       4647392   4
4 4647398    CTCTCTTGGATCTCGCANGAN       4647397   4
5 4647395 NTNNNTCTCTCTTGGATCTCGCAG       4647394   5
6 4647395 CTTGGNNNNTCTTGGATCTCGCAN       4647394   6
7 4647395 CTTGGTCTCTCTTGGATCTCGCAG       4647394   7
8 4647393                      GTC       4647392   9
9 4647398    CTCTCTTGGATCTCGCAGGAN       4647397   9

getInsertionsTable relies on functions from the package GenomicAlignments, particularly cigarRangesAlongQuerySpace to get the positions of the insertions as an IRangesList and extractAt to get the sequences at these positions from the reads. I haven't exported getInsertionsTable as it's really preparing the data for plotting, but I'm happy to export it if it's something you need.

Best, Helen

amirmohan commented 6 years ago

Hi Helen,

Thanks for your reply. Now I understand what "idx". Unfortunately this does not seem to solve my problem. Let me explain my problem and see if we can find an easy fix. Lets say the outcome of the CRISPR is only 2 types of reads:

[1] "-1:1I" with seq='A', 500 reads [2] "-1:1I" with seq='T', 1000 reads

In other words the outcome is reads with insertions of length 1 at location -1 with sequences 'A' (500 reads) and 'T' (1000 reads). My problem is how to figure out the number of reads for each insertion type, i.e., 500 and 1000? It seems that the output of the CrispRVariants will only tell me the number of reads in total (1500) and not the splits. I hope this make sense.

Any help would be appreciated. Thanks!

Best, Amirali

On Tue, Apr 24, 2018 at 1:40 AM, HLindsay notifications@github.com wrote:

Hi,

I'm sorry for replying so late, I've been away and missed your question. The variable "idxs" is the index within the input alignments of the insertion. Using the CrispRVariants example data:

library("CrispRVariants") data("gol_clutch1")

Get the alignments for the fifth samplegol_s5 <- alns(gol)[[5]]

These are the labels that CrispRVariants would use in plotting

mcols(gol_s5)$allele

(Output) [1] "-1:3I,5:21I" "-6:10D" "-6:10D" "-1:3I,5:21I" "2:24I" [6] "2:24I" "2:24I" "-6:10D" "-1:3I,5:21I"

The first aligned read has two insertions - a 3 base pair and a 21 base pair insertion. You see that the first two rows of getInsertionsTable() match these insertions:

CrispRVariants:::getInsertionsTable(gol_s5)

(Output) start seq genomic_start idx 1 4647393 GTC 4647392 1 2 4647398 CTCTCTTGGATCTCGCAGGAN 4647397 1 3 4647393 GTC 4647392 4 4 4647398 CTCTCTTGGATCTCGCANGAN 4647397 4 5 4647395 NTNNNTCTCTCTTGGATCTCGCAG 4647394 5 6 4647395 CTTGGNNNNTCTTGGATCTCGCAN 4647394 6 7 4647395 CTTGGTCTCTCTTGGATCTCGCAG 4647394 7 8 4647393 GTC 4647392 9 9 4647398 CTCTCTTGGATCTCGCAGGAN 4647397 9

getInsertionsTable relies on functions from the package GenomicAlignments, particularly cigarRangesAlongQuerySpace to get the positions of the insertions as an IRangesList and extractAt to get the sequences at these positions from the reads. I haven't exported getInsertionsTable as it's really preparing the data for plotting, but I'm happy to export it if it's something you need.

Best, Helen

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/markrobinsonuzh/CrispRVariants/issues/9#issuecomment-383851810, or mute the thread https://github.com/notifications/unsubscribe-auth/AO9zx4WoKC3cpktAYnMxJZiO4Kk_lknoks5truT6gaJpZM4SuYag .

-- Amirali Aghazadeh Electrical Engineering Department Stanford University

HLindsay commented 6 years ago

Hi Amirali,

CrisprSet objects contain a table of insertion_sites, e.g. using the "gol" data:

ins <- gol$insertion_sites
ins
(Output using v1.7.12)
   start                      seq genomic_start cigar_label    idxs count sample
2     22                      GTC       4647392 -1:3I,5:21I 1, 4, 9     3      5
8     22                      GTC       4647392 -1:3I,5:21I       2     1      6
1     24 CTTGGTCTCTCTTGGATCTCGCAG       4647394       2:24I       2     1      2
5     24 CTTGGNNNNTCTTGGATCTCGCAN       4647394       2:24I       6     1      5
6     24 CTTGGTCTCTCTTGGATCTCGCAG       4647394       2:24I       7     1      5
7     24 NTNNNTCTCTCTTGGATCTCGCAG       4647394       2:24I       5     1      5
10    24 CTTGGTCTCTCTTGGATCTCGCAG       4647394       2:24I       9     1      6
11    24 CTTGGTCTCTCTTGGATCTCGCAN       4647394       2:24I    4, 5     2      6
3     27    CTCTCTTGGATCTCGCAGGAN       4647397 -1:3I,5:21I    1, 9     2      5
4     27    CTCTCTTGGATCTCGCANGAN       4647397 -1:3I,5:21I       4     1      5
9     27    CTCCCTTGGATCTCGCAGGAN       4647397 -1:3I,5:21I       2     1      6

This is an aggregated version of the output from getInsertionsTable(), with columns giving the start with respect to the guide region, the inserted sequence, genomic coordinate, variant label, indices within each sample, number of occurrences and sample index. For example, for the allele "2:24I" in the 5th sample, the inserted sequences "CTTGGNNNNTCTTGGATCTCGCAN", "CTTGGTCTCTCTTGGATCTCGCAG" and"NTNNNTCTCTCTTGGATCTCGCAG" each occur in one read.

This insertion_sites table behaves slightly differently in earlier package releases. In the latest version 1.7.12, I've changed the idxs column to be consistent with the output from getInsertionsTable as the original interpretation (index within the table of insertions rather than within the aligned reads) is no longer needed. I've just pushed this version to Bioconductor devel, it will take about 24 hours to appear. You can also install it immediately from my Github site using the package devtools, or it will be in the new Bioconductor release which comes out next Tuesday.

Does this give you the information you need? If you are looking at combinations of insertions within reads it might be easier to work from the alignments rather than this table.

In the latest release it's also possible to manually change how reads get grouped into alleles and therefore have different insertion sequences appear separately in plots. This isn't documented yet. I'm trying to get the documentation ready in time for the Tuesday release.

Best, Helen

amirmohan commented 6 years ago

Hi Helen,

This was super helpful. I guess you have mentioned what we want in your last paragraph:

"In the latest release it's also possible to manually change how reads get grouped into alleles and therefore have different insertion sequences appear separately in plots"

We are looking forward to it. Just to clarify though, will this change be also applied to the output of readsToTarget() function, i.e., do we expect to get a crispr_set like this:

            Gene Name

no variant 3 -1:1I (T) 2 -1:1I (A) 5 6:1D 0 1:7I 1 2:1D,4:5I 0 Other 0

rather than

            Gene Name

no variant 3 -1:1I 7 6:1I 0 1:7I 1 2:1D,4:5I 0 Other 0

where the insertions at the same location are separated based on their sequence?

Thanks a lot!

Best, Amirali

On Wed, Apr 25, 2018 at 4:14 PM, HLindsay notifications@github.com wrote:

Hi Amirali,

CrisprSet objects contain a table of insertion_sites, e.g. using the "gol" data:

ins <- gol$insertion_sitesins

(Output using v1.7.12) start seq genomic_start cigar_label idxs count sample 2 22 GTC 4647392 -1:3I,5:21I 1, 4, 9 3 5 8 22 GTC 4647392 -1:3I,5:21I 2 1 6 1 24 CTTGGTCTCTCTTGGATCTCGCAG 4647394 2:24I 2 1 2 5 24 CTTGGNNNNTCTTGGATCTCGCAN 4647394 2:24I 6 1 5 6 24 CTTGGTCTCTCTTGGATCTCGCAG 4647394 2:24I 7 1 5 7 24 NTNNNTCTCTCTTGGATCTCGCAG 4647394 2:24I 5 1 5 10 24 CTTGGTCTCTCTTGGATCTCGCAG 4647394 2:24I 9 1 6 11 24 CTTGGTCTCTCTTGGATCTCGCAN 4647394 2:24I 4, 5 2 6 3 27 CTCTCTTGGATCTCGCAGGAN 4647397 -1:3I,5:21I 1, 9 2 5 4 27 CTCTCTTGGATCTCGCANGAN 4647397 -1:3I,5:21I 4 1 5 9 27 CTCCCTTGGATCTCGCAGGAN 4647397 -1:3I,5:21I 2 1 6

This is an aggregated version of the output from getInsertionsTable(), with columns giving the start with respect to the guide region, the inserted sequence, genomic coordinate, variant label, indices within each sample, number of occurrences and sample index. For example, for the allele "2:24I" in the 5th sample, the inserted sequences "CTTGGNNNNTCTTGGATCTCGCAN", "CTTGGTCTCTCTTGGATCTCGCAG" and"NTNNNTCTCTCTTGGATCTCGCAG" each occur in one read.

This insertion_sites table behaves slightly differently in earlier package releases. In the latest version 1.7.12, I've changed the idxs column to be consistent with the output from getInsertionsTable as the original interpretation (index within the table of insertions rather than within the aligned reads) is no longer needed. I've just pushed this version to Bioconductor devel, it will take about 24 hours to appear. You can also install it immediately from my Github site https://github.com/HLindsay/CrispRVariants using the package devtools, or it will be in the new Bioconductor release which comes out next Tuesday.

Does this give you the information you need? If you are looking at combinations of insertions within reads it might be easier to work from the alignments rather than this table.

In the latest release it's also possible to manually change how reads get grouped into alleles and therefore have different insertion sequences appear separately in plots. This isn't documented yet. I'm trying to get the documentation ready in time for the Tuesday release.

Best, Helen

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/markrobinsonuzh/CrispRVariants/issues/9#issuecomment-384462704, or mute the thread https://github.com/notifications/unsubscribe-auth/AO9zx8tvNjFxc9L1MAHk6NasOlwcy6Hzks5tsQNEgaJpZM4SuYag .

-- Amirali Aghazadeh Electrical Engineering Department Stanford University