thackl / gggenomes

A grammar of graphics for comparative genomics
https://thackl.github.io/gggenomes/
Other
605 stars 65 forks source link

Multiple coverage tracks supported? #184

Closed AlexSCFraser closed 6 months ago

AlexSCFraser commented 6 months ago

Hi,

I love the library but I'm trying to add a second base data track and not sure if I'm doing something wrong or if this isn't supported.

Basically I have a gene plot that works great and then I try to add two lines of base level coverage data like so:

p3 <- p2 %>% add_feats(base_data) %>% add_feats(base_data_2)
p4 <- p3 + geom_wiggle(aes(z = Depth, color="blue"), geom = "line", offset = -0.3, alpha=.75) +
      geom_wiggle(aes(z = Depth_2, color="red"), geom = "line", offset = -0.3, alpha=.75)

The result is that only the first coverage scores in the "Depth" variable get plotted, not the second set of scores in "Depth_2". I have tried using only one base_data bedgraph table in case only one feature table is supported, but that didn't seem to change anything. At this point I'm thinking maybe only a single geom_wiggle at a line is supported to draw lines? Is that correct?

Ideally I want to import the two seperate bed files (e.g. base_data and base_data_2) separately since the compressed version of the format does not force equal line counts in the tsv for the same genome, but that can be adjusted with bedtools if necessary. And then render multiple lines of coverage data offset from the genome track.

Worst case I can manually merge two independent plots together but it would be tedious and prone to alignment error, so I want to be sure this is an unsupported feature before manually mucking about with the graphs.

iimog commented 6 months ago

Hi, there are two potential issues here. The first one is, that color is provided with a fixed value inside aes. This way, both tracks end up with the same color, which is probably neither red nor blue. If you move color out of aes, you should see that there are indeed two lines. The second issue does not seem to make problems in your case, but if you add multiple features, it is good to explicitly specify, which one you want to use in the data attribute (e.g. data=feats(base_data)). See this reprex:

library(gggenomes)
emale_gc2 <- emale_gc %>% mutate(score2=score*(-.8)) %>% select(-score)
gggenomes(emale_genes, seqs=emale_seqs) %>%
  add_feats(emale_gc) %>%
  add_feats(egc) +
    geom_seq() +
    geom_wiggle(aes(z = score), geom = "line", offset = -0.3, alpha=.75, data=feats(emale_gc), color="blue") +
    geom_wiggle(aes(z = score2), geom = "line", offset = -0.3, alpha=.75, data=feats(egc), color="red")

It produces the following output image

AlexSCFraser commented 6 months ago

Thanks for the example! I was able to get my data tracks rendering well with this. I want to add a legend to label what substrate the different colours base level data represents, but the examples from ggplot documentation on setting manual aesthetics don't seem to be playing nice with gggenomes and neither are some other methods I found based on adding custom guide aesthetics.

This renders a plot with my data, similar to your example but with my gene features and annotations as well.

    p3 <- p2 |> add_feats(base_data) |> add_feats(base_data_2)
    p4 <- p3 + geom_coverage(aes(z = Depth), data=feats(base_data), geom = "line", offset = -0.3, color="blue", alpha=.75) +
      geom_coverage(aes(z = Depth_2), data=feats(base_data_2), geom = "line", offset = -0.3, color="red", alpha=.75)

But when I try and name the color aesthetic in a similar manner to how ggplot docs would add a legend for line colours:

    data_name_1 <- names(linecolors)[1]
    data_name_2 <- names(linecolors)[2]

    p3 <- p2 |> add_feats(base_data) |> add_feats(base_data_2)
    p4 <- p3 + geom_coverage(aes(z = Depth), data=feats(base_data), geom = "line", offset = -0.3, color=data_name_1, alpha=.75) +
      geom_coverage(aes(z = Depth_2), data=feats(base_data_2), geom = "line", offset = -0.3, color=data_name_1, alpha=.75) +
     scale_colour_manual("RNASeq Coverage", linecolors)

I get an error

! Problem while setting up geom aesthetics.
ℹ Error occurred in the 9th layer.
Caused by error in `del_from:length(app$styles)`:
! NA/NaN argument

Some of the explanations I could find online use the guides() function to modify scale_colour_manual, but this just ends up overwriting the gene feature colour legend with all red, or not adding a second legend at all.

Is there a small example of how to add a legend for the multiple coverage lines to a gggenomes plot?

iimog commented 6 months ago

I don't know what causes this issue in your case. This minimal example works for me:

library(gggenomes)
emale_gc2 <- emale_gc |> dplyr::mutate(score2=score*(-.8)) |> dplyr::select(-score)
gggenomes(emale_genes, seqs=emale_seqs) |>
  add_feats(emale_gc) |>
  add_feats(emale_gc2) +
    geom_seq() +
    geom_wiggle(aes(z = score, color="original"), geom = "line", offset = -0.3, alpha=.75, data=feats(emale_gc)) +
    geom_wiggle(aes(z = score2, color="changed"), geom = "line", offset = -0.3, alpha=.75, data=feats(emale_gc2)) +
    scale_color_manual("Coverage", values=c("original"="blue", "changed"="red"))

it produces this figure image

Maybe there is a problem if other geoms (e.g. genes or features) already use a color aesthetic, in this case ggnewscale might help (see https://github.com/thackl/gggenomes/issues/159#issuecomment-1750085923).

Also feel free to share a reproducible example if you still get errors.

AlexSCFraser commented 6 months ago

Thanks for the help, I figured out the problem, which seems to be that you can't construct a named list inplace inside the scale_colour_manual() call when the color names have to be pulled from a variable, and you also can't pass a named list to the values arg.

the relevant section of my code now looks like this:

    data_name_1 <- names(linecolors)[1]
    data_name_2 <- names(linecolors)[2]

    p3 <- p2 |> add_feats(base_data) |> add_feats(base_data_2)
    p4 <- p3 + 
      geom_coverage(aes(z = Depth, color=data_name_1), data=feats(base_data), geom = "line", offset = line_offset, alpha=.75) +
      geom_coverage(aes(z = Depth_2, color=data_name_2), data=feats(base_data_2), geom = "line", offset = line_offset, alpha=.75) +
      scale_color_manual("RNASeq Coverage", values=unlist(linecolors)) +
      guides(color = guide_legend(order = 2), 
             fill = guide_legend(order = 1))

So basically I just had to use unlist() to extract the values from a named list to a named vector and then I also used guides to explicitly set the legend order, since it randomly swapped for some reason. The error messages were entirely uninformative because the type errors seemed to cause gggenomes to have disordered track data and this caused it to fail, but this is a general R problem rather than anything wrong in gggenomes.

I do think it might be helpful to add your example of multiple coverage data with legend using the example dataset to the docs, since I assume others might have similar plots they want to make and it wasn't clear to me the correct syntax for doing so.

Thanks again for this great plotting package!