thackl / gggenomes

A grammar of graphics for comparative genomics
https://thackl.github.io/gggenomes/
Other
572 stars 64 forks source link

Introducing sequence breaks? #161

Closed trnpl closed 9 months ago

trnpl commented 9 months ago

Hi! First, thanks for this excellent package -- it really generates beautiful plots and I appreciate that it works with eukaryotic datasets as well.

I am using this to generate some microsynteny plots for a locus I am working on. One of the odd features of this locus is that there is about ~10-15kb of non-coding region upstream of my GOI, which I would like to 'trim' so that I can make the useful information bigger. I am imagining a sequence break of at least 5kb or so in the middle of the 'contigs' that I show (see boxed parts of the screenshot below). I browsed the issues page and it seems like you were working on adding some tools to do this but I'm not sure if this particular use case has been implemented. I can likely hack something once I export the image as an svg but looking to hardcode as much of this as possible.

Screen Shot 2023-10-12 at 10 52 29 AM

Thanks for your help!!

thackl commented 9 months ago

Interesting case. Do you have some (dummy) data for a reproducible example that I can play with?

trnpl commented 9 months ago

Here is a file with info on the sequences and genes. Don't think links are really necessary here for this particular case.

seqsandgenes.zip

thackl commented 9 months ago

How about something like this.

library(gggenomes)

s0 <- read_csv("seqs.csv")
g0 <- read_csv("genes.csv")
g0

# full loci
p0 <- gggenomes(g0, s0) +
  geom_seq() +
  geom_gene(aes(fill=str_sub(feature_id, 1,3)))

# magic zoom in on "default" which mean all genes
p1 <- p0 |> focus(.expand = 1e3) 
p1

# decorate sequence breaks
geom_break <- function(mapping_start = NULL, mapping_end = NULL,
    data_start = seqs(start > 1), data_end=seqs(end < length),
    label="//", size=4, hjust=.75, family="bold", stat="identity",
    na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, ...){

  aes_start <- aes(x=x, y=y)
  aes_start <- gggenomes:::aes_intersect(mapping_start, aes_start)

  aes_end <- aes(x=xend, y=y)
  aes_end <- gggenomes:::aes_intersect(mapping_end, aes_end)

  list(
    geom_text(aes_start, data=data_start, label=label, size=size, hjust=hjust,
      family=family, ...),
    geom_text(aes_end, data=data_end, label=label, size=size, hjust=1-hjust,
      family=family, ...)
  )
}

p2 <- p1 + geom_break(label="/")
p2

library(patchwork)
p0 + p1 + p2 + plot_layout(guides="collect") & theme(legend.position = "bottom")
ggsave("break-at-long-non-coding.png", width=10, height=5)

break-at-long-non-coding

thackl commented 9 months ago

oh, and in case you want to flip things :)

p3 <- p0 |> 
  focus(.expand=1e3) |> 
  flip(where(~any(str_sub(.x$feature_id,1,3) == "hea" & .x$strand == "-")), .bin_track = genes)
p3

image

trnpl commented 9 months ago

Thanks! This I think is workable for my purposes :)

trnpl commented 9 months ago

Wait -- one more question actually. Is there a way to print the amount of nucleotides excluded in the breaks?

thackl commented 9 months ago

Two options: 1) locate() does the same as focus, but instead of cutting seqs in the end, it adds a track with the coordinates of the things that focus would cut out. From that table, you can get exact coordinates an compute what was excluded (see code below)

p4 <- p0 |> 
  locate(.expand = 1e3) +
  geom_feat(position = position_nudge(y=-.3))
p4$data$feats$loci

# A tibble: 30 × 15
       y     x  xend bin_id   seq_id start   end     i locus_score locus_id locus_length
   <int> <dbl> <dbl> <chr>    <chr>  <dbl> <dbl> <int>       <int> <glue>          <dbl>
 1    13     0 21286 Dano_sm… Dano_…     1 21286     1           7 Dano_sm…        22101
 2    13 37943 53399 Dano_sm… Dano_… 37944 53399     2           3 Dano_sm…        15456
 3    14     0 20276 Dpal_sm… Dpal_…     1 20276     1           7 Dpal_sm…        21108
 4    14 35778 51267 Dpal_sm… Dpal_… 35779 51267     2           3 Dpal_sm…        15489
 5    12   323 17521 Dpan_sm… Dpan_…   324 17521     1           3 Dpan_sm…        17198
 6    12 33051 54735 Dpan_sm… Dpan_… 33052 54735     2           6 Dpan_sm…        22462
 7    15     0 24170 dana_sm… dana_…     1 24170     1           8 dana_sm…        24985
 8    15 38128 53604 dana_sm… dana_… 38129 53604     2           3 dana_sm…        15476
 9    11     0 16792 datr_sm… datr_…     1 16792     1           5 datr_sm…        17663
10    11 29803 45264 datr_sm… datr_… 29804 45264     2           3 datr_sm…        15461

image

2) You can manually set the coordinates to focus() on via focus(.loci=<tibble(seq_id, start, end)>). With that you would know the sizes of gaps ahead of time.

# first 0-23k and 35-55k for first two seqs
loci <- tribble(
  ~seq_id, ~start, ~end,
  "dana_smallsynteny", 1, 23e3,
  "dana_smallsynteny", 35e3,55e3,
  "Dano_smallsynteny", 1, 23e3,
  "Dano_smallsynteny", 35e3,55e3
)

p5 <- p0 |> focus(.loci=loci) + geom_seq_label()
p5

image

trnpl commented 9 months ago

Extremely helpful, thank you so much!