thackl / gggenomes

A grammar of graphics for comparative genomics
https://thackl.github.io/gggenomes/
Other
579 stars 64 forks source link

plot multiple segments of a sequence in a single bin #101

Closed cgjosephlee closed 2 years ago

cgjosephlee commented 2 years ago

Let's say I want to compare (seqA:1-1000, seqA:100001-101000) to seqB:1-2000 and plot links. I know that a seq_id can only show up once in a bin. Is there any trick to do this instead of making a dummy seqA' ?

thackl commented 2 years ago

Good question! One way is to use focus(). It can zoom in on regions/loci, including multiple from the same parent sequence, essentially by creating dummy seqs internally.

Usually, it does so in an automated way, but you can also directly supply a data.frame with regions of interest (see below). Is this what you were aiming for?

library(tidyverse)
library(gggenomes)
library(devtools)
devtools::load_all("~/Code/projects/gggenomes")

seqs <- tribble(
  ~seq_id, ~length,
  "seqA", 120000, 
  "seqB", 120000
)

links <- tribble(
  ~seq_id, ~start, ~end, ~seq_id2, ~start2, ~end2,
  "seqA", 300, 500, "seqB", 400, 600,
  "seqA", 100300, 100500, "seqB", 400, 600,
)

# plot links as is
gggenomes(seqs=seqs, links=links) +
  geom_seq() +
  geom_link()

image

# Let's say I want to compare (seqA:1-1000, seqA:100001-101000) to seqB:1-2000
loci <- tribble(
  ~locus_id,~seq_id, ~start, ~end,
   "seqA_locus1", "seqA", 1, 1000,
   "seqA_locus2", "seqA", 100001, 101000,
   "seqB_locus1", "seqB", 1, 2000)

gggenomes(seqs=seqs, links=links) %>%
  focus(.loci=loci) +
  geom_seq_label() + geom_bin_label() +
  geom_seq() +
  geom_link() +
  NULL

image

gggenomes(seqs=seqs, links=links) %>%
  focus(.loci=loci, .locus_bin = "locus") + # make every locus a bin
  geom_seq_label() + geom_bin_label() +
  geom_seq() +
  geom_link() +
  NULL

image

cgjosephlee commented 2 years ago

Wow it's so clever. I will give it a try and thank you.

thackl commented 2 years ago

FYI, your question made me aware of a few related issues (#102,#103), which I just fixed. This means you can even simplify further. If you use manual loci, you don't need to provide a locus_id (will be computed automatically).

And if you want to zoom in on all links you have, you can now just call focus() without any arguments to automatically determine the interesting loci.

gggenomes(seqs=seqs, links=links) %>%
  focus() + # zoom in on all regions with links
  geom_seq_label() + geom_bin_label() +
  geom_seq() +
  geom_link()

image

cgjosephlee commented 2 years ago

Thank you so much. Finally have time to work this out. I came with a question: how do I assign locus_bin manually in focus(.locus_bin = "bin") option? I tried bin_id column name in loci df, but it seemed not working by warning Column `bin_id` doesn't exist.

thackl commented 2 years ago

I think I know what you mean but I'm not entirely sure. Could you post some example data to try it out?

cgjosephlee commented 2 years ago

Continue with the previous one:

loci <- tribble(
  ~locus_bin, ~locus_id, ~seq_id, ~start, ~end,
   "A", "seqA_locus1", "seqA", 1, 1000,
   "A", "seqA_locus2", "seqA", 100001, 101000,
   "B", "seqB_locus1", "seqB1", 1, 2000,
   "B", "seqB_locus2", "seqB2", 3000, 5000
)

and group the layout by locus_bin.

thackl commented 2 years ago

Ah, ok. For this to work, set the bin_id already when creating the seqs df, and not in the loci df.

seqs <- tribble(
  ~seq_id, ~length, ~bin_id,
  "seqA", 120000, "A",
  "seqB1", 60000, "B",
  "seqB2", 50000, "B"
)

links <- tribble(
  ~seq_id, ~start, ~end, ~seq_id2, ~start2, ~end2,
  "seqA", 300, 500, "seqB1", 400, 600,
  "seqA", 10300, 10500, "seqB1", 400, 600,
  "seqA", 10300, 10500, "seqB2", 400, 600,
  "seqA", 300, 500, "seqB2", 3400, 3600,
)

loci <- tribble(
  ~locus_id, ~seq_id, ~start, ~end,
  "seqA_locus1", "seqA", 1, 1000,
  "seqA_locus2", "seqA", 100001, 101000,
  "seqB1_locus1", "seqB1", 1, 2000,
  "seqB2_locus2", "seqB2", 3000, 5000
)

gggenomes(seqs=seqs, links=links) %>%
  focus(.loci=loci, .locus_bin="bin") +
  geom_seq_label() + geom_bin_label() +
  geom_seq() +
  geom_link() +
  NULL

image

cgjosephlee commented 2 years ago

Thanks for your quick reply. I am trying to generate multiple complex plots automatically involving like n species x m chromosomes x p regions with minimal manipulation on raw data (currently I read a batch of .fai and .bed for seqs and genes, respectively).

Here comes a more complex one 😂

loci <- tribble(
  ~locus_bin, ~locus_id, ~seq_id, ~start, ~end,
   "A", "seqA_locus1", "seqA", 1, 1000,
   "A", "seqA_locus2", "seqA", 100001, 101000,
   "B", "seqB_locus1", "seqB1", 1, 2000,
   "B", "seqB_locus2", "seqB2", 3000, 5000,
   "X", "seqA_locus3", "seqA", 2000, 5000,
   "X", "seqB_locus3", "seqB1", 2000, 5000
)

For my very edge case, ultimately, it would be easier if I can assign locus_bin in loci df. Then I can use same set of seqs and genes for multiple different plots.

thackl commented 2 years ago

Ah, I see. That's tricky. The problem, in this case, is that when you first create the gggenomes objects from seqs, genes and links, then genes and links "inherit" a bin_id from seqs (and if you did not provide a bin_id in seqs, it defaults to bin_id=seq_id). What you are trying to do is overwrite the bin_id in the focus step. That (if it where possible) would, however, only overwrite it for seqs. It would break the mapping to genes and links because their bin_ids would still be initial ones. And synchronizing bin_ids across those tracks during focus would be complicated. Especially, if as in your case, you want to assign genes from the same seq (which at the start have the same bin_id) to different bins based on their position...

The way I designed it, the seq_id is the central key that glues the datasets together. bin_id is just an extra grouping variable that is attached to every seq and everything derived from that seq. What's especially tricky in your example, is that you not only want to reassign seqs from one bin to another, but actually want to have loci from the same seq in different bins, which essentially means having the one seq in several bins at once...

Bottom line, I don't think there is a straightforward way of achieving what you want. Although, I don't think it's impossible. But you might have to take a slightly different approach. Would you be willing to share a bit more about your data, and how you are trying to construct the plot? How, for example, are you computing the locus_bin that you want to set in the end? If you aren't comfortable sharing that info here, we can talk via email (thackl@lim4.de)

cgjosephlee commented 2 years ago

Thank you for explaining the difficulties. My scenario is: I have several target genes. Each species may have 0 to n copies, and they may lie on same or different chromosomes. I would like to plot neighboring 20 genes (10 upstream and 10 downstream, so the length is variable. Maybe there are overlaps...) and investigate the conservity of gene synteny. The genes are colored or linked by ortholog groups. The rows are grouped by species.

Imagination: Like igv browser, I can read whole genome data at once, then I only need to "zoom on" (or clip) wanted regions.

The solution in my brain is: I can write scripts to generate the coordinates of all wanted regions in bed format. Read all genome (.fai), annotation (.gff or .bed) and cluster (orthology information in this case) files at once. By looping the region files I can generate multiple plots.

Some thoughts: What if creating dummy subsets for each locus? But the links might be the problem... I used to create copies at whole genome level (e.g. append _copy to every seq ID), as well as related annotations files, to avoid seq ID problem like this. In some cases, maybe concatenate all seqs into a giant dummy one. I can still read all the copies at once, but need to manually edit the region files.

Example: https://www.pnas.org/content/pnas/117/49/31267/F4.large.jpg