thackl / gggenomes

A grammar of graphics for comparative genomics
https://thackl.github.io/gggenomes/
Other
581 stars 64 forks source link

Synteny plot from several segments to one specific segment in the plot #69

Closed StromTroopers closed 3 years ago

StromTroopers commented 3 years ago

Hello, first of all I would like to thank you for this package (I find a lack of packages that allow to really plot genomic data in synteny).

I am writing this post to seek some help in the particular use of gggenomes. Indeed in my data I need to realize a rather simple but particular figure.

Here are my data:

> Segments
   Seq_id Name Start  End Strand
1      S1   G1   300  330     -1
2      S1   G2    40   65      1
3      S2   G3  1500 1530     -1
4      S3  A10   170  230     -1
5      S3   G5   600  630      1
6      S3   G6  2600 2670      1
7   Virus   G1    20   50      1
8   Virus   G2    70  110      1
9   Virus   G3   250  270     -1
10  Virus   G4   400  500     -1
11  Virus   G5   800  830      1
12  Virus   G6   900  950      1

With this dataframe I managed with the ggenes package to create the following plot :

Capture d’écran 2021-08-20 à 09 56 37

And then I would need to add synteny links between the Sn scaffolds and their corresponding Name in the Virus scaffold and get something like this :

Capture d’écran 2021-08-20 à 10 00 14

I thing ggenomes cand do that but I did not find a way to do it this way ... Can you help me to better understand the code please ? Thanks a lot for your help.

Here is the dput table format :

structure(list(Seq_id = structure(c(1L, 1L, 2L, 3L, 3L, 3L, 4L, 
4L, 4L, 4L, 4L, 4L), .Label = c("S1", "S2", "S3", "Virus"), class = "factor"), 
    Name = structure(c(2L, 3L, 4L, 1L, 6L, 7L, 2L, 3L, 4L, 5L, 
    6L, 7L), .Label = c("A10", "G1", "G2", "G3", "G4", "G5", 
    "G6"), class = "factor"), Start = c(300L, 40L, 1500L, 170L, 
    600L, 2600L, 20L, 70L, 250L, 400L, 800L, 900L), End = c(330L, 
    65L, 1530L, 230L, 630L, 2670L, 50L, 110L, 270L, 500L, 830L, 
    950L), Strand = c(-1L, 1L, -1L, -1L, 1L, 1L, 1L, 1L, -1L, 
    -1L, 1L, 1L)), class = "data.frame", row.names = c(NA, -12L
))
thackl commented 3 years ago

Thanks for reaching out, happy help!

d0 <- structure(
  list(
    Seq_id = structure(
      c(1L, 1L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L),
      .Label = c("S1", "S2", "S3", "Virus"),
      class = "factor"
    ),
    Name = structure(
      c(2L, 3L, 4L, 1L, 6L, 7L, 2L, 3L, 4L, 5L, 6L, 7L),
      .Label = c("A10", "G1", "G2", "G3", "G4", "G5", "G6"),
      class = "factor"
    ),
    Start = c(300L, 40L, 1500L, 170L, 600L, 2600L, 20L, 70L, 250L, 400L, 800L, 900L),
    End = c(330L, 65L, 1530L, 230L, 630L, 2670L, 50L, 110L, 270L, 500L, 830L,950L),
    Strand = c(-1L, 1L, -1L, -1L, 1L, 1L, 1L, 1L, -1L, -1L, 1L, 1L)),
    class = "data.frame", row.names = c(NA, -12L));

# Convert to recognized column names (lower case)
# https://thackl.github.io/gggenomes/articles/gggenomes.html#inside-gggenomes-tracks
d1 <- rename_with(d0, str_to_lower)                                                                                                  
gg1 <- gggenomes(d1) + geom_seq() + geom_gene(aes(fill=name))
gg1

image

Note, without seq info, gggenomes zooms in on just the genes, i.e. trims seqs before/after the first/last gene. You can prevent that trimming in front with infer_start=0, but the right length of sequences can obviously not be guessed from the gene coordinates alone.

gg2 <- gggenomes(d1, infer_start=0) + geom_seq() + geom_gene(aes(fill=name))
gg2

image

The easiest way to connect genes is to `add_clusters()` and draw the connections with geom_link(). All you need is a simple data.frame with two columns (cluster_id, feat_id). Since your example genes did not have a unique ID column, but their names seem to represent their "clusters", we can use the automatically assigned gene IDs from gggenomes to create the clusters data.frame.

clusters <- pull_genes(gg2) %>% select(cluster_id=name, feat_id)
# A tibble: 12 x 2
   cluster_id feat_id
   <chr>      <chr>  
 1 G1         f1     
 2 G2         f2     
 3 G3         f3     
 4 A10        f4     
 5 G5         f5     
 6 G6         f6     
 7 G1         f7     
 8 G2         f8     
 9 G3         f9     
10 G4         f10    
11 G5         f11    
12 G6         f12 
gg3 <- gg2 %>% add_clusters(clusters) + geom_link()
gg3

image

As you may notice, this only draws links between adjacent genomes, i.e. not from the top to the bottom genomes. That's gggenomes default behavior because for more complex datasets drawing all links quickly becomes very slow and very messy. In theory, drawing all links is possible, but trying it out just now, I noticed that I haven't properly implemented that yet (#70).

Note also, that if you rearrange genomes, the links will be updated to again show all links between adjacent genomes.

gg3 %>% pick(1,4,3,2)

image

Let me know if you need more info are have additional questions! Also, do you think being able to draw all links is an important feature?

StromTroopers commented 3 years ago

Wow thank you for the rapidity ! Ok I see, so what if instead of drawing all links in different rows we could only draw something in the same axe such as that for instance :

Capture d’écran 2021-08-20 à 14 48 20

Maybe it would be better right ? Do you think it is still possible with ggenomes?

thackl commented 3 years ago

Ah, didn't get that right away. Absolutely. One of the reasons I wrote gggenomes is to support multiple contigs per genome. To group seqs (contigs/chromosomes) into genomes, you need to add one more variable bin_id. "bin" here is short for genome, inspired by metagenomic bins.

If bin_ids aren't provided, as in your dataset, gggenomes just sets bin_id=seq_id, i.e. treats every sequence as a different genome. A somewhat hacky but quick way for your data is to change how gggenomes computes the bin_id from the seq_id. That can be done by providing an expression to infer_bin_id. In your case, we can just use only the first character. You would also get the same result by adding an additional bin_id column to your gene table.

# hacky
gg4 <- gggenomes(d1, infer_bin_id = str_sub(seq_id, 1, 1)) %>%
  add_clusters(clusters) +
  geom_seq() + geom_gene(aes(fill=name)) + geom_link() +
  geom_bin_label()

# equivalent
d2 <- mutate(d1, bin_id = str_sub(seq_id, 1, 1))
gg4 <- gggenomes(d2) %>%
  add_clusters(clusters) +
  geom_seq() + geom_gene(aes(fill=name)) + geom_link() +
  geom_bin_label()

gg4

image

What is missing are the nice ticks you have at the end of the sequences to indicate that they have been trimmed. I thought about adding such a geom, but haven't gotten around to it yet.

The cleaner way to do this, though, would probably be to add a sequence table, which then could also include the lengths of each contig.

s0 <- tribble(
  ~bin_id, ~seq_id, ~length,
  "Virus", "Virus", 3000,
  "S", "S1", 400,
  "S", "S2", 1600,
  "S", "S3", 1000
)

gg5 <- gggenomes(d1, s0) %>% add_clusters(clusters) +
  geom_seq() + geom_gene(aes(fill=name)) + geom_link() +
  geom_bin_label()
gg5

image

And if necessary, you could further modify the plot to arrange contigs in the right order/direction with pick_seqs/flip_seqs.

gg5 %>% flip_seqs(S1) + geom_seq_label(nudge_y=-.05)

image

thackl commented 3 years ago

I've now also implemented drawing links between genomes that are not adjacent to each other (https://github.com/thackl/gggenomes/issues/70#issuecomment-902998817). I will publish the updated code shortly.

StromTroopers commented 3 years ago

Thank you very much for your help and involvement, all this code will be very useful for me later on!