Closed StromTroopers closed 3 years ago
Thanks for reaching out, happy help!
d0 <- structure(
list(
Seq_id = structure(
c(1L, 1L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L),
.Label = c("S1", "S2", "S3", "Virus"),
class = "factor"
),
Name = structure(
c(2L, 3L, 4L, 1L, 6L, 7L, 2L, 3L, 4L, 5L, 6L, 7L),
.Label = c("A10", "G1", "G2", "G3", "G4", "G5", "G6"),
class = "factor"
),
Start = c(300L, 40L, 1500L, 170L, 600L, 2600L, 20L, 70L, 250L, 400L, 800L, 900L),
End = c(330L, 65L, 1530L, 230L, 630L, 2670L, 50L, 110L, 270L, 500L, 830L,950L),
Strand = c(-1L, 1L, -1L, -1L, 1L, 1L, 1L, 1L, -1L, -1L, 1L, 1L)),
class = "data.frame", row.names = c(NA, -12L));
# Convert to recognized column names (lower case)
# https://thackl.github.io/gggenomes/articles/gggenomes.html#inside-gggenomes-tracks
d1 <- rename_with(d0, str_to_lower)
gg1 <- gggenomes(d1) + geom_seq() + geom_gene(aes(fill=name))
gg1
Note, without seq info, gggenomes zooms in on just the genes, i.e. trims seqs before/after the first/last gene. You can prevent that trimming in front with infer_start=0
, but the right length of sequences can obviously not be guessed from the gene coordinates alone.
gg2 <- gggenomes(d1, infer_start=0) + geom_seq() + geom_gene(aes(fill=name))
gg2
The easiest way to connect genes is to and draw the connections with geom_link()
. All you need is a simple data.frame with two columns (cluster_id, feat_id). Since your example genes did not have a unique ID column, but their names seem to represent their "clusters", we can use the automatically assigned gene IDs from gggenomes to create the clusters data.frame.
clusters <- pull_genes(gg2) %>% select(cluster_id=name, feat_id)
# A tibble: 12 x 2
cluster_id feat_id
<chr> <chr>
1 G1 f1
2 G2 f2
3 G3 f3
4 A10 f4
5 G5 f5
6 G6 f6
7 G1 f7
8 G2 f8
9 G3 f9
10 G4 f10
11 G5 f11
12 G6 f12
gg3 <- gg2 %>% add_clusters(clusters) + geom_link()
gg3
As you may notice, this only draws links between adjacent genomes, i.e. not from the top to the bottom genomes. That's gggenomes default behavior because for more complex datasets drawing all links quickly becomes very slow and very messy. In theory, drawing all links is possible, but trying it out just now, I noticed that I haven't properly implemented that yet (#70).
Note also, that if you rearrange genomes, the links will be updated to again show all links between adjacent genomes.
gg3 %>% pick(1,4,3,2)
Let me know if you need more info are have additional questions! Also, do you think being able to draw all links is an important feature?
Wow thank you for the rapidity ! Ok I see, so what if instead of drawing all links in different rows we could only draw something in the same axe such as that for instance :
Maybe it would be better right ? Do you think it is still possible with ggenomes?
Ah, didn't get that right away. Absolutely. One of the reasons I wrote gggenomes is to support multiple contigs per genome. To group seqs (contigs/chromosomes) into genomes, you need to add one more variable bin_id
. "bin" here is short for genome, inspired by metagenomic bins.
If bin_ids aren't provided, as in your dataset, gggenomes just sets bin_id=seq_id
, i.e. treats every sequence as a different genome. A somewhat hacky but quick way for your data is to change how gggenomes computes the bin_id
from the seq_id
. That can be done by providing an expression to infer_bin_id
. In your case, we can just use only the first character. You would also get the same result by adding an additional bin_id
column to your gene table.
# hacky
gg4 <- gggenomes(d1, infer_bin_id = str_sub(seq_id, 1, 1)) %>%
add_clusters(clusters) +
geom_seq() + geom_gene(aes(fill=name)) + geom_link() +
geom_bin_label()
# equivalent
d2 <- mutate(d1, bin_id = str_sub(seq_id, 1, 1))
gg4 <- gggenomes(d2) %>%
add_clusters(clusters) +
geom_seq() + geom_gene(aes(fill=name)) + geom_link() +
geom_bin_label()
gg4
What is missing are the nice ticks you have at the end of the sequences to indicate that they have been trimmed. I thought about adding such a geom
, but haven't gotten around to it yet.
The cleaner way to do this, though, would probably be to add a sequence table, which then could also include the lengths of each contig.
s0 <- tribble(
~bin_id, ~seq_id, ~length,
"Virus", "Virus", 3000,
"S", "S1", 400,
"S", "S2", 1600,
"S", "S3", 1000
)
gg5 <- gggenomes(d1, s0) %>% add_clusters(clusters) +
geom_seq() + geom_gene(aes(fill=name)) + geom_link() +
geom_bin_label()
gg5
And if necessary, you could further modify the plot to arrange contigs in the right order/direction with pick_seqs/flip_seqs
.
gg5 %>% flip_seqs(S1) + geom_seq_label(nudge_y=-.05)
I've now also implemented drawing links between genomes that are not adjacent to each other (https://github.com/thackl/gggenomes/issues/70#issuecomment-902998817). I will publish the updated code shortly.
Thank you very much for your help and involvement, all this code will be very useful for me later on!
Hello, first of all I would like to thank you for this package (I find a lack of packages that allow to really plot genomic data in synteny).
I am writing this post to seek some help in the particular use of gggenomes. Indeed in my data I need to realize a rather simple but particular figure.
Here are my data:
With this dataframe I managed with the ggenes package to create the following plot :
And then I would need to add synteny links between the Sn scaffolds and their corresponding Name in the Virus scaffold and get something like this :
I thing ggenomes cand do that but I did not find a way to do it this way ... Can you help me to better understand the code please ? Thanks a lot for your help.
Here is the dput table format :