thackl / gggenomes

A grammar of graphics for comparative genomics
https://thackl.github.io/gggenomes/
Other
579 stars 64 forks source link

Focus on different region of different genomes #128

Closed LemoAlex closed 10 months ago

LemoAlex commented 1 year ago

Dear gggenomes users,

I have 6 bacterial genomes of ~3MB each, and I am interested in plotting genes which are mapped into different regions of each genome. I have my fasta file containing the 6 genomes and the gff files containg the 6 annotations. Up until here, it seems to be working :

all_seqs<-read_seqs('6genomes.fa',.id='file_id')
 all_genomes_genes <- read_gff("genomes.gff") 
plot1 <- gggenomes(seq=A118_seqs, A118_genes)
plot1

This works. However, as my gff files (obtained with prodigal) contain too many CDS, the plot is unreadable.

Anyhow, I would like to map different equivalent regions of ~60 KB , as, for instance:

genome 1 : 50000 to 110000 position genome2 : 500000 to 560000 position genome3 : 1000000 to 1060000 position etc...

These correspond to conserved regions which normally should show similar gene content.

How can this be done with the package ? I think it might be with the focus () option, but I am unsure about the procedure here.

Thank you for your help and for this very useful package,

Alexandre

thackl commented 1 year ago

Hi Alexandre,

you are on the right track. Two ways to do this. 1) You can add columns start and end to your all_seqs.

all_seqs <- mutate(start=c(50000,500000,1000000, ...), end=c(110000,560000,1060000, ...)

2) You can provide a table with locus coordinates for all your sequences (contigs) to focus.

You can use focus for this like this:

loci <- tribble(
~seq_id, ~start, ~end,
"genome1", 50000, 110000, # probably more realistic would be genomeA_contig1 or so
"genome2", 500000,  560000,
# ...
)

gggenomes(...) |> focus(.loci=loci)

Hope that helps

LemoAlex commented 1 year ago

Thank you for the answer!

I did try the first method, because the second one, with many genes in the region, might be too tedious.

So:

all_seqs <- mutate(start=c(50000,500000,1000000, ...), end=c(110000,560000,1060000, ...)

But I get the error message

Error in UseMethod("mutate") : 
  no applicable method for 'mutate' applied to an object of class "c('double', 'numeric')"

When I look at my all_seqs file, it looks like this:

file_id seq_id seq_desc length

1 6genomes g1 NA 3836532 2 6genomes g2 NA 3937483 3 6genomes g3 NA 3750370 4 6genomes g4 NA 3995103 5 6genomes g5 NA 4006609 6 6genomes g6 NA 3980852 I thought about cutting the fasta files directly and loading only the fractions of the genomes I am interested in, but I am afraid it will then be messy to format the gff files according to the positions. Cheers, Alexandre
LemoAlex commented 1 year ago

Hello again,

Sorry for the previous post, I fixed the issue by simply changing the line to:

all_seqs %>% mutate(start=c(50000,500000,1000000, ...), end=c(110000,560000,1060000, ...)

and it worked as I now have two extra columns in the all_seqs object.

However, how can I tell gggenomes to now only plot those regions rather than the entire genome ?

Cheers,

Alexandre

thackl commented 1 year ago

You need to give all_seqs with added start,end as seqs to gggenomes.

all_seqs_<-read_seqs('6genomes.fa',.id='file_id')
all_genomes_genes <- read_gff("genomes.gff") 

plot1 <- gggenomes(seq=A118_seqs, A118_genes)
plot1

all_loci <- all_seqs %>% mutate(start=c(50000,500000,1000000, ...), end=c(110000,560000,1060000, ...)

plot2 <- gggenomes(seq=all_loci,  all_genes)
plot2