thackl / gggenomes

A grammar of graphics for comparative genomics
https://thackl.github.io/gggenomes/
Other
581 stars 64 forks source link

flip() function for genes containing introns #91

Open xvtyzn opened 2 years ago

xvtyzn commented 2 years ago

Hi again,

I faced the problem of flip function against genes containing intron. This is from public database.

tmp_gg <- read_gff3("tmp.gff3.txt") %>%
  rename_seq() %>%
  unchop(parent_ids) %>%
  gggenomes() + 
  geom_gene(aes(fill=type, group=parent_ids), size = 5)  +
  scale_fill_brewer(palette = "Set1")

tmp_gg

cd551f25-2564-417c-a067-83cbd0f70345

tmp_gg %>% flip_seqs(4)

cc495749-aaa6-4b36-9582-cb7dfd2af960

The exon does not reverse correctly.

rename_seq function is the following self-made function. It is essentially run to recognize genes that are on the same genome as different bins. It is probably specific to this data and not a common function.

rename_seq <- function(gff_data, gene_list = NA){
  if(!is.na(gene_list)){
    renamed_genes <- gff_data %>%
      dplyr::filter(parent_ids %in% gene_list | 
                      feat_id %in% gene_list) 
  } else {
    renamed_genes <- gff_data %>%
      arrange(seq_id) 
  }

  renamed_gff <- renamed_genes %>%
    mutate(seq_ids_psuedo = ifelse(grepl(".m1$",parent_ids),
                                 str_replace(parent_ids, 
                                             pattern=".m1$", replacement=""), 
                                 parent_ids)) %>%
    mutate(seq_id = seq_ids_psuedo) %>%
    unchop(seq_id)
  return(renamed_gff)
}

I guess it may be related to #65, but I don't know the details. Thanks in advance.

Keigo

tmp.gff3.txt

thackl commented 2 years ago

Yep, that is definitely off! Right now, there is good and bad news.

Good news, I'm testing a fix for #65, which seems to work and should make your life easier (will upload soon).

# read_gff3 now can handle Augustus CDS
g0 <- read_gff3("issue_91/tmp.gff3.txt", fix_augustus_cds = TRUE)
# rename_seq() a bit more concise ;)
g1 <- g0 %>% mutate(seq_id = str_remove(map(parent_ids, 1), ".m1$"))

gggenomes(g1) + 
  geom_gene(aes(fill=type), size = 5)  +
  scale_fill_brewer(palette = "Set1")

image

Bad news, fixing flip() needs some more thought. The problem are the exons. All other features can simply be flipped by toggling their strand. This is also true for CDS because I merge all parts from one CDS into one feature internally. However, exons stay independent features, so instead of flipping all exons of one gene together (like all CDS parts), each exon is flipped on its own, which doesn't give the intended outcome. Need to ponder...

xvtyzn commented 2 years ago

Wow, thanks a lot.

Solving the Augustus problem is definitely great for me. Thanks for telling me about the str_remove function too!

OK, it’s seem to be difficult to take exon into account...

biubiu-1 commented 10 months ago

Hope the exon flipping issue got fixed someday. That will be really useful to illustrate the introns of eukaryotic genomes.