thackl / gggenomes

A grammar of graphics for comparative genomics
https://thackl.github.io/gggenomes/
Other
581 stars 64 forks source link

About read_gff3 and intron visualization #64

Closed xvtyzn closed 3 years ago

xvtyzn commented 3 years ago

Hi!

Thank you for creating this excellent tool. I have a question about drawing intron using read_gff3 and geom_gene.

I am interested in the gene structure of corals and tried to draw them from the attached gff file (This is data from a public database). I know that the structure is as follows based on the database.

スクリーンショット 2021-07-22 2 21 06

read_gff3("~/Desktop/aten_0.1.m1.802.gff") %>% 
            gggenomes() + geom_gene(aes(fill=type, group=type), position = "pile")

スクリーンショット 2021-07-22 2 24 43

However, the mRNA structure is different from what I expected.

I noticed that the order of the numeric in vectors in the introns column was wrong, so I did some sorting.

read_gff3("~/Desktop/aten_0.1.m1.802.gff") %>% 
            mutate(introns = map(introns, sort)) %>%
            gggenomes() + geom_gene(aes(fill=type, group=type), position = "pile")

スクリーンショット 2021-07-22 2 25 48

However, the structure is still different.

I really appreciate any help you can provide.

Keigo

aten_0.1.m1.802.gff.txt

xvtyzn commented 3 years ago

Hi

I solved the cause of this error. It is caused by mrna_exon_introns in the read_gff3 function.

mrna_exon_introns <- filter(x, type == "exon") %>% select(exon_id = feat_id, 
        start, end, feat_id = parent_ids) %>% unchop(feat_id) %>% 
        group_by(feat_id) %>% summarize(introns = list(coords2introns(start, end)))
# A tibble: 14 x 4
# Groups:   feat_id [1]
   exon_id                     start     end feat_id           
   <chr>                       <int>   <int> <chr>             
 1 aten_0.1.m1.802.m1.exon1  9916243 9919080 aten_0.1.m1.802.m1
 2 aten_0.1.m1.802.m1.exon10 9930781 9931032 aten_0.1.m1.802.m1
 3 aten_0.1.m1.802.m1.exon11 9932680 9932805 aten_0.1.m1.802.m1
 4 aten_0.1.m1.802.m1.exon12 9935662 9935778 aten_0.1.m1.802.m1
 5 aten_0.1.m1.802.m1.exon13 9935983 9936081 aten_0.1.m1.802.m1
 6 aten_0.1.m1.802.m1.exon14 9938879 9939091 aten_0.1.m1.802.m1
 7 aten_0.1.m1.802.m1.exon2  9920151 9920639 aten_0.1.m1.802.m1
 8 aten_0.1.m1.802.m1.exon3  9921529 9921776 aten_0.1.m1.802.m1
 9 aten_0.1.m1.802.m1.exon4  9922434 9922598 aten_0.1.m1.802.m1
10 aten_0.1.m1.802.m1.exon5  9924117 9924315 aten_0.1.m1.802.m1
11 aten_0.1.m1.802.m1.exon6  9924592 9924801 aten_0.1.m1.802.m1
12 aten_0.1.m1.802.m1.exon7  9927854 9927979 aten_0.1.m1.802.m1
13 aten_0.1.m1.802.m1.exon8  9928552 9928674 aten_0.1.m1.802.m1
14 aten_0.1.m1.802.m1.exon9  9930577 9930693 aten_0.1.m1.802.m1

スクリーンショット 2021-07-22 14 59 50

The problem can be solved by sorting the start column for this exon tibble using arrange function.

mrna_exon_introns <- filter(x, type == "exon") %>% select(exon_id = feat_id, 
        start, end, feat_id = parent_ids) %>% unchop(feat_id) %>% 
        group_by(feat_id) %>% arrange(start) %>% summarize(introns = list(coords2introns(start, end)))
# A tibble: 14 x 4
# Groups:   feat_id [1]
   exon_id                     start     end feat_id           
   <chr>                       <int>   <int> <chr>             
 1 aten_0.1.m1.802.m1.exon1  9916243 9919080 aten_0.1.m1.802.m1
 2 aten_0.1.m1.802.m1.exon2  9920151 9920639 aten_0.1.m1.802.m1
 3 aten_0.1.m1.802.m1.exon3  9921529 9921776 aten_0.1.m1.802.m1
 4 aten_0.1.m1.802.m1.exon4  9922434 9922598 aten_0.1.m1.802.m1
 5 aten_0.1.m1.802.m1.exon5  9924117 9924315 aten_0.1.m1.802.m1
 6 aten_0.1.m1.802.m1.exon6  9924592 9924801 aten_0.1.m1.802.m1
 7 aten_0.1.m1.802.m1.exon7  9927854 9927979 aten_0.1.m1.802.m1
 8 aten_0.1.m1.802.m1.exon8  9928552 9928674 aten_0.1.m1.802.m1
 9 aten_0.1.m1.802.m1.exon9  9930577 9930693 aten_0.1.m1.802.m1
10 aten_0.1.m1.802.m1.exon10 9930781 9931032 aten_0.1.m1.802.m1
11 aten_0.1.m1.802.m1.exon11 9932680 9932805 aten_0.1.m1.802.m1
12 aten_0.1.m1.802.m1.exon12 9935662 9935778 aten_0.1.m1.802.m1
13 aten_0.1.m1.802.m1.exon13 9935983 9936081 aten_0.1.m1.802.m1
14 aten_0.1.m1.802.m1.exon14 9938879 9939091 aten_0.1.m1.802.m1

スクリーンショット 2021-07-22 14 57 28

Thanks!

Keigo

thackl commented 3 years ago

Hi Keigo,

thanks a lot for bringing this to our attention. The intron/exon support is still quite experimental and getting feedback like this really helps! And great job on spotting the issue in the code.

Your fix is a nice one! Although, I think that ideally read_gff() should not get the order of exons wrong in the first place, i.e. read them as they are given in the gff file. Then resorting would not be necessary. I'm currently on holidays. I'll look into this once I'm back!

xvtyzn commented 3 years ago

Thank you very much!

I really like gggenomes package, and I'm willing to help test it. Have a nice holiday!

thackl commented 3 years ago

@xvtyzn just wanted to say thanks again for reporting this issue. read_gff3() now by defaults checks and if warranted sorts exon/CDS coordinates for each gene in a file based on start coordinates.

Also, your data made me aware of an issue with reading gffs produced by Augustus. The CDS annotations in those files don't follow proper gff format, and therefore are not properly read/displayed - you probably noticed that as well. I might try to find a workaround for that in the future. See #65 for more info

Cheers Thomas