Open ctcncgr opened 9 months ago
Hi Connor, I haven't thought about this too much, but I think it would be nice to have somewhat more comprehensive stats. Having glanced at and quickly tested a few ready-made options for getting something along these lines, I'm leaning towards agat_sp_statistics.pl, which is part of AGAT (https://agat.readthedocs.io/en/latest/index.html). Here's a snippet of its output for ann1 vs ann2 on arahy.Tifrunner.gnm2 (I just pasted the individual results together to get a side-by-side comparison and satisfy my own curiosity):

Compute mrna with isoforms if any

| Statistic | ann1 | ann2 |
| --- | --- | --- |
| Number of genes | 67005 | 81717 |
| Number of mrnas | 84519 | 81717 |
| Number of mrnas with utr both sides | 40230 | 45192 |
| Number of mrnas with at least one utr | 58797 | 51453 |
| Number of cdss | 84519 | 81717 |
| Number of exons | 543827 | 424835 |
| Number of five_prime_utrs | 48482 | 47678 |
| Number of three_prime_utrs | 50545 | 48967 |
| Number of exon in cds | 501220 | 395401 |
| Number of exon in five_prime_utr | 69934 | 64811 |
| Number of exon in three_prime_utr | 71327 | 60798 |
| Number of intron in cds | 416701 | 313684 |
| Number of intron in exon | 459308 | 343118 |
| Number of intron in five_prime_utr | 21452 | 17133 |
| Number of intron in three_prime_utr | 20782 | 11831 |
| Number gene overlapping | 2619 | 6252 |
| Number of single exon gene | 4342 | 10607 |
| Number of single exon mrna | 4342 | 10607 |
| mean mrnas per gene | 1.3 | 1.0 |
| mean cdss per mrna | 1.0 | 1.0 |
| mean exons per mrna | 6.4 | 5.2 |
| mean five_prime_utrs per mrna | 0.6 | 0.6 |
| mean three_prime_utrs per mrna | 0.6 | 0.6 |
| mean exons per cds | 5.9 | 4.8 |
| mean exons per five_prime_utr | 1.4 | 1.4 |
| mean exons per three_prime_utr | 1.4 | 1.2 |
| mean introns in cdss per mrna | 4.9 | 3.8 |
| mean introns in exons per mrna | 5.4 | 4.2 |
| mean introns in five_prime_utrs per mrna | 0.3 | 0.2 |
| mean introns in three_prime_utrs per mrna | 0.2 | 0.1 |
| Total gene length | 262875621 | 303584458 |
| Total mrna length | 352990253 | 303584458 |
| Total cds length | 102403200 | 89246482 |
| Total exon length | 153915895 | 130524008 |
| Total five_prime_utr length | 19543492 | 16312458 |
| Total three_prime_utr length | 31969203 | 24965068 |
| Total intron length per cds | 180667164 | 150143918 |
| Total intron length per exon | 199074358 | 173060450 |
| Total intron length per five_prime_utr | 10519071 | 13710523 |
| Total intron length per three_prime_utr | 7653779 | 8813277 |
| mean gene length | 3923 | 3715 |
| mean mrna length | 4176 | 3715 |
| mean cds length | 1211 | 1092 |
| mean exon length | 283 | 307 |
| mean five_prime_utr length | 403 | 342 |
| mean three_prime_utr length | 632 | 509 |
| ... | ... | ... |
| Longest gene | 342359 | 342359 |
| Longest mrna | 342359 | 342359 |
| Longest cds | 16374 | 16272 |
| Longest exon | 14759 | 72007 |
| Longest five_prime_utr | 15289 | 54108 |
| Longest three_prime_utr | 15367 | 51521 |
| Longest cds piece | 7977 | 7977 |
| Longest five_prime_utr piece | 14561 | 54108 |
| Longest three_prime_utr piece | 9844 | 51521 |
| Longest intron into cds part | 177377 | 192003 |
| Longest intron into exon part | 177377 | 194085 |
| Longest intron into five_prime_utr part | 10997 | 129751 |
| Longest intron into three_prime_utr part | 9921 | 194085 |
| Shortest gene | 163 | 102 |
| Shortest mrna | 163 | 102 |
| Shortest cds | 75 | 78 |
| Shortest exon | 3 | 1 |
| Shortest five_prime_utr | 1 | 1 |
| Shortest three_prime_utr | 1 | 1 |
| Shortest cds piece | 1 | 1 |
| Shortest five_prime_utr piece | 1 | 1 |
| Shortest three_prime_utr piece | 1 | 1 |
| Shortest intron into cds part | 4 | 4 |
| Shortest intron into exon part | 4 | 4 |
| Shortest intron into five_prime_utr part | 5 | 17 |
| Shortest intron into three_prime_utr part | 12 | 18 |
| ... | ... | ... |

Arguably that's more info than we'd want to cram into a DSCensor report, but we could always be more selective about what we expose there. One nice thing about this tool is that it seems to infer features such as exons and introns even when they are not explicit in the file (e.g. if you only have CDS and UTRs). It also doesn't seem to be as fussy as some tools about the validation it requires.
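To make the inference idea concrete, here is a minimal sketch of how exons and introns can be derived from CDS and UTR pieces alone. This is not AGAT's actual code, just an illustration of the general approach; the function names `merge_into_exons` and `infer_introns` and the coordinates are made up for the example, and intervals are assumed to be 1-based and inclusive, as in GFF.

```python
def merge_into_exons(pieces):
    """Merge overlapping or abutting (start, end) pieces (CDS, UTR) into exons.

    Coordinates are assumed to be 1-based and inclusive, GFF style.
    """
    exons = []
    for start, end in sorted(pieces):
        if exons and start <= exons[-1][1] + 1:
            # Piece touches or overlaps the previous exon: extend it.
            exons[-1] = (exons[-1][0], max(exons[-1][1], end))
        else:
            exons.append((start, end))
    return exons


def infer_introns(exons):
    """Return the gaps between consecutive exons of one transcript."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Hypothetical transcript with only UTR and CDS features (not real data).
pieces = [
    (100, 150),  # five_prime_utr
    (151, 200),  # CDS
    (301, 400),  # CDS
    (501, 600),  # CDS
    (601, 650),  # three_prime_utr
]
exons = merge_into_exons(pieces)
introns = infer_introns(exons)
print(exons)    # [(100, 200), (301, 400), (501, 650)]
print(introns)  # [(201, 300), (401, 500)]
```

A real implementation would also need to group features by parent transcript and respect strand and sequence ID, which this sketch leaves out.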
We could (and probably should) bring others into the conversation, but I wanted to at least give you something to chew on for starters.
We need to come up with some stats that we want to display for the gene_models files that are consumed.
It's pretty easy to count the values in field 3 (the feature type in a GFF file) and just report those, but should we do more?
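For reference, the simple field 3 count could look something like the sketch below. It assumes the gene_models files are tab-delimited GFF with nine columns; the function name `count_feature_types` and the example records are hypothetical, not taken from any real annotation.

```python
from collections import Counter


def count_feature_types(lines):
    """Count the values in GFF column 3 (feature type).

    Comment lines, blank lines, and records with fewer than
    nine tab-separated columns are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue
        counts[fields[2]] += 1
    return counts


# Made-up records for illustration only.
example_lines = [
    "##gff-version 3",
    "\t".join(["chr1", "src", "gene", "1", "900", ".", "+", ".", "ID=g1"]),
    "\t".join(["chr1", "src", "mRNA", "1", "900", ".", "+", ".", "ID=m1;Parent=g1"]),
    "\t".join(["chr1", "src", "exon", "1", "300", ".", "+", ".", "Parent=m1"]),
    "\t".join(["chr1", "src", "exon", "601", "900", ".", "+", ".", "Parent=m1"]),
    "\t".join(["chr1", "src", "CDS", "50", "300", ".", "+", "0", "Parent=m1"]),
    "\t".join(["chr1", "src", "CDS", "601", "850", ".", "+", "1", "Parent=m1"]),
]
counts = count_feature_types(example_lines)
for feature_type, n in sorted(counts.items()):
    print(feature_type, n)
```

For a file on disk, the same function can be fed an open file handle, e.g. `count_feature_types(open(path))`, since it only iterates over lines.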