legumeinfo / LIS-autocontent

Scrapes the LIS datastore and populates various configs and databases for deployment
Apache License 2.0

Stats for gene_models_main Files #31

Open ctcncgr opened 9 months ago

ctcncgr commented 9 months ago

We need to come up with some stats that we want to display for the gene_models files that are consumed.

It's pretty easy to count the values in field 3 (the feature type column) and just report those, but should we do more?
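For reference, the field 3 count mentioned above could look roughly like the sketch below. It assumes the gene_models files are tab-delimited GFF3, with the feature type in the third column; the function name `count_feature_types` and the toy records are made up for illustration and are not taken from LIS-autocontent or any LIS file.

```python
from collections import Counter


def count_feature_types(lines):
    """Count the values in column 3 (feature type) of GFF3-style lines.

    Comment lines (starting with '#') and blank lines are skipped.
    Lines with fewer than 3 tab-separated fields are ignored.
    """
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


# Hypothetical toy records, for illustration only.
EXAMPLE_GFF = "\n".join([
    "##gff-version 3",
    "chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=g1",
    "chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=g1.1;Parent=g1",
    "chr1\tsrc\texon\t1\t300\t.\t+\t.\tParent=g1.1",
    "chr1\tsrc\texon\t600\t900\t.\t+\t.\tParent=g1.1",
    "chr1\tsrc\tCDS\t1\t300\t.\t+\t0\tParent=g1.1",
    "chr1\tsrc\tCDS\t600\t900\t.\t+\t0\tParent=g1.1",
])

if __name__ == "__main__":
    for feature_type, n in sorted(count_feature_types(EXAMPLE_GFF.splitlines()).items()):
        print(f"{feature_type}\t{n}")
```

For a real file, any iterable of text lines works, e.g. a handle from `gzip.open(path, "rt")` if the file is compressed.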

adf-ncgr commented 8 months ago

Hi Connor, I haven't thought about this too much, but I think it would be nice to have somewhat more comprehensive stats. After glancing at and quick-testing a few ready-made options, I'm leaning towards agat_sp_statistics.pl, which is part of AGAT (https://agat.readthedocs.io/en/latest/index.html). Here's a snippet of its output for ann1 vs ann2 on arahy.Tifrunner.gnm2 (I pasted the two individual results together to get a side-by-side comparison and satisfy my own curiosity):

Compute mrna with isoforms if any

Metric                                       ann1       ann2
Number of genes                              67005      81717
Number of mrnas                              84519      81717
Number of mrnas with utr both sides          40230      45192
Number of mrnas with at least one utr        58797      51453
Number of cdss                               84519      81717
Number of exons                              543827     424835
Number of five_prime_utrs                    48482      47678
Number of three_prime_utrs                   50545      48967
Number of exon in cds                        501220     395401
Number of exon in five_prime_utr             69934      64811
Number of exon in three_prime_utr            71327      60798
Number of intron in cds                      416701     313684
Number of intron in exon                     459308     343118
Number of intron in five_prime_utr           21452      17133
Number of intron in three_prime_utr          20782      11831
Number gene overlapping                      2619       6252
Number of single exon gene                   4342       10607
Number of single exon mrna                   4342       10607
mean mrnas per gene                          1.3        1.0
mean cdss per mrna                           1.0        1.0
mean exons per mrna                          6.4        5.2
mean five_prime_utrs per mrna                0.6        0.6
mean three_prime_utrs per mrna               0.6        0.6
mean exons per cds                           5.9        4.8
mean exons per five_prime_utr                1.4        1.4
mean exons per three_prime_utr               1.4        1.2
mean introns in cdss per mrna                4.9        3.8
mean introns in exons per mrna               5.4        4.2
mean introns in five_prime_utrs per mrna     0.3        0.2
mean introns in three_prime_utrs per mrna    0.2        0.1
Total gene length                            262875621  303584458
Total mrna length                            352990253  303584458
Total cds length                             102403200  89246482
Total exon length                            153915895  130524008
Total five_prime_utr length                  19543492   16312458
Total three_prime_utr length                 31969203   24965068
Total intron length per cds                  180667164  150143918
Total intron length per exon                 199074358  173060450
Total intron length per five_prime_utr       10519071   13710523
Total intron length per three_prime_utr      7653779    8813277
mean gene length                             3923       3715
mean mrna length                             4176       3715
mean cds length                              1211       1092
mean exon length                             283        307
mean five_prime_utr length                   403        342
mean three_prime_utr length                  632        509
...
Longest gene                                 342359     342359
Longest mrna                                 342359     342359
Longest cds                                  16374      16272
Longest exon                                 14759      72007
Longest five_prime_utr                       15289      54108
Longest three_prime_utr                      15367      51521
Longest cds piece                            7977       7977
Longest five_prime_utr piece                 14561      54108
Longest three_prime_utr piece                9844       51521
Longest intron into cds part                 177377     192003
Longest intron into exon part                177377     194085
Longest intron into five_prime_utr part      10997      129751
Longest intron into three_prime_utr part     9921       194085
Shortest gene                                163        102
Shortest mrna                                163        102
Shortest cds                                 75         78
Shortest exon                                3          1
Shortest five_prime_utr                      1          1
Shortest three_prime_utr                     1          1
Shortest cds piece                           1          1
Shortest five_prime_utr piece                1          1
Shortest three_prime_utr piece               1          1
Shortest intron into cds part                4          4
Shortest intron into exon part               4          4
Shortest intron into five_prime_utr part     5          17
Shortest intron into three_prime_utr part    12         18
...

This is arguably more info than we'd want to cram into a DSCensor report, but we could always be more selective about what we expose there. One nice thing about this tool is that it seems to infer features such as exons and introns even when they are not explicit in the file (e.g. if you only have CDS and UTRs). Also, it doesn't seem to be as fussy as some tools about validating its input.
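To illustrate what "being more selective" might look like, here is a minimal sketch that parses a report laid out like the snippet above (a label, whitespace, then a number at the end of the line) and keeps only a chosen subset. The layout is assumed from the pasted output only; a full AGAT report may contain other sections this does not handle. The names `parse_stats`, `select_stats`, and `WANTED` are hypothetical and not part of AGAT or DSCensor.

```python
import re

# A stats line is assumed to be: label, whitespace, then one number at end of line.
STAT_LINE = re.compile(r"^(?P<label>.*\S)\s+(?P<value>\d+(?:\.\d+)?)\s*$")


def parse_stats(text):
    """Parse 'label   value' lines into a dict; lines without a trailing number are skipped."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line)
        if not match:
            continue
        raw = match.group("value")
        stats[match.group("label")] = float(raw) if "." in raw else int(raw)
    return stats


def select_stats(stats, wanted):
    """Keep only the requested labels, in the requested order, ignoring missing ones."""
    return {label: stats[label] for label in wanted if label in stats}


# Hypothetical subset one might choose to expose in a report.
WANTED = ["Number of genes", "Number of mrnas", "mean exons per mrna"]

# A few lines copied from the ann1 column of the snippet above.
EXAMPLE_REPORT = """\
Compute mrna with isoforms if any

Number of genes                              67005
Number of mrnas                              84519
Number of exons                              543827
mean exons per mrna                          6.4
"""

if __name__ == "__main__":
    for label, value in select_stats(parse_stats(EXAMPLE_REPORT), WANTED).items():
        print(f"{label}: {value}")
```

Keeping the wanted labels in a plain list would make it easy to adjust what gets exposed without touching the parsing step.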

We could (and probably should) bring others into the conversation, but I wanted to at least give you something to chew on for starters.