Open BethYates opened 1 week ago
After discussing this with the ToL Genome Notes editor, they would like to see OMArk (https://omark.omabrowser.org/) included in the set of tools used to evaluate an annotation set.
To close this issue:
Add a parameter, --annotation_set, which takes a file path to a directory containing Ensembl gene annotation. This directory should have files consistent with the output directory from running the sanger-tol/ensemblgenedownload pipeline.
Add a sub workflow, annotation_stats, and configure it to run only if the --annotation_set parameter is passed. It should generate the following statistics:
- TRANSC_MRNA: the number of transcribed mRNAs
- PCG: the number of protein coding genes
- NCG: the number of non-coding genes
- CDS_PER_GENE: the average number of coding transcripts per gene
- EXONS_PER_TRANSC: the average number of exons per transcript
- CDS_LENGTH: the average length of coding sequence
- EXON_SIZE: the average length of a coding exon
- INTRON_SIZE: the average length of a coding intron
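
As a rough illustration of how a few of these values might be derived, here is a minimal Python sketch. It assumes the annotation is available as GFF3 (the issue does not say which file in the directory would be used), that coding transcripts are features of type mRNA, and that "coding exon" means an exon whose parent is an mRNA. The function names are hypothetical, and only three of the listed statistics are covered.

```python
from collections import defaultdict
from statistics import mean


def parse_attributes(field):
    """Turn a GFF3 attribute column ('ID=x;Parent=y') into a dict."""
    return dict(item.split("=", 1) for item in field.split(";") if "=" in item)


def annotation_stats(gff3_lines):
    """Compute TRANSC_MRNA, EXONS_PER_TRANSC and EXON_SIZE from GFF3 lines.

    Assumption: mRNA features mark coding transcripts, and exon features
    point at their transcript through the Parent attribute.
    """
    mrna_ids = set()
    exons = []  # one (parent transcript ID, exon length) pair per exon/parent link
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        feature_type, start, end = cols[2], int(cols[3]), int(cols[4])
        attrs = parse_attributes(cols[8])
        if feature_type == "mRNA" and "ID" in attrs:
            mrna_ids.add(attrs["ID"])
        elif feature_type == "exon":
            for parent in attrs.get("Parent", "").split(","):
                if parent:
                    # GFF3 coordinates are 1-based and inclusive
                    exons.append((parent, end - start + 1))

    # Exons per transcript, over every transcript that has at least one exon
    exon_counts = defaultdict(int)
    for parent, _ in exons:
        exon_counts[parent] += 1

    # Exon lengths restricted to exons of mRNAs (our reading of "coding exon")
    mrna_exon_lengths = [length for parent, length in exons if parent in mrna_ids]

    return {
        "TRANSC_MRNA": len(mrna_ids),
        "EXONS_PER_TRANSC": mean(exon_counts.values()) if exon_counts else 0,
        "EXON_SIZE": mean(mrna_exon_lengths) if mrna_exon_lengths else 0,
    }
```

The remaining statistics (PCG, NCG, CDS_PER_GENE, CDS_LENGTH, INTRON_SIZE) would need gene biotypes and CDS features, whose exact representation depends on the files the pipeline provides.
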
Write these statistics to a CSV file, annotation_statistics. The CSV file should have the following columns:
Variable,Value
where Variable is the name of the variable from the list above and Value is the statistic you have generated.
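
A minimal sketch of the Variable,Value layout described above, using Python's standard csv module. The function name is hypothetical, and the numbers in the usage lines are placeholders chosen only to show the file layout, not real results.

```python
import csv
import io


def write_annotation_statistics(stats, handle):
    """Write a dict of statistics as two-column CSV: Variable,Value."""
    writer = csv.writer(handle)
    writer.writerow(["Variable", "Value"])
    for variable, value in stats.items():
        writer.writerow([variable, value])


# Usage with placeholder values (not real results)
example = {"TRANSC_MRNA": 3, "PCG": 2, "NCG": 1}
buffer = io.StringIO()
write_annotation_statistics(example, buffer)
print(buffer.getvalue())
```

To write to disk instead, the same function can be given a file opened with `open(path, "w", newline="")`; the exact file name and extension are not specified in this issue.
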
We want to be able to include some standard basic statistics on the gene/protein annotation set for an assembly in a genome note. This sub workflow should accept an annotation set and calculate some statistics (exact values still to be determined, but most likely things like the number of protein coding genes, number of non-coding genes, exons per transcript, etc., as well as BUSCO scores).
This could be a standalone pipeline or could be added to either the genomenote pipeline or to the ensemblgenedownload pipeline, although it may not always be Ensembl that provides the annotations.