Implement simple sequence statistics

mvences commented 3 years ago

The program should produce one additional output file, "Sequence summary statistics", for each input file. For this, the program will simply go through all of the sequences in the input file and report some basic summary statistics for them, such as:

total number of sequences
number of sequences of less than 100 bp (= nucleotides / characters, not counting gaps, i.e., dashes)
number of sequences of 101-300 bp (not counting dashes)
number of sequences of 301-1000 bp (not counting dashes)
number of sequences >1000 bp (not counting dashes)
minimum, maximum, mean, median, standard deviation of sequence length (not counting dashes)
some specific DNA sequence metrics: N50, L50, N75, L75 (see below for specifications; again, not counting dashes)
Total length of all sequences (sum of all sequence lengths) (again, not counting dashes)
percentage of each of the bases A, C, T, G in the sequences
GC content (percentage of G+C in the sequences)
percentage of N (missing data)
percentage of ambiguity codes R+Y+S+W+K+M (only one percentage for all of them together)
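
To make the list above concrete, here is a minimal Python sketch of the length and composition statistics. It is not the TaxI implementation; the function names (`ungapped`, `length_summary`, `composition`) are hypothetical. It also makes a few assumptions that the list leaves open: a sequence of exactly 100 bp goes into the first bin, the standard deviation is the sample standard deviation, lower-case letters are counted like upper-case ones, and all percentages use the total number of non-gap characters as the denominator.

```python
import statistics
from collections import Counter


def ungapped(seq: str) -> str:
    """Return the sequence in upper case with gap characters (dashes) removed."""
    return seq.replace("-", "").upper()


def length_summary(seqs):
    """Length statistics over ungapped sequence lengths."""
    lengths = [len(ungapped(s)) for s in seqs]
    if not lengths:
        raise ValueError("no sequences given")
    return {
        "total_sequences": len(lengths),
        # Assumption: exactly 100 bp is counted in the first bin.
        "up_to_100": sum(n <= 100 for n in lengths),
        "101_to_300": sum(101 <= n <= 300 for n in lengths),
        "301_to_1000": sum(301 <= n <= 1000 for n in lengths),
        "over_1000": sum(n > 1000 for n in lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        # Assumption: sample standard deviation; 0.0 for a single sequence.
        "stdev": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
        "total_length": sum(lengths),
    }


def composition(seqs):
    """Percentages of bases, GC, N and ambiguity codes among non-gap characters."""
    counts = Counter()
    for s in seqs:
        counts.update(ungapped(s))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no non-gap characters")

    def pct(chars):
        return 100.0 * sum(counts[c] for c in chars) / total

    result = {base: pct(base) for base in "ACTG"}
    result["GC"] = pct("GC")
    result["N"] = pct("N")
    result["ambiguity"] = pct("RYSWKM")  # one combined percentage
    return result
```

Each function returns a plain dictionary, so the values could be written out as one table row per data set.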

Provide this output first for the total of all sequences in the input file.

If the input is a tab file or a GenBank file, i.e., there is information on species, then provide the same statistics also for each species in the data set.
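
A possible way to get both the overall and the per-species output is to group the sequences by species and run the same summary function on each group. The sketch below assumes the input has already been parsed into `(species, sequence)` pairs; that record shape and the names `group_by_species` and `summarize_total_and_per_species` are hypothetical, not taken from the TaxI code.

```python
from collections import defaultdict


def group_by_species(records):
    """Map each species name to the list of its sequences."""
    groups = defaultdict(list)
    for species, seq in records:
        groups[species].append(seq)
    return dict(groups)


def summarize_total_and_per_species(records, summarize):
    """Apply `summarize` to all sequences and to each species separately.

    `summarize` is any function that takes a list of sequences,
    for example one that computes the statistics listed above.
    """
    records = list(records)
    total = summarize([seq for _, seq in records])
    per_species = {
        species: summarize(seqs)
        for species, seqs in group_by_species(records).items()
    }
    return total, per_species
```

Returning the total separately from the per-species dictionary avoids a clash if a species label happens to be something like "all".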

Maybe the information can best be provided as a table, even if the column headers will consist of rather long text...

For the N50, L50, N75, L75 statistics, you can have a look at this Wikipedia page: https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics (note that for the purpose of TaxI3, each sequence in our input file corresponds to a "contig" in the terminology of these metrics). There are also some tools that already implement this, for instance: https://github.com/sandyjmacdonald/fast_stats https://github.com/MikeTrizna/assembly_stats
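
As a rough sketch of these metrics (not the code of either linked tool): sort the ungapped sequence lengths from longest to shortest and add them up until the running sum reaches the requested percentage of the total length. The length of the sequence at which that happens is Nx, and the number of sequences added so far is Lx. The function name `nx_lx` is hypothetical, and returning `(0, 0)` for empty input is an assumption.

```python
def nx_lx(lengths, percent):
    """Return (Nx, Lx) for a list of sequence lengths.

    Nx is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least `percent` % of the
    total length; Lx is the number of sequences in that set.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length, count
    return 0, 0  # assumption: empty input yields zeros


lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # total length 54
n50, l50 = nx_lx(lengths, 50)  # 10 + 9 + 8 = 27, which is half of 54
n75, l75 = nx_lx(lengths, 75)  # 10 + 9 + 8 + 7 + 6 + 5 = 45 >= 40.5
```

With these nine lengths the sketch gives N50 = 8, L50 = 3, N75 = 5 and L75 = 6.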

necrosovereign commented 3 years ago

How should '?' characters be treated? In particular, should they be included in the missing data percentage?

mvences commented 3 years ago

Good point, we need to be precise about this. The ? definitely counts as missing data, but it is less clear what to do with gaps, which are sometimes effectively equivalent to missing data (if used at the beginning and end of aligned sequences).

I would say, we give two percentages:

Percentage of missing data: N n ?
Percentage of missing data including gaps: N n ? -
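
A minimal sketch of the two percentages follows. The denominators are not specified in this thread, so the choice here is an assumption: the first percentage is taken relative to the non-gap characters, and the second relative to all characters including gaps. The function name `missing_percentages` is hypothetical.

```python
def missing_percentages(seqs):
    """Return (percent missing, percent missing including gaps).

    Missing characters are N, n and ?. Gaps are dashes.
    """
    total = gaps = missing = 0
    for s in seqs:
        total += len(s)
        gaps += s.count("-")
        missing += sum(s.count(c) for c in "Nn?")
    non_gap = total - gaps
    # Assumption: gaps are excluded from the denominator of the first value.
    pct_missing = 100.0 * missing / non_gap if non_gap else 0.0
    # Assumption: the second value is relative to all characters, gaps included.
    pct_with_gaps = 100.0 * (missing + gaps) / total if total else 0.0
    return pct_missing, pct_with_gaps
```

For the single sequence `ACGTNn?--A` this gives 37.5 (3 of 8 non-gap characters) and 50.0 (5 of 10 characters).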

iTaxoTools / TaxI2

Implement simple sequence statistics #13