iTaxoTools / TaxI2

Calculation and analysis of pairwise sequence distances
GNU General Public License v3.0

Implement simple sequence statistics #13

Closed mvences closed 3 years ago

mvences commented 3 years ago

One additional output file "Sequence summary statistics" should be given, for the respective input file. For this, the program will simply go through all of the sequences in the input file and provide some basic summary statistics for them, such as:

Provide this output first for the total of all sequences in the input file.

If the input is a tab file or a Genbank file, i.e., there is information on species, then provide the same information also for each species in the data set.

Maybe the information can best be provided as a table, even if the column headers will consist of rather long text...

For the N50, L50, N75, and L75 statistics, you can have a look at this Wikipedia page: https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics (note that for the purposes of TaxI3, each sequence in our input file corresponds to a "contig" in the terminology of these metrics). There are also some tools that already implement this, for instance: https://github.com/sandyjmacdonald/fast_stats and https://github.com/MikeTrizna/assembly_stats
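For illustration, here is a minimal sketch of how these statistics could be computed from a list of sequence lengths, following the definitions on the Wikipedia page linked above. The function name `nx_lx` and its `percent` parameter are hypothetical and not part of TaxI2; this is not necessarily how the linked tools do it.

```python
from typing import Sequence, Tuple


def nx_lx(lengths: Sequence[int], percent: int) -> Tuple[int, int]:
    """Return (Nx, Lx) for the given sequence lengths.

    Nx is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least `percent` % of the
    total length; Lx is the number of sequences in that set.
    Hypothetical helper, not part of TaxI2.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")

    ordered = sorted(lengths, reverse=True)  # longest first
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length, count
    # Not reachable for valid input: the full sum always meets the threshold.
    raise AssertionError("threshold was never reached")


if __name__ == "__main__":
    example = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # total length 54
    print("N50, L50:", nx_lx(example, 50))
    print("N75, L75:", nx_lx(example, 75))
```

With the nine example lengths, the three longest sequences (10 + 9 + 8 = 27) reach half of the total of 54, so N50 is 8 and L50 is 3.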

necrosovereign commented 3 years ago

How should '?' characters be treated? In particular, should they be included in the missing data percentage?

mvences commented 3 years ago

Good point, we need to be precise about this. The ? definitely counts as missing data, but it is less clear what to do with gaps, which are sometimes effectively equivalent to missing data (when used at the beginning and end of aligned sequences).

I would say, we give two percentages:

Percentage of missing data: N n ?

Percentage of missing data including gaps: N n ? -
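A minimal sketch of the two percentages, using the character sets given above. The function name `missing_percentages` is hypothetical, and the denominator is an assumption: here it is the total number of characters across all sequences, gaps included, which the thread does not specify.

```python
from typing import Iterable, Tuple

MISSING_CHARS = "Nn?"  # counted as missing data
GAP_CHAR = "-"         # additionally counted in the second percentage


def missing_percentages(sequences: Iterable[str]) -> Tuple[float, float]:
    """Return (percent missing, percent missing including gaps).

    Assumption: both percentages are taken over the total number of
    characters in all sequences, gap characters included.
    Hypothetical helper, not part of TaxI2.
    """
    seqs = list(sequences)
    total = sum(len(seq) for seq in seqs)
    if total == 0:
        return 0.0, 0.0

    missing = sum(seq.count(char) for seq in seqs for char in MISSING_CHARS)
    gaps = sum(seq.count(GAP_CHAR) for seq in seqs)
    return 100 * missing / total, 100 * (missing + gaps) / total


if __name__ == "__main__":
    # 10 characters in total: 3 missing (N, n, ?) and 2 gaps.
    print(missing_percentages(["ACGTN", "--n?A"]))
```

The same function could be called once on all sequences and once per species to produce the per-species rows.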