Bring in gfastats --seq-report?

Astahlke commented 2 years ago

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days

Astahlke commented 2 years ago

Installed for testing in /project/ag100pest/software/gfastats For example,

$ /project/ag100pest/software/gfastats/build/bin/gfastats -h
gfastats input.[fasta|fastq|gfa][.gz] [expected genome size] [header[:start-end]]
genome size: estimated genome size for NG* statistics (optional).
header: target specific sequence by header, optionally with coordinates (optional).

Options:
-a --agp-to-path <file> converts input agp to path and replaces existing paths.
-b --out-coord a|s|c|g generates bed coordinates of given feature (agp|scaffolds|contigs|gaps default:agp).
-e --exclude-bed <file> opposite of --include-bed. They can be combined (no coordinates).
-f --input-sequence <file> input file (fasta, fastq, gfa [.gz]). Also as first positional argument.
-h --help print help and exit.
-i --include-bed <file> generates output on a subset list of headers or coordinates in 0-based bed format.
-k --swiss-army-knife <file> set of instructions provided as an ordered list.
-j --threads <n> numbers of threads (default:max).
-o --out-format fasta|fastq|gfa[.gz] outputs selected sequences. If more than the extension is provided the output is written to the specified file (e.g. out.fasta.gz).
-s --out-size s|c|g  generates size list of given feature (scaffolds|contigs|gaps default:scaffolds).
-t --tabular output in tabular format.
-v --version software version.
--cmd print $0 to stdout.
--discover-paths prototype to induce paths from input.
--homopolymer-compress <threshhold> compress all the homopolymers in the input above the given threshhold.
--line-length <n> specifies line length in when output format is fasta. Default has no line breaks.
--nstar-report generates full N* and L* statistics.
--out-sequence reports also the actual sequence (in combination with --seq-report).
--out-bubbles outputs a potential list of bubbles in the graph.
--seq-report report statistics for each sequence.
--sort ascending|descending|largest|smallest|file sort sequences according to input. Ascending/descending used the sequence/path header.
--stats report summary statistics (default).
--verbose verbose output.
--locale set a different locale, for instance to use , for thousand separators use en_US.UTF-8.

All input files can be piped from stdin using '-'.

On Diorhabda carinulata initial HiFiASM contig gfa

$ /project/ag100pest/software/gfastats/build/bin/gfastats Diorhabda_carinulata.p_ctg.gfa 350000000
+++Assembly summary+++:
Expected genome size: 350000000
# scaffolds: 0
Total scaffold length: 0
Average scaffold length: nan
Scaffold N50: 0
Scaffold auN: 0.00
Scaffold L50: 0
Scaffold NG50: 0
Scaffold auNG: 0.00
Scaffold LG50: 0
Largest scaffold: 0
Smallest scaffold: 0
# contigs: 0
Total contig length: 0
Average contig length: nan
Contig N50: 0
Contig auN: 0.00
Contig L50: 0
Contig NG50: 0
Contig auNG: 0.00
Contig LG50: 0
Largest contig: 0
Smallest contig: 0
# gaps in scaffolds: 0
Total gap length in scaffolds: 0
Average gap length in scaffolds: 0.00
Gap N50 in scaffolds: 0
Gap auN in scaffolds: 0.00
Gap L50 in scaffolds: 0
Largest gap in scaffolds: 0
Smallest gap in scaffolds: 0
Base composition (A:C:G:T): 0:0:0:0
GC content %: nan
# soft-masked bases: 0
# segments: 93
Total segment length: 417730121
Average segment length: 4491721.73
# gaps: 0
# paths: 0
# edges: 26
Average degree: 0.28
# connected components: 7
Largest connected component length: 163552
# dead ends: 165
# disconnected components: 80
Total length disconnected components: 417055355
# separated components: 87
# bubbles: 0

On Diorhabda carinulata final fasta

$ /project/ag100pest/software/gfastats/build/bin/gfastats *.fasta 350000000
+++Assembly summary+++:
Expected genome size: 350000000
# scaffolds: 25
Total scaffold length: 411193173
Average scaffold length: 16447726.92
Scaffold N50: 34144059
Scaffold auN: 35899508.27
Scaffold L50: 5
Scaffold NG50: 36256822
Scaffold auNG: 42176093.47
Scaffold LG50: 4
Largest scaffold: 67688434
Smallest scaffold: 3629
# contigs: 43
Total contig length: 411191373
Average contig length: 9562590.07
Contig N50: 23459757
Contig auN: 20916145.74
Contig L50: 8
Contig NG50: 24084167
Contig auNG: 24572967.67
Contig LG50: 6
Largest contig: 36961981
Smallest contig: 3629
# gaps in scaffolds: 18
Total gap length in scaffolds: 1800
Average gap length in scaffolds: 100.00
Gap N50 in scaffolds: 100
Gap auN in scaffolds: 100.00
Gap L50 in scaffolds: 9
Largest gap in scaffolds: 100
Smallest gap in scaffolds: 100
Base composition (A:C:G:T): 139644713:65900086:65842010:139804564
GC content %: 32.04
# soft-masked bases: 1800
# segments: 43
Total segment length: 411191373
Average segment length: 9562590.07
# gaps: 18
# paths: 25

For comparison, here's bbstats:

$ bbstats.sh Dcau_scaffolds.no_mito.no_contaminants.fasta
A   C   G   T   N   IUPAC   Other   GC  GC_stdev
0.3396  0.1603  0.1601  0.3400  0.0000  0.0000  0.0000  0.3204  0.0310

Main genome scaffold total:             25
Main genome contig total:               43
Main genome scaffold sequence total:    411.193 MB
Main genome contig sequence total:      411.191 MB      0.000% gap
Main genome scaffold N/L50:             5/34.144 MB
Main genome contig N/L50:               8/23.46 MB
Main genome scaffold N/L90:             12/14.829 MB
Main genome contig N/L90:               21/7.878 MB
Max scaffold length:                    67.688 MB
Max contig length:                      36.962 MB
Number of scaffolds > 50 KB:            15
% main genome in scaffolds > 50 KB:     99.96%

Need to decide as a group whether it's valuable to switch to this.

molikd commented 2 years ago

hmm, maybe we replace stats.sh?

molikd commented 2 years ago

From Meeting August 22nd:

[ ] create new singularity container
[ ] change over stats processes to new container

Astahlke commented 2 years ago

Is this done, per https://github.com/molikd/otb/commit/b6ae0e399a8843b30352139f9bce0f1b379e1364 ?

molikd / otb

Bring in gfastats --seq-report? #41