Closed jolespin closed 2 years ago
seqkit fx2tab
has a flag:
-g, --gc print GC content
GC content is more meaningful to individual sequences than a file, I think.
For metagenomics, I usually have metagenome assembled genomes split up in multiple contigs but organized as individual files. I admit, it would probably only be useful in that situation so I can see why it wouldn't be included in the stats module.
Thanks for the feedback. If the file contains only one sequence, it would be easy to extract:
$ seqkit head -n 1 ../tests/hairpin.fa \
| seqkit fx2tab -ni -g | cut -f 2
43.43
For multiple sequences:
gc_cnt=$(cat ../tests/hairpin.fa | seqkit fx2tab -ni -C gc | csvtk summary -Ht -f 2:sum -w 0)
sum_len=$(cat ../tests/hairpin.fa | seqkit stats -T | csvtk cut -t -f sum_len | csvtk del-header)
gc=$(echo "scale=4;" $gc_cnt / $sum_len "*" 100 | bc)
echo $gc
45.7700
No... It's too verbose, I'll add GC content to seqkit stats
.
Added. I think It's useful for checking bacterial assemblies.
$ seqkit stats -a ../tests/hairpin.fa
file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 Q20(%) Q30(%) GC(%)
../tests/hairpin.fa FASTA RNA 28,645 2,949,871 39 103 2,354 76 91 111 0 101 0 0 45.77
Available in v2.3.0 : https://github.com/shenwei356/seqkit/releases/tag/v2.3.0
Awesome, thanks for adding in this feature. This will streamline my workflow as I'm trying to replace most of my sequencing manipulations/stats with seqkit.
Prerequisites
seqkit version
Describe your issue
[ ] describe the problem I'm trying to replace my Quast workflow with seqkit stats but the one bit that's missing is the GC calculation.
[ ] provide a reproducible example
file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 Q20(%) Q30(%) GC