shenwei356 / seqkit

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation
https://bioinf.shenwei.me/seqkit
MIT License
1.28k stars 157 forks source link

[Feature Request] Include GC content calculation for seqkit stats -a #294

Closed jolespin closed 2 years ago

jolespin commented 2 years ago

Prerequisites

Describe your issue

shenwei356 commented 2 years ago

seqkit fx2tab has a flag:

  -g, --gc                     print GC content

GC content is more meaningful to individual sequences than a file, I think.

jolespin commented 2 years ago

For metagenomics, I usually have metagenome assembled genomes split up in multiple contigs but organized as individual files. I admit, it would probably only be useful in that situation so I can see why it wouldn't be included in the stats module.

shenwei356 commented 2 years ago

Thanks for the feedback. If the file contains only one sequence, it would be easy to extract:

$ seqkit head -n 1 ../tests/hairpin.fa \
    | seqkit fx2tab -ni -g  | cut -f 2
43.43

For multiple sequences:

gc_cnt=$(cat ../tests/hairpin.fa | seqkit fx2tab -ni -C gc | csvtk summary  -Ht -f 2:sum -w 0)
sum_len=$(cat ../tests/hairpin.fa | seqkit stats -T | csvtk cut -t -f sum_len | csvtk del-header)
gc=$(echo "scale=4;" $gc_cnt / $sum_len "*" 100  | bc)

echo $gc
45.7700

No... It's too verbose, I'll add GC content to seqkit stats.

shenwei356 commented 2 years ago

Added. I think It's useful for checking bacterial assemblies.

$ seqkit stats -a ../tests/hairpin.fa
file                 format  type  num_seqs    sum_len  min_len  avg_len  max_len  Q1  Q2   Q3  sum_gap  N50  Q20(%)  Q30(%)  GC(%)
../tests/hairpin.fa  FASTA   RNA     28,645  2,949,871       39      103    2,354  76  91  111        0  101       0       0  45.77
shenwei356 commented 2 years ago

Available in v2.3.0 : https://github.com/shenwei356/seqkit/releases/tag/v2.3.0

jolespin commented 2 years ago

Awesome, thanks for adding in this feature. This will streamline my workflow as I'm trying to replace most of my sequencing manipulations/stats with seqkit.