shenwei356 / csvtk

A cross-platform, efficient and practical CSV/TSV toolkit in Golang
http://bioinf.shenwei.me/csvtk
MIT License

sep output inconsistent when using -N vs -n and the -H flag #218

Closed by davised 1 year ago

davised commented 1 year ago

Prerequisites

$ csvtk version
csvtk v0.25.0
$ csvtk sep -h
separate column into multiple columns

Usage:
  csvtk sep [flags]

Aliases:
  sep, separate

Flags:
...
      --merge           only splits at most N times, exclusive with --drop
...
  -n, --names strings   new column names
  -N, --num-cols int    preset number of new created columns
...
  -s, --sep string      separator
...

Global Flags:
...
  -H, --no-header-row          specifies that the input CSV file does not have header row
...

Describe your issue

csvtk sep does not produce the same output when the header row is disabled with the -H option.

See examples below:

$ cat 96_well_sample.txt | csvtk sep -H -s '_' -N 3 --merge | csvtk pretty -H | head
A1_S1_L001_R1_001.fastq.gz     A1    S1    L001   R1   001.fastq.gz
A1_S1_L001_R2_001.fastq.gz     A1    S1    L001   R2   001.fastq.gz
B1_S2_L001_R1_001.fastq.gz     B1    S2    L001   R1   001.fastq.gz
B1_S2_L001_R2_001.fastq.gz     B1    S2    L001   R2   001.fastq.gz
C1_S3_L001_R1_001.fastq.gz     C1    S3    L001   R1   001.fastq.gz
C1_S3_L001_R2_001.fastq.gz     C1    S3    L001   R2   001.fastq.gz
D1_S4_L001_R1_001.fastq.gz     D1    S4    L001   R1   001.fastq.gz
D1_S4_L001_R2_001.fastq.gz     D1    S4    L001   R2   001.fastq.gz
E1_S5_L001_R1_001.fastq.gz     E1    S5    L001   R1   001.fastq.gz
E1_S5_L001_R2_001.fastq.gz     E1    S5    L001   R2   001.fastq.gz

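As you can see, -N 3 --merge is not respected: the value is split into five columns instead of three. However, if a header is added and the new columns are named with -n, the split stops at three columns as expected: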

$ cat 96_well_sample.txt | csvtk add-header -n fastq | csvtk sep -s '_' -n 'sample,sequence,suffix' --merge | csvtk pretty | head
fastq                          sample   sequence   suffix
----------------------------   ------   --------   --------------------
A1_S1_L001_R1_001.fastq.gz     A1       S1         L001_R1_001.fastq.gz
A1_S1_L001_R2_001.fastq.gz     A1       S1         L001_R2_001.fastq.gz
B1_S2_L001_R1_001.fastq.gz     B1       S2         L001_R1_001.fastq.gz
B1_S2_L001_R2_001.fastq.gz     B1       S2         L001_R2_001.fastq.gz
C1_S3_L001_R1_001.fastq.gz     C1       S3         L001_R1_001.fastq.gz
C1_S3_L001_R2_001.fastq.gz     C1       S3         L001_R2_001.fastq.gz
D1_S4_L001_R1_001.fastq.gz     D1       S4         L001_R1_001.fastq.gz
D1_S4_L001_R2_001.fastq.gz     D1       S4         L001_R2_001.fastq.gz
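
For reference, the behaviour that --merge describes (split at most N times and keep the remainder in the last column) can be sketched without csvtk. The split_at_most helper below is hypothetical and not part of csvtk. It is a minimal illustration in plain awk of what -N 3 --merge would be expected to return for one file name, assuming a literal separator string.

```shell
# Hypothetical helper (not part of csvtk): split a string on a literal
# separator into at most N fields, leaving the unsplit remainder in the
# last field. Fields are printed tab-separated.
split_at_most() {
  # $1 = input string, $2 = separator, $3 = maximum number of fields
  awk -v s="$1" -v sep="$2" -v n="$3" 'BEGIN {
    out = ""
    for (i = 1; i < n; i++) {
      p = index(s, sep)            # position of the next separator, 0 if none
      if (p == 0) break
      out = out substr(s, 1, p - 1) "\t"
      s = substr(s, p + length(sep))
    }
    print out s                    # whatever is left becomes the last field
  }'
}

# Prints three tab-separated fields: A1, S1, L001_R1_001.fastq.gz
split_at_most 'A1_S1_L001_R1_001.fastq.gz' '_' 3
```

With N set to 3 the last field keeps its underscores, which matches the output of the -n 'sample,sequence,suffix' --merge run above.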

Attachment: 96_well_sample.txt

shenwei356 commented 1 year ago

Hi Ed, that's a bug, and I've fixed it. Thanks for reporting this.

$ cat 96_well_sample.txt | csvtk sep -H -s '_' -N 3 --merge | csvtk pretty -H | head
A1_S1_L001_R1_001.fastq.gz     A1    S1    L001_R1_001.fastq.gz
A1_S1_L001_R2_001.fastq.gz     A1    S1    L001_R2_001.fastq.gz
B1_S2_L001_R1_001.fastq.gz     B1    S2    L001_R1_001.fastq.gz

$ cat 96_well_sample.txt | csvtk add-header -n fastq | csvtk sep -s '_' -n 'sample,sequence,suffix' --merge | csvtk pretty | head
fastq                          sample   sequence   suffix
----------------------------   ------   --------   --------------------
A1_S1_L001_R1_001.fastq.gz     A1       S1         L001_R1_001.fastq.gz
A1_S1_L001_R2_001.fastq.gz     A1       S1         L001_R2_001.fastq.gz
B1_S2_L001_R1_001.fastq.gz     B1       S2         L001_R1_001.fastq.gz

Please use the new binaries.