shenwei356 / csvtk

A cross-platform, efficient and practical CSV/TSV toolkit in Golang
http://bioinf.shenwei.me/csvtk
MIT License

mutate2 converts E[0-9] to 0 in output column #219

Closed davised closed 1 year ago

davised commented 1 year ago


Describe your issue

$ cat 96_well_sample.txt | csvtk add-header -n fastq | csvtk sep -n 'sample,sequence,L001,suffix' --merge -f 1 -s '_' | csvtk cut -f fastq,sample,suffix | csvtk pretty | head
fastq                          sample   suffix
----------------------------   ------   ---------------
E_S1_L001_R1_001.fastq.gz      E        R1_001.fastq.gz
E0_S1_L001_R2_001.fastq.gz     E0       R2_001.fastq.gz
A1_S1_L001_R1_001.fastq.gz     A1       R1_001.fastq.gz
A1_S1_L001_R2_001.fastq.gz     A1       R2_001.fastq.gz
B1_S2_L001_R1_001.fastq.gz     B1       R1_001.fastq.gz
B1_S2_L001_R2_001.fastq.gz     B1       R2_001.fastq.gz
C1_S3_L001_R1_001.fastq.gz     C1       R1_001.fastq.gz
C1_S3_L001_R2_001.fastq.gz     C1       R2_001.fastq.gz

Here's what happens when I use mutate2 to join the columns:

$ cat 96_well_sample.txt | csvtk add-header -n fastq | csvtk sep -n 'sample,sequence,L001,suffix' --merge -f 1 -s '_' | csvtk cut -f fastq,sample,suffix | csvtk mutate2 -n 'output' -e '${sample} + "_" + ${suffix}' | csvtk pretty | head
fastq                          sample   suffix            output
----------------------------   ------   ---------------   -------------------
E_S1_L001_R1_001.fastq.gz      E        R1_001.fastq.gz   E_R1_001.fastq.gz
E0_S1_L001_R2_001.fastq.gz     E0       R2_001.fastq.gz   0_R2_001.fastq.gz
A1_S1_L001_R1_001.fastq.gz     A1       R1_001.fastq.gz   A1_R1_001.fastq.gz
A1_S1_L001_R2_001.fastq.gz     A1       R2_001.fastq.gz   A1_R2_001.fastq.gz
B1_S2_L001_R1_001.fastq.gz     B1       R1_001.fastq.gz   B1_R1_001.fastq.gz
B1_S2_L001_R2_001.fastq.gz     B1       R2_001.fastq.gz   B1_R2_001.fastq.gz
C1_S3_L001_R1_001.fastq.gz     C1       R1_001.fastq.gz   C1_R1_001.fastq.gz
C1_S3_L001_R2_001.fastq.gz     C1       R2_001.fastq.gz   C1_R2_001.fastq.gz

Here are all of the E rows:

$ cat 96_well_sample.txt | csvtk add-header -n fastq | csvtk sep -n 'sample,sequence,L001,suffix' --merge -f 1 -s '_' | csvtk cut -f fastq,sample,suffix | csvtk mutate2 -n 'output' -e '${sample} + "_" + ${suffix}' | csvtk pretty | grep -E '^E'
E_S1_L001_R1_001.fastq.gz      E        R1_001.fastq.gz   E_R1_001.fastq.gz
E0_S1_L001_R2_001.fastq.gz     E0       R2_001.fastq.gz   0_R2_001.fastq.gz
E1_S5_L001_R1_001.fastq.gz     E1       R1_001.fastq.gz   0_R1_001.fastq.gz
E1_S5_L001_R2_001.fastq.gz     E1       R2_001.fastq.gz   0_R2_001.fastq.gz
E2_S13_L001_R1_001.fastq.gz    E2       R1_001.fastq.gz   0_R1_001.fastq.gz
E2_S13_L001_R2_001.fastq.gz    E2       R2_001.fastq.gz   0_R2_001.fastq.gz
E3_S21_L001_R1_001.fastq.gz    E3       R1_001.fastq.gz   0_R1_001.fastq.gz
E3_S21_L001_R2_001.fastq.gz    E3       R2_001.fastq.gz   0_R2_001.fastq.gz
E4_S29_L001_R1_001.fastq.gz    E4       R1_001.fastq.gz   0_R1_001.fastq.gz
E4_S29_L001_R2_001.fastq.gz    E4       R2_001.fastq.gz   0_R2_001.fastq.gz
E5_S37_L001_R1_001.fastq.gz    E5       R1_001.fastq.gz   0_R1_001.fastq.gz
E5_S37_L001_R2_001.fastq.gz    E5       R2_001.fastq.gz   0_R2_001.fastq.gz
E6_S45_L001_R1_001.fastq.gz    E6       R1_001.fastq.gz   0_R1_001.fastq.gz
E6_S45_L001_R2_001.fastq.gz    E6       R2_001.fastq.gz   0_R2_001.fastq.gz
E7_S53_L001_R1_001.fastq.gz    E7       R1_001.fastq.gz   0_R1_001.fastq.gz
E7_S53_L001_R2_001.fastq.gz    E7       R2_001.fastq.gz   0_R2_001.fastq.gz
E8_S61_L001_R1_001.fastq.gz    E8       R1_001.fastq.gz   0_R1_001.fastq.gz
E8_S61_L001_R2_001.fastq.gz    E8       R2_001.fastq.gz   0_R2_001.fastq.gz
E9_S69_L001_R1_001.fastq.gz    E9       R1_001.fastq.gz   0_R1_001.fastq.gz
E9_S69_L001_R2_001.fastq.gz    E9       R2_001.fastq.gz   0_R2_001.fastq.gz
E10_S77_L001_R1_001.fastq.gz   E10      R1_001.fastq.gz   0_R1_001.fastq.gz
E10_S77_L001_R2_001.fastq.gz   E10      R2_001.fastq.gz   0_R2_001.fastq.gz
E11_S85_L001_R1_001.fastq.gz   E11      R1_001.fastq.gz   0_R1_001.fastq.gz
E11_S85_L001_R2_001.fastq.gz   E11      R2_001.fastq.gz   0_R2_001.fastq.gz
E12_S93_L001_R1_001.fastq.gz   E12      R1_001.fastq.gz   0_R1_001.fastq.gz
E12_S93_L001_R2_001.fastq.gz   E12      R2_001.fastq.gz   0_R2_001.fastq.gz

I've used this software quite a bit, but this is the first large bug I've found. I can use awk in the meantime to join the outputs.
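The awk workaround mentioned above could look roughly like the sketch below. It is an illustration, not part of the original report: the function name join_sample_suffix is hypothetical, and it assumes plain comma-separated input with a header row and no quoted fields containing commas. awk concatenates the fields as strings, so values such as E1 are copied through unchanged.

```shell
# Hypothetical helper (not from the issue): append an "output" column that
# joins column 2 (sample) and column 3 (suffix) with "_".
# Assumes simple CSV: comma-separated, header in row 1, no quoted commas.
join_sample_suffix() {
    awk -F',' 'BEGIN { OFS = "," }
        NR == 1 { print $0, "output"; next }   # header row: add the new column name
        { print $0, $2 "_" $3 }                # data rows: plain string concatenation
    '
}

# Example input built from two rows shown in the issue.
printf '%s\n' \
    'fastq,sample,suffix' \
    'E1_S5_L001_R1_001.fastq.gz,E1,R1_001.fastq.gz' \
    'A1_S1_L001_R1_001.fastq.gz,A1,R1_001.fastq.gz' |
    join_sample_suffix
```

In the pipeline from the issue, this would take the place of the mutate2 step, reading the CSV produced by csvtk cut.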

Thank you for this software.

shenwei356 commented 1 year ago

It's a bug: E1 was wrongly treated as a number in scientific notation. With the old version, you can switch on the -s, --numeric-as-string flag to avoid this. Either way, I've fixed it.

Please use the new binaries here
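To illustrate how a label like E1 can be mistaken for scientific notation, here is a minimal sketch using grep -E. Both patterns are assumptions made up for this example and are not csvtk's actual code: the loose one allows an empty mantissa before the exponent, the strict one requires at least one digit there.

```shell
# Hypothetical over-permissive check: the mantissa may be empty, so a bare
# exponent such as "E1" is accepted as a number. Not csvtk's real pattern.
looks_numeric_loose() {
    printf '%s\n' "$1" | grep -Eq '^[-+]?[0-9]*\.?[0-9]*([eE][-+]?[0-9]+)?$'
}

# Hypothetical stricter check: at least one digit is required in the mantissa.
looks_numeric_strict() {
    printf '%s\n' "$1" | grep -Eq '^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?$'
}

# Compare both checks on sample labels from the issue plus a real number.
for v in E E1 E12 A1 1E1; do
    loose=no
    strict=no
    if looks_numeric_loose "$v"; then loose=yes; fi
    if looks_numeric_strict "$v"; then strict=yes; fi
    printf '%s\tloose=%s\tstrict=%s\n' "$v" "$loose" "$strict"
done
```

Under these assumed patterns, E1 and E12 pass only the loose check, while E and A1 pass neither. That matches the rows in the report, where E and A1 survived and E0 through E12 did not. If an empty mantissa is then read as zero, the value evaluates to 0, which would explain the 0 in the output column, though that is one plausible reading rather than a confirmed account of the bug.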

davised commented 1 year ago

Ah, scientific notation, of course. Thanks! This tool is so cool and makes pipelining very intuitive.

Thanks for the very prompt bug fixes!