shenwei356 / csvtk

A cross-platform, efficient and practical CSV/TSV toolkit in Golang
http://bioinf.shenwei.me/csvtk
MIT License
992 stars 84 forks source link

csvtk spread doesn't properly aggregate values for duplicated keys #248

Closed MostafaYA closed 11 months ago

MostafaYA commented 11 months ago

Prerequisites

Describe your issue

csvtk spread does not collpase duplicated values when exits

$ cat test.txt 
sample  contig  gene    coverage 
sample1 contig_1        a       100
sample1 contig_2        a       70
sample2 contig_2        a       100
sample3 contig_3        a       100
$ csvtk cut -t -f 1,3,4 test.txt | csvtk spread  -t -k 2 -v 3
[WARN] duplicated record: a (70) for sample1 at line 3
sample  a
sample1 100
sample2 100
sample3 100

Desired output

sample  a
sample1 100;70 
sample2 100
sample3 100

BTW: there is also some typos in flags

-f should be -k or -v in the examples

  -h, --help           help for spread
  -k, --key string     field of keys. e.g -f 1,2 or -f columnA,columnB, or -f -columnA for unselect columnA
      --na string      content for filling NA data
  -v, --value string   field of values. e.g -f 1,2 or -f columnA,columnB, or -f -columnA for unselect columnA

Thank you

I'm grateful to users who have greatly helped to report bugs and suggested new features.

I may respond to issues or fix bugs quickly, but I usually implement new features periodically (two or more weeks).

shenwei356 commented 11 months ago

Using fold?

$ cat test.txt | csvtk fold -t -f sample,gene -v coverage
sample  gene    coverage
sample1 a       100; 70
sample2 a       100
sample3 a       100
MostafaYA commented 11 months ago

fold will not convert into a matrix if I have hits for several genes. Here is another example

$ cat test.txt
sample  contig  gene    coverage
sample1 contig_1        a       100
sample1 contig_2        a       70
sample2 contig_2        a       100
sample3 contig_3        a       100
sample3 contig_3        b       100
$ cat test.txt | csvtk fold -t -f sample,gene -v coverage
sample  gene    coverage
sample1 a       100; 70
sample2 a       100
sample3 a       100
sample3 b       100
$ csvtk cut -t -f 1,3,4 test.txt | csvtk spread  -t -k 2 -v 3 --na "-" | csvtk pretty -t 
[WARN] duplicated record: a (70) for sample1 at line 3
sample    a     b  
-------   ---   ---
sample1   100   -  
sample2   100   -  
sample3   100   100
MostafaYA commented 11 months ago

Ok now I got it using foldthen spread?

$ cat test.txt | csvtk fold -t -f sample,gene -v coverage | csvtk spread  -t -k 2 -v 3 --na "-" | csvtk pretty -t
sample    a         b  
-------   -------   ---
sample1   100; 70   -  
sample2   100       -  
sample3   100       100
MostafaYA commented 11 months ago

Note: You should, however, rework the warning message of the spread command, as it simply ignores the values and this can lead to misleading results. You can either raise it to an error or integrate value folding into this command.

shenwei356 commented 11 months ago

I choose to fold the values, instead of reporting an error as the tidyr::spread does.

$ csvtk cut -t -f 1,3,4 test.txt \
    | csvtk spread  -t -k 2 -v 3 \
    | csvtk pretty -t -S bold
┏━━━━━━━━━┳━━━━━━━━━┳━━━━━┓
┃ sample  ┃ a       ┃ b   ┃
┣━━━━━━━━━╋━━━━━━━━━╋━━━━━┫
┃ sample1 ┃ 100; 70 ┃     ┃
┣━━━━━━━━━╋━━━━━━━━━╋━━━━━┫
┃ sample2 ┃ 100     ┃     ┃
┣━━━━━━━━━╋━━━━━━━━━╋━━━━━┫
┃ sample3 ┃ 100     ┃ 100 ┃
┗━━━━━━━━━┻━━━━━━━━━┻━━━━━┛

Maybe you can also compute the mean with summary before calling spread.

$ csvtk summary -t  -g sample,gene -f coverage:mean -w 0 test.txt  \
    | csvtk spread -t -k gene -v coverage:mean \
    | csvtk pretty -t -S bold
┏━━━━━━━━━┳━━━━━┳━━━━━┓
┃ sample  ┃ a   ┃ b   ┃
┣━━━━━━━━━╋━━━━━╋━━━━━┫
┃ sample1 ┃ 85  ┃     ┃
┣━━━━━━━━━╋━━━━━╋━━━━━┫
┃ sample2 ┃ 100 ┃     ┃
┣━━━━━━━━━╋━━━━━╋━━━━━┫
┃ sample3 ┃ 100 ┃ 100 ┃
┗━━━━━━━━━┻━━━━━┻━━━━━┛