shenwei356 / csvtk

A cross-platform, efficient and practical CSV/TSV toolkit in Golang
http://bioinf.shenwei.me/csvtk
MIT License
992 stars 84 forks source link

Grep and mutate (grepl) #275

Closed iferres closed 3 months ago

iferres commented 3 months ago

I may not be seeing the answer, but I think it is not possible right now. I want to create a new column based on the presence or absence (logical) of a substring inside a string column. With grep I get the matches:

echo -e "lorem,1\nloremipsum,232\nlorem,334\nipsum,214234\nfoobar,4" |
   csvtk -H grep -f1 -r -p  "ipsum"  
# loremipsum,232
# ipsum,214234

With mutate I create a new column with the substring, but if doesn't exists, it returns the whole column:

echo -e "lorem,1\nloremipsum,232\nlorem,334\nipsum,214234\nfoobar,4" | 
   csvtk -H mutate -f1 -p  "(ipsum)" | 
   csvtk -H pretty
# lorem        1        lorem 
# loremipsum   232      ipsum 
# lorem        334      lorem 
# ipsum        214234   ipsum 
# foobar       4        foobar

Would be possible to ask if substring exists, and return it in a logical column? The expected output would be:

# lorem        1        false 
# loremipsum   232      true 
# lorem        334      false 
# ipsum        214234   true 
# foobar       4        false

Or the possibility of leaving blank (or NA/null) cells in case no matches are detected, something like:

# lorem        1        NA 
# loremipsum   232      ipsum 
# lorem        334      NA 
# ipsum        214234   ipsum 
# foobar       4        NA

or just leaving them empty.

shenwei356 commented 3 months ago

There's a --na flag.

      --na               for unmatched data, use blank instead of original data
$ echo -e "lorem,1\nloremipsum,232\nlorem,334\nipsum,214234\nfoobar,4"  \
    | csvtk mutate -H -f1 -p "(ipsum)" --na \
    | csvtk pretty -H
lorem        1             
loremipsum   232      ipsum
lorem        334           
ipsum        214234   ipsum
foobar       4

Well, if you'd like a NA.

$ echo -e "lorem,1\nloremipsum,232\nlorem,334\nipsum,214234\nfoobar,4"  \
    | csvtk mutate -H -f1 -p "(ipsum)" --na \
    | csvtk replace -H -f3 -p '^$' -r NA \
    | csvtk pretty -H
lorem        1        NA   
loremipsum   232      ipsum
lorem        334      NA   
ipsum        214234   ipsum
foobar       4        NA 
iferres commented 3 months ago

Awesome, many thanks!