shenwei356 / csvtk

A cross-platform, efficient and practical CSV/TSV toolkit in Golang
http://bioinf.shenwei.me/csvtk
MIT License

`xlsx2csv` splits a row into two lines #187

Closed: apcamargo closed this issue 2 years ago

apcamargo commented 2 years ago

I'm not entirely sure what is causing this, but here are the steps to reproduce:

aria2c -o ictv.xlsx https://talk.ictvonline.org/taxonomy/vmr/m/vmr-file-repository/13426/download
csvtk xlsx2csv ictv.xlsx > ictv.csv

Then, lines 4375 and 4376 of ictv.csv contain:

4374,1,Monodnaviria,,Shotokuvirae,,Cossaviricota,,Quintoviricetes,,Piccovirales,,Parvoviridae,Hamaparvovirinae,Chaphamaparvovirus,,Dasyurid chaphamaparvovirus 1,E,Tasmanian devil-associated chapparvovirus 1,TdChPV1,"Tasmania/Sarcophilus_harrisii/2017/frag_3871_SRR8
 048111",MK513528,Complete coding genome,ssDNA(+/-),vertebrates

This should be a single line.

shenwei356 commented 2 years ago

The original .xlsx file accidentally contains a newline character followed by a space inside a cell on line 4375:


4375 Tasmania/Sarcophilus_harrisii/2017/frag_3871_SRR8
 048111
4376 Tasmania/Sarcophilus_harrisii/2017/frag_4262_SRR8048117
4377 Tasmania/Sarcophilus_harrisii/2017/frag_4482_SRR8048117

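Luckily, the CSV parser used by csvtk tolerates this, because it supports quoted CSV values that span multiple lines, so the output above is still a single valid record. To put it back on one physical line, just remove the "\n " sequence (a newline followed by a space) from every field: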

csvtk xlsx2csv ictv.xlsx | csvtk replace -F -f "*" -p "\n " -r "" > ictv.csv
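
To illustrate why the output is still valid CSV and what the command above does, here is a minimal Python sketch. It is not csvtk's implementation; it uses Python's standard `csv` module, and the three-column sample (the header names and the second data row) is a hypothetical stand-in for the real file. Only the isolate name and accession of the first data row come from this thread.

```python
import csv
import io

# Hypothetical miniature of ictv.csv. The quoted field in the first data row
# contains a newline followed by a space, as in the reported cell.
raw = (
    "id,isolate,accession\n"
    '1,"Tasmania/Sarcophilus_harrisii/2017/frag_3871_SRR8\n 048111",MK513528\n'
    "2,Tasmania/Sarcophilus_harrisii/2017/frag_4262_SRR8048117,EXAMPLE\n"
)


def read_rows(text):
    """Parse CSV text; newline='' leaves embedded newlines untouched."""
    return list(csv.reader(io.StringIO(text, newline="")))


def clean_rows(rows):
    """Remove every 'newline + space' sequence from every field."""
    return [[field.replace("\n ", "") for field in row] for row in rows]


rows = read_rows(raw)
cleaned = clean_rows(rows)

# The text has more physical lines than CSV records, because one quoted
# value spans two lines.
print(len(raw.splitlines()), "physical lines,", len(rows), "records")
print(cleaned[1][1])
```

The point of the sketch is that a line-oriented view (for example `wc -l` or a text editor) counts the broken cell as two lines, while a CSV parser sees one record both before and after the cleanup.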

PS: I also found other unwanted characters, e.g., M-BM- characters (these look like the way `cat -v` displays the UTF-8 bytes of a non-breaking space), which are handled in https://github.com/shenwei356/ictv-taxdump#steps
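
As a rough illustration of that kind of cleanup, the Python sketch below assumes the M-BM- sequences are non-breaking spaces (U+00A0, UTF-8 bytes 0xC2 0xA0) and that replacing them with ordinary spaces is acceptable. Both points are assumptions, the sample string is hypothetical, and the linked repository may handle these characters differently.

```python
NBSP = "\u00a0"  # non-breaking space; encodes to bytes C2 A0 in UTF-8


def replace_nbsp(text: str) -> str:
    """Replace each non-breaking space with a regular space (assumed policy)."""
    return text.replace(NBSP, " ")


# Hypothetical field value containing a non-breaking space.
sample = "Dasyurid" + NBSP + "chaphamaparvovirus 1"
fixed = replace_nbsp(sample)

print(NBSP.encode("utf-8"))  # the two bytes that cat -v would render
print(fixed)
```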

apcamargo commented 2 years ago

Thanks, @shenwei356!