shenwei356 / csvtk

A cross-platform, efficient and practical CSV/TSV toolkit in Golang
http://bioinf.shenwei.me/csvtk
MIT License
999 stars 84 forks

`csvtk pretty` with less takes long to render #209

Closed Liooo closed 1 year ago

Liooo commented 1 year ago

When I run `cat huge_file.csv | less`, it shows the first N rows immediately, but when I run `cat huge_file.csv | csvtk pretty | less`, it takes a long time to get any output. Is this perhaps a Unix pipe buffer sizing issue?

shenwei356 commented 1 year ago

That's because `csvtk pretty` needs to load all the data. Try `head -n 1000 huge_file.csv | csvtk pretty | less`.

Liooo commented 1 year ago

Thanks for the quick response.

> That's because `csvtk pretty` needs to load all the data.

But it doesn't have to, does it?

> Try to use ...

I've been using `head`, but every time it makes me think it'd be much nicer if piping worked out of the box.

Liooo commented 1 year ago

> But it doesn't have to, does it?

Oh, it's to determine the column widths, right. It feels like this could be worked around by using a fixed column width and truncating longer text with an ellipsis, when an option is given or something.
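
To make the fixed-width idea concrete, here is a minimal Python sketch. It is not csvtk code (csvtk is written in Go), and the names `ellipsize` and `format_row` are hypothetical. It also assumes one character occupies one terminal column, which does not hold for wide Unicode characters.

```python
def ellipsize(text: str, width: int, marker: str = "...") -> str:
    """Clip text to at most `width` characters, ending in `marker` if clipped."""
    if width < len(marker):
        raise ValueError("width must be at least len(marker)")
    if len(text) <= width:
        return text
    return text[: width - len(marker)] + marker


def format_row(cells, width=10):
    """Render one row with every column padded or clipped to the same width."""
    return " | ".join(ellipsize(c, width).ljust(width) for c in cells)


# Each row can be printed as soon as it is read: no row depends on any other.
print(format_row(["id", "a very long description", "ok"]))
```

Because the width is fixed up front, no row has to wait for the rest of the file, which is what lets a pager like `less` show output immediately.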

shenwei356 commented 1 year ago

Yes, it's on the to-do list. https://github.com/shenwei356/csvtk/issues/206

Liooo commented 1 year ago

Oh, so `-W` is already there; then this feature should be ready to be developed, am I correct?

#206 seems to be about text wrapping and not really related to pre-determining the column width, I assume.

I can't promise an ETA, but would you accept a PR if I made one? Say the signature is something like:

csvtk --pipe (or -P) # utilizes unix pipe buffer for large files, uses `-W 10` internally by default
csvtk --pipe -W 30   # when changing the width from default `-W 10`

shenwei356 commented 1 year ago

> Oh, so `-W` is already there; then this feature should be ready to be developed, am I correct?

Not started yet.

> #206 seems to be about text wrapping and not really related to pre-determining the column width, I assume.

They are related.


Here's my plan.

  1. Read the first N (say, 100) lines to determine the maximum width of each column.
    • Widths greater than "-W" are capped at "-W".
  2. Format these N lines.
  3. Format the remaining lines.
    • If the widths of some columns exceed the pre-determined value, wrap the content to multiple lines.

It can be applied to streaming data from the standard input pipe or any file.
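
The plan above can be sketched roughly as follows. This is an illustrative Python sketch, not csvtk's Go implementation. The names `stream_pretty`, `render`, `chunk`, `buf_rows` and `max_width` are hypothetical, and every row is assumed to have the same number of columns. Overlong cells are wrapped by plain fixed-width chunking.

```python
import csv
import io
from itertools import chain, islice


def chunk(text, width):
    """Split text into pieces of at most `width` characters (at least one piece)."""
    return [text[i:i + width] for i in range(0, len(text), width)] or [""]


def render(row, widths):
    """Yield one or more output lines for a row, wrapping overlong cells."""
    pieces = [chunk(cell, w) for cell, w in zip(row, widths)]
    height = max((len(p) for p in pieces), default=1)
    for i in range(height):
        cells = ((p[i] if i < len(p) else "").ljust(w) for p, w in zip(pieces, widths))
        yield "   ".join(cells).rstrip()


def stream_pretty(rows, buf_rows=100, max_width=20):
    """Lazily format rows; only the first `buf_rows` rows are held in memory."""
    rows = iter(rows)
    head = list(islice(rows, buf_rows))  # step 1: buffer the first N rows
    if not head:
        return
    # Column width = widest buffered cell, capped at max_width (and at least 1).
    widths = [
        max(1, min(max_width, max(len(r[i]) for r in head)))
        for i in range(len(head[0]))
    ]
    # Steps 2 and 3: emit the buffered rows, then the rest one at a time.
    for row in chain(head, rows):
        yield from render(row, widths)


data = "id,name\n1,alice\n2,bob\n3,a-much-longer-name\n"
for line in stream_pretty(csv.reader(io.StringIO(data)), buf_rows=3, max_width=8):
    print(line)
```

Since `stream_pretty` is a generator, a consumer such as a pager only pulls as many rows as it displays, after the initial `buf_rows` rows have been read.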

Liooo commented 1 year ago

> If the widths of some columns exceed the pre-determined value, wrap the content to multiple lines.

I think this should be applied only when a `--wrap` option is specified; otherwise the text should be cut off at the `-W` length. Often, for readability, we don't want one CSV row to span multiple lines.

shenwei356 commented 1 year ago

Hmm, that makes sense. But then we'd need to read the file twice or hold all the data in memory (the current way).

shenwei356 commented 1 year ago

Implemented. The output is streamed now, so you can pipe it to other tools like `more` or `less`.

Please check here: https://github.com/shenwei356/csvtk/issues/206#issuecomment-1609358555

How to:
  1. First -n/--buf-rows rows are read to check the minimum and maximum widths
     of each column. You can also set the global thresholds -w/--min-width and
     -W/--max-width.
     1a. Cells longer than the maximum width will be wrapped (default) or
         clipped (--clip).
         Usually, the text is wrapped in space (-x/--wrap-delimiter). But if one
         word is longer than the -W/--max-width, it will be force split.
     1b. Texts are aligned left (default), center (-m/--align-center)
         or right (-r/--align-right).
  2. Remaining rows are read and immediately outputted, one by one, till the end.
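
The cell-level behaviour in steps 1a and 1b can be illustrated with a short Python sketch. It is not csvtk's code: the helper names `wrap_cell` and `align` are hypothetical, and Python's standard `textwrap` module stands in for csvtk's own wrapping logic. `textwrap` breaks at whitespace and force-splits words longer than the width.

```python
import textwrap


def wrap_cell(text, max_width, clip=False):
    """Return the display lines of one cell, clipped or wrapped to `max_width`."""
    if clip:
        return [text[:max_width]]
    # Break at whitespace; words longer than max_width are force-split.
    return textwrap.wrap(text, width=max_width, break_long_words=True) or [""]


def align(text, width, how="left"):
    """Pad text to `width`, aligned left, center, or right."""
    pad = {"left": text.ljust, "center": text.center, "right": text.rjust}[how]
    return pad(width)


for line in wrap_cell("hello big world", 9):
    print("|" + align(line, 9, "right") + "|")
```

A wrapped cell yields several lines, so the whole row grows to the height of its tallest cell. A clipped cell always stays on a single line.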

Liooo commented 1 year ago

@shenwei356

thanks so much 🚀