shenwei356 / csvtk

A cross-platform, efficient and practical CSV/TSV toolkit in Golang
http://bioinf.shenwei.me/csvtk
MIT License

sort x before plot line #280

Open janxkoci opened 1 month ago

I noticed a problem with line plots and found that it is easily solved by pre-sorting the data used for the X axis. Since the other plotting tools I use (such as base R or ggplot2) do not ask me to sort data before plotting, I think csvtk should do this automatically as well.

reproducible examples

First, let's generate some example data:

$ seq 1 10 | sort -R | awk '{print $1 "," NR}' > csvtk_plot_xorder_bug.csv
$ cat csvtk_plot_xorder_bug.csv
9,1
5,2
7,3
1,4
2,5
6,6
4,7
10,8
3,9
8,10
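Because `sort -R` shuffles differently on every run, the file generated above will not match the rows listed here. A minimal sketch for recreating exactly the rows shown, assuming only a POSIX shell with `printf` (the file name is the one used in this issue):

```shell
# Write the ten X,Y rows exactly as listed in this issue (no header row),
# so the plots below can be reproduced with identical data.
printf '%s\n' 9,1 5,2 7,3 1,4 2,5 6,6 4,7 10,8 3,9 8,10 > csvtk_plot_xorder_bug.csv

# Show the result; the first column is X (unsorted), the second is Y.
cat csvtk_plot_xorder_bug.csv
```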

Now, we can try some plots.

wrong plots

Using the following code:

csvtk -H plot line -x 1 -y 2 csvtk_plot_xorder_bug.csv > csvtk_plot_xorder_xunsorted.png

we get this plot:

[image: csvtk_plot_xorder_xunsorted]

That's obviously wrong, although the rows still happen to be sorted by the Y values. Shuffling the rows messes the plot up even more:

sort -R csvtk_plot_xorder_bug.csv | csvtk -H plot line -x 1 -y 2 > csvtk_plot_xorder_xrandom.png

[image: csvtk_plot_xorder_xrandom]

good plot

The solution is simply to sort the data by the X column before plotting:

csvtk -H sort -k 1:n csvtk_plot_xorder_bug.csv | csvtk -H plot line -x 1 -y 2 > csvtk_plot_xorder_xsorted.png

And the result:

[image: csvtk_plot_xorder_xsorted]

This is much nicer and should be done by default whenever we want to make line plots, just like R or ggplot2 do.
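Until that happens, a small guard in a plotting script can catch unsorted input. The following is only a sketch under stated assumptions: the helper `x_is_sorted` and the `xorder_demo*.csv` file names are hypothetical, the input is comma-separated with X in column 1 and no header row, and it relies on `sort -c` from coreutils rather than on anything csvtk provides.

```shell
# Sample rows copied from this issue (X,Y, no header row).
printf '%s\n' 9,1 5,2 7,3 1,4 2,5 6,6 4,7 10,8 3,9 8,10 > xorder_demo.csv

# Hypothetical helper: succeeds (exit 0) if column 1 of a comma-separated
# file is already in ascending numeric order. `sort -c` only checks the
# order; `-s` keeps rows with equal X values from counting as disorder.
x_is_sorted() {
    sort -t, -k1,1n -s -c "$1" 2>/dev/null
}

if x_is_sorted xorder_demo.csv; then
    echo "X is already sorted"
    cp xorder_demo.csv xorder_demo_sorted.csv
else
    echo "X is not sorted; sorting before plotting"
    sort -t, -k1,1n -s xorder_demo.csv > xorder_demo_sorted.csv
fi
```

The sorted file can then be passed to the same `csvtk -H plot line -x 1 -y 2` command used above. Sorting with `csvtk -H sort -k 1:n`, as in this issue, should work equally well for that step; plain `sort` is used here only to keep the sketch free of extra dependencies.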

notes

Originally, I came across the issue when making real plots such as this one:

[image: sgdp_nd1x]

Only much later did I find that sorting the data fixes the issue. Because that was not obvious, I had resorted to scatter plots instead, which are not always a good fit.