mithrandie / csvq

SQL-like query language for csv
https://mithrandie.github.io/csvq
MIT License
1.5k stars 65 forks source link

"ORDER BY" sort is not stable. #23

Closed derekmahar closed 4 years ago

derekmahar commented 4 years ago

ORDER BY sort is not stable meaning that for rows that have the same sort key, ORDER BY in csvq does not preserve the order of these input rows in the output. This is a problem, for example, for applications that involve financial transactions that require the temporal order of transactions to be preserved.

$ cat test.csv | sort --field-separator=, --key=1g > test-sort-unstable-unix.csv
$ cat test.csv | sort --field-separator=, --key=1g --stable > test-sort-stable-unix.csv
$ cat test.csv | mlr --csv sort -f timestamp > test-sort-mlr.csv
$ cat test.csv | csvq --format CSV "SELECT * FROM STDIN ORDER BY timestamp" > test-sort-csvq.csv
$ diff --brief --report-identical-files test-sort-stable-unix.csv test-sort-unstable-unix.csv
Files test-sort-stable-unix.csv and test-sort-unstable-unix.csv differ
$ diff --side-by-side --brief --report-identical-files test-sort-stable-unix.csv test-sort-mlr.csv
Files test-sort-stable-unix.csv and test-sort-mlr.csv are identical
$ diff --side-by-side --brief --report-identical-files test-sort-stable-unix.csv test-sort-csvq.csv
Files test-sort-stable-unix.csv and test-sort-csvq.csv differ
$ diff --side-by-side test-sort-stable-unix.csv test-sort-unstable-unix.csv
.
.
.
1463273940,B,0.04629056                                       <
1463273940,B,0.0462813                                          1463273940,B,0.0462813
                                                              > 1463273940,B,0.04629056
.
.
.
$ diff --side-by-side test-sort-stable-unix.csv test-sort-unstable-unix.csv
.
.
.
1454175635.753,J,-0.17188999                                  <
1454175635.753,J,-0.35                                          1454175635.753,J,-0.35
1454175635.753,J,-0.5                                         <
1454175635.753,J,-0.64773132                                    1454175635.753,J,-0.64773132
                                                              > 1454175635.753,J,-0.5
                                                              > 1454175635.753,J,-0.17188999
.
.
.
1463273940,B,0.04629056                                       <
1463273940,B,0.0462813                                          1463273940,B,0.0462813
                                                              > 1463273940,B,0.04629056
.
.
.
$

Note that sort --stable from GNU coreutils and mlr sort generate sort results that are stable, but ORDER BY in csvq changes the order of rows having the same key.

(Input data is available in test.csv.)

mithrandie commented 4 years ago

Absolutely, sorting in csvq is not stable. But as far as I know, there is no RDBMS that guarantees that the sorting is stable.

When using SQL, if the data you want to operate does not have any unique keys, you can preserve the original data order by using the analytic function ROW_NUMBER.

SELECT ROW_NUMBER() OVER () AS n,
       timestamp,
       name,
       value
  FROM `test.csv`
ORDER BY timestamp, n;
derekmahar commented 4 years ago

Absolutely, sorting in csvq is not stable. But as far as I know, there is no RDBMS that guarantees that the sorting is stable.

It makes sense that an RDBMS would not implement stable sorting because the default order of rows in an RDBMS table is not defined. However, in a conventional CSV file, stable sorting makes more sense because the default row order is the same as the line insertion order. Nevertheless, I understand why csvq might intentionally dismiss or ignore the regular line order in a CSV file in order to treat a CSV file more like an RDBMS table.

When using SQL, if the data you want to operate does not have any unique keys, you can preserve the original data order by using the analytic function ROW_NUMBER.

SELECT ROW_NUMBER() OVER () AS n,
       timestamp,
       name,
       value
  FROM `test.csv`
ORDER BY timestamp, n;

Does the SQL standard require that ROW_NUMBER() order match row insertion order? If not, then neither ROW_NUMBER() nor stable sort would be sufficient to preserve the temporal order of rows that have matching timestamps using different database implementations. The only way that I could guarantee the temporal order of these rows would be to introduce a sequence number as a surrogate timestamp that increments each time a client inserts a new row in the table. Unfortunately, not only would this sequence number add a column to the CSV file and take more space, but it would also be awkward to maintain. Nevertheless, I think sorting the rows in a table on this sequence number may be the only way that I could display the rows in identical temporal order in any RDBMS.

mithrandie commented 4 years ago

In many RDBMSs, the insertion order is not preserved unless a specific column is prepared and the data is explicitly inserted, but the order at a particular point can be preserved by using ROW_NUMBER. And in the csvq, simple queries, such as not sorting, preserve the order of the original files, so using ROW_NUMBER guarantees that the order is preserved.

You can also use a subquery in a one-time query, or use the CREATE SELECT syntax if you want to keep it permanently.

In any case, it doesn't require complicated operations. Changing the sorting algorithm also affects the performances, so I don’t feel that need to change its specifications for a problem that can be solved easily with sql.

derekmahar commented 4 years ago

You can also use a subquery in a one-time query, or use the CREATE SELECT syntax if you want to keep it permanently.

What is the CREATE SELECT syntax? Do you mean Create from the Result-Set of a Select Query?

mithrandie commented 4 years ago

Yes, it is the syntax.

derekmahar commented 4 years ago

In many RDBMSs, the insertion order is not preserved unless a specific column is prepared and the data is explicitly inserted, but the order at a particular point can be preserved by using ROW_NUMBER.

An example of the former might be the separate sequence number column that I described earlier.

And in the csvq, simple queries, such as not sorting, preserve the order of the original files, so using ROW_NUMBER guarantees that the order is preserved.

Right, so in csvq, due to the line-oriented structure of CSV files, ROW_NUMBER() has the same effect as a separate number column.

You can also use a subquery in a one-time query, or use the CREATE SELECT syntax if you want to keep it permanently.

Yes, or even a Common Table Expression.

In any case, it doesn't require complicated operations. Changing the sorting algorithm also affects the performances, so I don’t feel that need to change its specifications for a problem that can be solved easily with sql.

mithrandie commented 4 years ago

It sure looks like that Miller use quick short with external row number. And if that's the case, it's no different from sorting with ROW_NUMBER in csvq.

The command options of csvq are basically related to input and output, and the internal processing is made to work independently. A command option affects the entire sequence of processing, so l think that the command option that is useful only certain cases is not appropriate for csvq, which can process multiple statements at once.

I don't think ROW_NUMBER is so advanced, and I don’t implement it at this time, but I’ll keep the suggestion in mind.