Closed derekmahar closed 4 years ago
Absolutely, sorting in csvq is not stable. But as far as I know, there is no RDBMS that guarantees that the sorting is stable.
When using SQL, if the data you want to operate does not have any unique keys, you can preserve the original data order by using the analytic function ROW_NUMBER.
SELECT ROW_NUMBER() OVER () AS n,
timestamp,
name,
value
FROM `test.csv`
ORDER BY timestamp, n;
Absolutely, sorting in csvq is not stable. But as far as I know, there is no RDBMS that guarantees that the sorting is stable.
It makes sense that an RDBMS would not implement stable sorting because the default order of rows in an RDBMS table is not defined. However, in a conventional CSV file, stable sorting makes more sense because the default row order is the same as the line insertion order. Nevertheless, I understand why csvq
might intentionally dismiss or ignore the regular line order in a CSV file in order to treat a CSV file more like an RDBMS table.
When using SQL, if the data you want to operate does not have any unique keys, you can preserve the original data order by using the analytic function ROW_NUMBER.
SELECT ROW_NUMBER() OVER () AS n, timestamp, name, value FROM `test.csv` ORDER BY timestamp, n;
Does the SQL standard require that
ROW_NUMBER()
order match row insertion order? If not, then neitherROW_NUMBER()
nor stable sort would be sufficient to preserve the temporal order of rows that have matching timestamps using different database implementations. The only way that I could guarantee the temporal order of these rows would be to introduce a sequence number as a surrogate timestamp that increments each time a client inserts a new row in the table. Unfortunately, not only would this sequence number add a column to the CSV file and take more space, but it would also be awkward to maintain. Nevertheless, I think sorting the rows in a table on this sequence number may be the only way that I could display the rows in identical temporal order in any RDBMS.
In many RDBMSs, the insertion order is not preserved unless a specific column is prepared and the data is explicitly inserted, but the order at a particular point can be preserved by using ROW_NUMBER. And in the csvq, simple queries, such as not sorting, preserve the order of the original files, so using ROW_NUMBER guarantees that the order is preserved.
You can also use a subquery in a one-time query, or use the CREATE SELECT syntax if you want to keep it permanently.
In any case, it doesn't require complicated operations. Changing the sorting algorithm also affects the performances, so I don’t feel that need to change its specifications for a problem that can be solved easily with sql.
You can also use a subquery in a one-time query, or use the CREATE SELECT syntax if you want to keep it permanently.
What is the CREATE SELECT syntax? Do you mean Create from the Result-Set of a Select Query?
Yes, it is the syntax.
In many RDBMSs, the insertion order is not preserved unless a specific column is prepared and the data is explicitly inserted, but the order at a particular point can be preserved by using ROW_NUMBER.
An example of the former might be the separate sequence number column that I described earlier.
And in the csvq, simple queries, such as not sorting, preserve the order of the original files, so using ROW_NUMBER guarantees that the order is preserved.
Right, so in csvq
, due to the line-oriented structure of CSV files, ROW_NUMBER()
has the same effect as a separate number column.
You can also use a subquery in a one-time query, or use the CREATE SELECT syntax if you want to keep it permanently.
Yes, or even a Common Table Expression.
In any case, it doesn't require complicated operations. Changing the sorting algorithm also affects the performances, so I don’t feel that need to change its specifications for a problem that can be solved easily with sql.
ROW_NUMBER() OVER ()
may look simple, but it is actually more advanced SQL window function syntax that may be less familiar to many SQL programmers.mlr sort
implements a fast and stable sorting algorithm using Quicksort even though Quicksort is unstable. Does mlr sort
use the row number trick internally to extend the sort key?--stable-sort
option would still be nice to have if for no other reason than it would guarantee that the results of an ordinary ORDER BY
query would match those of the stable sort operation of other CSV processors like Miller.It sure looks like that Miller use quick short with external row number. And if that's the case, it's no different from sorting with ROW_NUMBER in csvq.
The command options of csvq are basically related to input and output, and the internal processing is made to work independently. A command option affects the entire sequence of processing, so l think that the command option that is useful only certain cases is not appropriate for csvq, which can process multiple statements at once.
I don't think ROW_NUMBER is so advanced, and I don’t implement it at this time, but I’ll keep the suggestion in mind.
ORDER BY
sort is not stable meaning that for rows that have the same sort key,ORDER BY
incsvq
does not preserve the order of these input rows in the output. This is a problem, for example, for applications that involve financial transactions that require the temporal order of transactions to be preserved.Note that
sort --stable
from GNU coreutils andmlr sort
generate sort results that are stable, butORDER BY
incsvq
changes the order of rows having the same key.(Input data is available in
test.csv
.)