mithrandie / csvq

SQL-like query language for csv
https://mithrandie.github.io/csvq
MIT License
1.5k stars 65 forks source link

Incorrect or inconsistent justification in table column. #13

Closed derekmahar closed 4 years ago

derekmahar commented 4 years ago

For the following recursive CTE query and CSV input file, csvq incorrectly or inconsistently right justifies column TotalQuantity in all but the first row of the result set, but correctly (or consistently) left justifies the numbers in the first two columns.

Should csvq left or right justify numbers in the columns of a result set?

Query select-running-total-quantity.sql:

WITH ordered_quantities AS (
  SELECT *
  FROM quantities
  ORDER BY Timestamp
),
RECURSIVE total_quantities(Timestamp, Quantity, TotalQuantity) AS (
  SELECT
    Timestamp,
    Quantity,
    Quantity
  FROM ordered_quantities
  WHERE Timestamp = 1
  UNION ALL
  SELECT
    q.Timestamp,
    q.Quantity,
    q.Quantity + tq.TotalQuantity
  FROM quantities q
  JOIN total_quantities tq
    ON tq.Timestamp = q.Timestamp - 1
)
SELECT *
FROM total_quantities;

Input file quantities.csv:

Timestamp,Quantity
1,123
2,67
3,23
4,6
5,89
6,27
7,105

Result set:

$ csvq --source select-running-total-quantity.sql
+-----------+----------+---------------+
| Timestamp | Quantity | TotalQuantity |
+-----------+----------+---------------+
| 1         | 123      | 123           |
| 2         | 67       |           190 |
| 3         | 23       |           213 |
| 4         | 6        |           219 |
| 5         | 89       |           308 |
| 6         | 27       |           335 |
| 7         | 105      |           440 |
+-----------+----------+---------------+

Notice that in column TotalQuantity only the first row is left justified as are all of the numbers in columns Timestamp and Quantity.

More importantly, however, the good news the result of the query is correct!

mithrandie commented 4 years ago

The behavior is as expected by csvq. In some cases, in the result set of csvq, one column may contain different types.

In the processing of csvq, all values in csv files are initially read as strings. The value types are converted only if the values are calculated. To prevent unexpected type conversion for users, implicit type conversion is not performed except in that case.

In this query, TotalQuantity of the first row and other columns are specified with only the column name in the file quantities.csv, so these values are displayed as string values. The values of second row and after in the column Total Quantity are converted to integer values by the expression q.Quantity + tq.TotalQuantity, so these values are right justified.

To unify the value types in the result set, you can use cast functions, such as INTEGER(), in the select clause.

WITH ordered_quantities AS (
  SELECT *
  FROM quantities
  ORDER BY Timestamp
),
RECURSIVE total_quantities(Timestamp, Quantity, TotalQuantity) AS (
  SELECT
    Timestamp,
    Quantity,
    Quantity
  FROM ordered_quantities
  WHERE Timestamp = 1
  UNION ALL
  SELECT
    q.Timestamp,
    q.Quantity,
    q.Quantity + tq.TotalQuantity
  FROM quantities q
  JOIN total_quantities tq
    ON tq.Timestamp = q.Timestamp - 1
)
SELECT
  Timestamp,
  Quantity,
  INTEGER(TotalQuantity) AS TotalQuantity
FROM total_quantities;
derekmahar commented 4 years ago

Thank you for your answer.

How can one specify the type of a column in a CSV?

derekmahar commented 4 years ago

I modified my query slightly differently than you did in your solution such that I impose the column types in the first table in the CTE and then sort the rows in a second table:

WITH typed_quantities AS (
  SELECT 
    INTEGER(Timestamp) AS Timestamp,
    INTEGER(Quantity) AS Quantity
  FROM quantities
),
ordered_typed_quantities AS (
  SELECT *
  FROM typed_quantities
  ORDER BY Timestamp
),
RECURSIVE total_quantities(Timestamp, Quantity, TotalQuantity) AS (
  SELECT
    Timestamp,
    Quantity,
    Quantity
  FROM ordered_typed_quantities
  WHERE Timestamp = 1
  UNION ALL
  SELECT
    q.Timestamp,
    q.Quantity,
    q.Quantity + tq.TotalQuantity
  FROM ordered_typed_quantities q
  JOIN total_quantities tq
    ON tq.Timestamp = q.Timestamp - 1
)
SELECT *
FROM total_quantities;
mithrandie commented 4 years ago

The types of values are converted when the values are calculated as necessary.

For example, in arithmetic operations, the values will be converted to integer or float. When using a function, the values will be converted to suitable types for that function. If that value cannot be converted to the suitable type, the value will be converted to Null.

e.g.

SELECT DATETIME(q.Timestamp) AS Datetime, -- Datetime
       INTEGER(q.Quantity) AS Quantity, -- Integer
       q.Quantity || ' pieces' AS Pieces, -- String
       q.Quantity / (SELECT SUM(q2.Quantity) FROM `quantities.csv` q2) AS Percentage -- Float
  FROM `quantities.csv` q;

Please refer to the following link for details about type conversion. https://mithrandie.github.io/csvq/reference.html https://mithrandie.github.io/csvq/reference/value.html#automatic_type_casting