mithrandie / csvq

SQL-like query language for csv
https://mithrandie.github.io/csvq
MIT License
1.49k stars 65 forks source link

missing support for comma as a decimal separator (SUM, AVG return NULL) #90

Open githubuser181226 opened 1 year ago

githubuser181226 commented 1 year ago

Hi. I'm sad to report the SUM function doesn't work as expected with floating point values.

Consider this example.csv

Col1;Col2;Col3 1;"Text1";1,01 2;"Text2";2,02 3;"Text3";3,03

When running csvq -d ; "select min(Col3),max(Col3),sum(Col3) from example"

I get: +-----------+-----------+-----------+ | MIN(Col3) | MAX(Col3) | SUM(Col3) | +-----------+-----------+-----------+ | 1,01 | 3,03 | NULL | +-----------+-----------+-----------+`

Now consider example2.csv where I replaces commas for dots as floating point separator:

Col1;Col2;Col3 1;"Text1";1.01 2;"Text2";2.02 3;"Text3";3.03

When running csvq -d ; "select min(Col3),max(Col3),sum(Col3) from example2"

now I get +-----------+-----------+---------------------+ | MIN(Col3) | MAX(Col3) | SUM(Col3) | +-----------+-----------+---------------------+ | 1.01 | 3.03 | 6.0600000000000005 | +-----------+-----------+---------------------+

Both examples give wrong results. What is happening?

Please advise.

ondohotola commented 1 year ago

Decimals have '.' instead of ','

el

-- Dr. Eberhard W. Lisse \ / Obstetrician & Gynaecologist @. / | Telephone: +264 81 124 6733 (cell) PO Box 8421 Bachbrecht \ / If this email is signed with GPG/PGP 10007, Namibia ;____/ Sect 20 of Act No. 4 of 2019 may apply On 26. Nov 2022 at 20:54 +0200, githubuser181226 **@.***>, wrote:

Hi. I'm sad to report the SUM function doesn't work as expected with floating point values. Consider this simple example.csv: Col1;Col2;Col3 1;"Text1";1,01 2;"Text2";2,02 3;"Text3";3,03 When running csvq -d ; "select min(Col3),max(Col3),sum(Col3) from example" I get: +-----------+-----------+-----------+ | MIN(Col3) | MAX(Col3) | SUM(Col3) | +-----------+-----------+-----------+ | 1,01 | 3,03 | NULL | +-----------+-----------+-----------+ What am I doing wrong? — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you are subscribed to this thread.Message ID: @.***>

kpym commented 1 year ago

1.01 + 2.02 + 3.03 = 6.0600000000000005 is the expected result with floating numbers in IEEE 754. This is the same in (almost) all programing languages. You can check this page for python.

You can use for example round(sum(Col3),10) in place of sum(Col3) if you want a decimal with ten digits precision.

githubuser181226 commented 1 year ago

I understand, but I'm confused. Why MIN and MAX work? This would suggest, MIN and MAX do some string comparisons instead of "math". This would explain why these two work, but not SUM nor AVG is. Am I correct?

I get the floating point precision problem, but I thought it was weird nontheless and I can imagine this getting quite problematic when dealing with any amount of accounting data. Imaging searching for that missing/extra cent in a thousands of records... Is there an option to set some implicit floating point values rounding/precision globally, so this weird result don't happen, just like Excel does it?

mithrandie commented 1 year ago

The aggregate functions MIN and MAX can be used with values that can be compared with < and >. In this case, they are compared as strings. However, the aggregate functions SUM and AVG can only be used with floating-point numbers.

The floating-point precision problem is very troublesome to calculate programmatically. Similar problems occur in Excel as well. The following screenshot shows a comparison of floating-point numbers in Excel.

Values:

A1 and C1 are formulas.

Screen Shot 2022-11-27 at 9 10 57

The value of C1 indicates that A1 and B1 appear to be the same number, but internally they are different numbers. As far as I know, there is no way to completely solve this floating-point problem.

githubuser181226 commented 1 year ago

The aggregate functions MIN and MAX can be used with values that can be compared with < and >. In this case, they are compared as strings. However, the aggregate functions SUM and AVG can only be used with floating-point numbers.

I see. Thanks for the explanation.

The floating-point precision problem is very troublesome to calculate programmatically. Similar problems occur in Excel as well. The following screenshot shows a comparison of floating-point numbers in Excel.

True. I know that, but I can limit the precision for the floating point globally and get rid of the imprecision that way. Przechwytywanie

This is LibreOffice Calc and in the options I've set the limit of fractional digits to 4 and the problem disappeared.

Thus I'm asking if there is a way to do exactly that in csvq/SQL without placing ROUND() everywhere, which is PITA not only to do it, but to remember when and where to do it.

Cheers.

mithrandie commented 1 year ago

There is no way to set the precision globally. Users must explicitly use ROUND() if necessary.

kpym commented 1 year ago

@mithrandie If #91 is considered, perhaps the output precision, like the output number delimiters, can be considered part of the "configuration". This won't solve the problem of internal comparison, as you indicated in the example with Excel, but it may be helpful.

ondohotola commented 1 year ago

Pipe through something like QSV?

In the first instance something like

qsv apply operations currencytoenum columnname

comes to mind.

el

-- Dr. Eberhard W. Lisse \ / Obstetrician & Gynaecologist @. / | Telephone: +264 81 124 6733 (cell) PO Box 8421 Bachbrecht \ / If this email is signed with GPG/PGP 10007, Namibia ;____/ Sect 20 of Act No. 4 of 2019 may apply On 27. Nov 2022 at 08:45 +0200, githubuser181226 **@.***>, wrote:

The aggregate functions MIN and MAX can be used with values that can be compared with < and >. In this case, they are compared as strings. However, the aggregate functions SUM and AVG can only be used with floating-point numbers. I see. Thanks for the explanation. The floating-point precision problem is very troublesome to calculate programmatically. Similar problems occur in Excel as well. The following screenshot shows a comparison of floating-point numbers in Excel. True. I know that, but I can limit the precision for the floating point globally and get rid of the imprecision that way. This is LibreOffice Calc and in the options I've set the limit of fractional digits to 4 and the problem disappeared. Thus I'm asking if there is a way to do exactly that in csvq/SQL without placing ROUND() everywhere, which is PITA not only to do it, but to remember when and where to do it. Cheers. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>