mithrandie / csvq

SQL-like query language for csv
https://mithrandie.github.io/csvq
MIT License

Support insert with multiple rows #98

Closed josejuanmontiel closed 1 year ago

josejuanmontiel commented 1 year ago

It could be interesting in some scenarios to support this extended format:

[WITH common_table_expression [, common_table_expression ...]] INSERT INTO table_name [(column [, column ...])] VALUES (row_value [, row_value ...]), (row_value [, row_value ...])...

for the INSERT clause https://mithrandie.github.io/csvq/reference/insert-query.html#insert-values as a bulk way to speed up inserts... (under the hood the write should be done in one shot... I haven't looked at the code yet :)
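For illustration, a multi-row INSERT in that extended format would look something like this (the table and column names here are made up, not from csvq's docs):

```sql
-- hypothetical example: insert three rows in a single statement
INSERT INTO users (id, name)
VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
```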

josejuanmontiel commented 1 year ago

Hi, I started playing with yacc... But looking at some tests, it seems this kind of insert is already supported:

row_values
    : row_value
    {
        $$ = []QueryExpression{$1}
    }
    | row_value ',' row_values
    {
        $$ = append([]QueryExpression{$1}, $3...)
    }

I decided to run a benchmark to test the speed, and it seems to be about 2x faster inserting 2 rows at a time vs 1?

https://github.com/mithrandie/csvq-driver/pull/7

mithrandie commented 1 year ago

In DBMSs, row-by-row INSERT queries are very slow when inserting large amounts of data due to transaction control. Csvq simply locks the file at insert time, so there may be a slight difference in speed depending on the amount of data, but with a small amount of data, the difference is not that significant.

However, since each commit in csvq writes the changed data to the file, performing a commit for every INSERT query may be very slow depending on the amount of data and the performance of the disk being used.
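To sketch the difference (table and values are hypothetical; csvq does support explicit COMMIT statements per its transaction-management reference):

```sql
-- slow pattern: every COMMIT rewrites the file to disk
INSERT INTO users (id, name) VALUES (1, 'alice');
COMMIT;
INSERT INTO users (id, name) VALUES (2, 'bob');
COMMIT;

-- faster pattern: batch the rows, write the file once
INSERT INTO users (id, name) VALUES (1, 'alice'), (2, 'bob');
COMMIT;
```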

josejuanmontiel commented 1 year ago

This suggestion originally came about because I was playing with sql-jobber (a kind of batch processor that queries a DB) as an exporter of data to CSV (with csvq-driver, a real SQL driver that writes to a CSV file/database)... and doing multi-row inserts in the batch processor instead of row-by-row inserts would speed it up.

Because I didn't see the multi-row option in the docs, I opened this issue, but from the point of view of csvq-driver... it's supported. Running the benchmark:

(benchmark results image)

it seems that an insert with 2 rows is 25% slower than a one-row insert, but it inserts 2 rows :) Because of this, even with the b.N/2 limitation... the fixed per-call overhead means the one-row version ends up running double the iterations. And when csvq is called, it receives an array of rows, locks the file, and inserts them all at once.

And the other benchmark doesn't show a performance difference between doing tx.Commit inside the loop or outside it.

I'll close the issue, since it's supported... I'll keep learning more about this library... and maybe make another suggestion in the future, but for now I think I can go on with my testing.