r-dbi / RMariaDB

An R interface to MariaDB
https://rmariadb.r-dbi.org
Other
130 stars 40 forks source link

Using RMariaDB with Columnstore Tables #183

Open Mosk915 opened 3 years ago

Mosk915 commented 3 years ago

I am trying to use the RMariaDB package to query MariaDB columnstore tables. Queries that should take seconds are taking much longer.

For example a simple query like dbGetQuery(conn, "select count(*) from example_table"), where example_table has about 300 million rows, takes under a second using other IDEs or using the odbc R package. But when using RMariaDB it takes much longer.

Is there some setting or option that needs to be changed to make the RMariaDB package perform optimally with columnstore tables?

krlmlr commented 3 years ago

Thanks, this is really weird (and doesn't look like a duplicate of #131). Is this still a problem? Is the client doing any work (=full load on one or more CPU cores)? How can I replicate this?

Mosk915 commented 3 years ago

Yes, still a problem. Our DBAs took a look at this and noticed that when queries are executed using RMariaDB, it is not multi-threading. They ran the vmstat command on the MariaDB host and saw that the CPU idle time only dips slightly below 100. When the queries are executed using the odbc package or through some other IDE, the idle time dips into the single digits.

As far as how to replicate this, I'm not sure. When you run a select count(*) on a columnstore table with several hundred million rows, does the result come back within seconds or take several minutes? We use RStudio Server Pro, but I've also tested this using RStudio on my local Windows machine and I have the same issue, which is why we are thinking it has something to do with the RMariaDB package. If you don't have the same issue, I'm hoping it's just a matter of us setting some configuration that allows the queries to multi-thread when they are executed.

krlmlr commented 3 years ago

IIUC, the database host has multiple CPUs, and only one is utilized by SELECT COUNT(*) when issued from RMariaDB.

According to https://mariadb.com/kb/en/mariadb-columnstore-performance-concepts/, SELECT COUNT(*) performs a scan over one column. This explains why multiple processors work faster on the task.

One thing RMariaDB does differently is the use of parametrized queries throughout. I have no evidence that this might be the actual reason, it's worth a try. Would you be able to run a source installation when #147 is implemented?

Mosk915 commented 3 years ago

Yes I can run a source installation. Just let me know when.

Just to clarify, select count(*) was just a simple example of a query that is slower when using RmariaDB, but it's really any query run against columnstore tables that is slower. In order to reproduce an example using a more complicated query, you'd need tables of a sufficiently large size so that the query doesn't complete instantaneously. I'm not sure how to go about providing a reproducible example like that, but in the event you find a solution for select count(*) queries but not all queries in general, I can look into it more at that time.

krlmlr commented 3 years ago

Blocked by #147 which I had to postpone.