mdsr-book / mdsr

Complement to CRC Press book *Modern Data Science with R*
39 stars 25 forks source link

SQL code working last week but not this week #75

Closed proback closed 4 months ago

proback commented 4 months ago

This is the code which is being used in a Data Science 2 class at St. Olaf. It ran fine last week but not now. Our best guess is something to do with the DISTINCT inside the COUNT...

SELECT carrier, 
  COUNT(DISTINCT o.tailnum) AS unique_planes,
  AVG(o.year - p.year) AS mean_weighted_age,
  STDDEV_SAMP(o.year - p.year) AS sd_weighted_age
FROM flights AS o
LEFT JOIN planes p ON o.tailnum = p.tailnum
WHERE o.year = 2013 AND (origin = "EWR" OR origin = "JFK" OR origin = "LGA")
GROUP BY carrier
ORDER BY mean_weighted_age ASC

Error: [0] Failed to execute SQL chunk

beanumber commented 4 months ago

Argh. There were some queries that had been running for six days that had eaten up all of the remaining disk space on the server! This query and a few others piled up behind those. Once I killed those, things moved.

I was just able to execute this query in 11 seconds:

MariaDB [airlines]> SELECT carrier,    COUNT(DISTINCT o.tailnum) AS unique_planes,   AVG(o.year - p.year) AS mean_weighted_age,   STDDEV_SAMP(o.year - p.year) AS sd_weighted_age FROM flights AS o LEFT JOIN planes p ON o.tailnum = p.tailnum WHERE o.year = 2013 AND (origin = "EWR" OR origin = "JFK" OR origin = "LGA") GROUP BY carrier ORDER BY mean_weighted_age ASC;
+---------+---------------+-------------------+-----------------+
| carrier | unique_planes | mean_weighted_age | sd_weighted_age |
+---------+---------------+-------------------+-----------------+
| HA      |            14 |            1.5484 |          1.1389 |
| AS      |            84 |            3.3366 |          3.0710 |
| VX      |            53 |            4.4736 |          2.1353 |
| F9      |            26 |            4.8787 |          3.6679 |
| B6      |           193 |            6.6867 |          3.2895 |
| OO      |            28 |            6.8438 |          2.4111 |
| 9E      |           204 |            7.1011 |          2.6696 |
| US      |           290 |            9.1037 |          4.8819 |
| WN      |           583 |            9.1461 |          4.6261 |
| YV      |            58 |            9.3138 |          1.9274 |
| EV      |           316 |           11.3090 |          2.2894 |
| FL      |           129 |           11.3858 |          2.1611 |
| UA      |           621 |           13.2077 |          5.8335 |
| DL      |           629 |           16.3722 |          5.4899 |
| AA      |           601 |           25.8694 |          5.4165 |
| MQ      |           238 |           35.3190 |          3.1329 |
+---------+---------------+-------------------+-----------------+
16 rows in set (11.662 sec)

Sorry for the trouble. Please remind students to kill any queries that take more than few minutes to execute! And you can even do this yourself:

SHOW PROCESSLIST;
KILL <processid>;
nicholasjhorton commented 4 months ago

Ben, many thanks.

Any idea what caused the issue with the stalled queries?

beanumber commented 4 months ago

I didn't look carefully at the queries. But my guess is they were trying to join two (or more) large tables.

My understanding is that when you ask for a lot of data like that, the server starts writing it to temporary tables, which, if they can't be stored in memory, get written to the disk. The RDS instance in question has neither a lot of memory nor a lot of storage space, and so I guess that the temporary tables just ate up all of the available space on disk. Once I killed the queries, the temporary tables evaporated and the disk storage issue was resolved.

proback commented 4 months ago

Ben - thanks so much for looking into this! As someone without lengthy experience in SQL who's having his class query databases through RStudio, what's the easiest way to have them kill a query that is taking too long? Using SHOW PROCESSLIST and KILL ? Or is there an easier way? In a related question, what advice do you give students about how often to use dbDisconnect()? Thanks!

beanumber commented 4 months ago

what's the easiest way to have them kill a query that is taking too long?

It depends on what your interface is. For SQL, I've been using DBeaver, which has a feature that lets you see the process list graphically, and then you can kill queries by clicking on them. As the instructor you can also use this even if the students aren't using it.

I'm not sure exactly what happens when you click on the stop sign in RStudio...

Using SHOW PROCESSLIST and KILL ?

This is the fail safe. It should always work!

In a related question, what advice do you give students about how often to use dbDisconnect()?

None, I'm embarrassed to admit. :(

proback commented 4 months ago

I hate to continue to cause problems, but I'm actually still unable to run my sql chunks. Things spin and spin until I finally have to shut down my session (sounds like the inability to kill queries from inside RStudio is a known issue: https://github.com/rstudio/rstudio/issues/6287). I attached my qmd file in case that provides insight - it had been working fine late last week... Thanks again!

P.S. Whoops - apparently I can't attach a qmd file. I'll send it via email...

proback commented 4 months ago

Here's the file I promised in my latest reply...

On Wed, Apr 24, 2024 at 3:06 PM Ben Baumer @.***> wrote:

what's the easiest way to have them kill a query that is taking too long?

It depends on what your interface is. For SQL, I've been using DBeaver https://dbeaver.io/, which has a feature that lets you see the process list graphically, and then you can kill queries by clicking on them. As the instructor you can also use this even if the students aren't using it.

I'm not sure exactly what happens when you click on the stop sign in RStudio...

Using SHOW PROCESSLIST and KILL ?

This is the fail safe. It should always work!

In a related question, what advice do you give students about how often to use dbDisconnect()?

None, I'm embarrassed to admit. :(

— Reply to this email directly, view it on GitHub https://github.com/mdsr-book/mdsr/issues/75#issuecomment-2075748044, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFCMFWQGLSKA5HFLESVW6WDY7AGEHAVCNFSM6AAAAABGVQZ2PWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANZVG42DQMBUGQ . You are receiving this because you authored the thread.Message ID: @.***>

-- Paul Roback Professor of Statistics and Data Science Chair | Department of Mathematics, Statistics, and Computer Science [image: image.png] 1520 St. Olaf Avenue Northfield, MN 55057 Office: 504 Regents Math (507-786-3861 <(507)%20786-3861>) stolaf.edu

Spring 2024 Office Hours: Mon 9:30-10:30, Tues 2-3, Wed 3-4, Fri 3-4 or email me! Zoom Meeting Room https://stolaf.zoom.us/j/4337554265

beanumber commented 4 months ago
MariaDB [(none)]> SHOW PROCESSLIST;
+-----+-------------+----------------------+----------+---------+------+--------------+------------------------------------------------------------------------------------------------------+----------+
| Id  | User        | Host                 | db       | Command | Time | State        | Info                                                                                                 | Progress |
+-----+-------------+----------------------+----------+---------+------+--------------+------------------------------------------------------------------------------------------------------+----------+
| 104 | mdsr_public | 199.91.180.156:55799 | airlines | Execute | 9046 | Sending data | SELECT
  carrier,
  COUNT(DISTINCT f.tailnum) AS unique_planes,
  AVG(2013 - p.year) AS mean_weighte |    0.000 |
| 126 | mdsr_public | 199.91.180.156:56504 | airlines | Execute | 6914 | Sending data | SELECT
  carrier,
  COUNT(DISTINCT f.tailnum) AS unique_planes,
  AVG(2013 - p.year) AS mean_weighte |    0.000 |
| 307 | mdsr_public | 199.91.180.156:50161 | airlines | Execute | 3188 | Sending data | SELECT
  carrier,
  COUNT(DISTINCT f.tailnum) AS unique_planes,
  AVG(2013 - p.year) AS mean_weighte |    0.000 |
| 311 | mdsr_public | 199.91.180.156:51956 | airlines | Execute | 2140 | Sending data | SELECT
  carrier,
  COUNT(DISTINCT f.tailnum) AS unique_planes,
  AVG(2013 - p.year) AS mean_weighte |    0.000 |
| 331 | mdsr_public | 147.153.249.64:52365 | airlines | Query   |   55 | Sending data | SELECT * FROM flights
WHERE carrier = 'UA' AND month = 7 AND arr_delay > 120
LIMIT 0, 6              |    0.000 |
| 391 | mdsr_public | 190.158.28.81:7291   | airlines | Sleep   |  103 |              | NULL                                                                                                 |    0.000 |
| 392 | mdsr_public | 190.158.28.81:7292   | airlines | Sleep   |  108 |              | NULL                                                                                                 |    0.000 |
| 397 | mdsr_public | 190.158.28.81:7251   | NULL     | Query   |    0 | starting     | SHOW PROCESSLIST                                                                                     |    0.000 |
+-----+-------------+----------------------+----------+---------+------+--------------+------------------------------------------------------------------------------------------------------+----------+
8 rows in set (0.084 sec)

@proback I'm assuming that 199.91.180.156 is you. You've got 4 copies of the same query running, and they keep piling up. I'm guessing this is because of the known issue in RStudio you allude to above.

I don't know why the query didn't run in 11 seconds, as it did for me, but try killing all four of these queries and trying again. If you don't get in answer in about 11 seconds, let us know.

proback commented 4 months ago

I'm having trouble killing the processes. I can see them with PROCESSLIST but KILL or KILL QUERY ID or other options found here (https://mariadb.com/kb/en/kill/) are all producing errors. I wonder if this note below applies to using SQL within RStudio:

Note: You cannot use KILL with the Embedded MariaDB Server library because the embedded server merely runs inside the threads of the host application. It does not create any connection threads of its own.

I apologize for causing confusion - I'm trying to be a good steward of the resource you created, but I'm afraid the inexperience of me and my students may be showing. For a while we were having a blast applying our new SQL knowledge to your airlines database within qmd files, but maybe we're learning limitations of using SQL within RStudio?

nicholasjhorton commented 4 months ago

We're so sorry about the continuing complications.

@beanumber might it be helpful for Paul to test things out with the new server? See https://github.com/mdsr-book/mdsr/commit/b8a8253321d81499c6374c2d3805d5d56299b6d1 for the changes.

The GitHub version of the package can be installed by running

devtools::install_github("mdsr-book/mdsr")
beanumber commented 4 months ago

@proback Check out the following:

bbaumer@pop-os:~$ mariadb -h mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com -p -umdsr_public
Enter password: 
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 2465
Server version: 10.11.6-MariaDB-log managed by https://aws.amazon.com/rds/

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> SHOW PROCESSLIST;
+------+-------------+--------------------------------------+----------+---------+-------+--------------+------------------------------------------------------------------------------------------------------+----------+
| Id   | User        | Host                                 | db       | Command | Time  | State        | Info                                                                                                 | Progress |
+------+-------------+--------------------------------------+----------+---------+-------+--------------+------------------------------------------------------------------------------------------------------+----------+
|  104 | mdsr_public | 199.91.180.156:55799                 | airlines | Execute | 86473 | Sending data | SELECT
  carrier,
  COUNT(DISTINCT f.tailnum) AS unique_planes,
  AVG(2013 - p.year) AS mean_weighte |    0.000 |
|  126 | mdsr_public | 199.91.180.156:56504                 | airlines | Execute | 84341 | Sending data | SELECT
  carrier,
  COUNT(DISTINCT f.tailnum) AS unique_planes,
  AVG(2013 - p.year) AS mean_weighte |    0.000 |
|  307 | mdsr_public | 199.91.180.156:50161                 | airlines | Execute | 80616 | Sending data | SELECT
  carrier,
  COUNT(DISTINCT f.tailnum) AS unique_planes,
  AVG(2013 - p.year) AS mean_weighte |    0.000 |
|  311 | mdsr_public | 199.91.180.156:51956                 | airlines | Execute | 79567 | Sending data | SELECT
  carrier,
  COUNT(DISTINCT f.tailnum) AS unique_planes,
  AVG(2013 - p.year) AS mean_weighte |    0.000 |
| 1866 | mdsr_public | 199.91.180.29:63556                  | airlines | Execute | 22794 | Sending data | SELECT
  carrier,
  COUNT(DISTINCT f.tailnum) AS unique_planes,
  AVG(2013 - p.year) AS mean_weighte |    0.000 |
| 2162 | mdsr_public | 199.91.180.27:59367                  | airlines | Execute | 12115 | Sending data | SELECT
  carrier,
  COUNT(DISTINCT f.tailnum) AS unique_planes,
  AVG(2013 - p.year) AS mean_weighte |    0.000 |
| 2388 | mdsr_public | bb116-14-236-50.singnet.com.sg:56576 | airlines | Sleep   |   287 |              | NULL                                                                                                 |    0.000 |
| 2389 | mdsr_public | bb116-14-236-50.singnet.com.sg:56581 | airlines | Sleep   |   286 |              | NULL                                                                                                 |    0.000 |
| 2460 | mdsr_public | 190.158.28.81:7234                   | airlines | Sleep   |    88 |              | NULL                                                                                                 |    0.000 |
| 2461 | mdsr_public | 190.158.28.81:7237                   | airlines | Sleep   |    89 |              | NULL                                                                                                 |    0.000 |
| 2465 | mdsr_public | 190.158.28.81:16153                  | NULL     | Query   |     0 | starting     | SHOW PROCESSLIST                                                                                     |    0.000 |
+------+-------------+--------------------------------------+----------+---------+-------+--------------+------------------------------------------------------------------------------------------------------+----------+
11 rows in set (0.076 sec)

MariaDB [(none)]> KILL 104;
Query OK, 0 rows affected (0.089 sec)

MariaDB [(none)]> SHOW PROCESSLIST;
+------+-------------+--------------------------------------+----------+---------+-------+--------------+------------------------------------------------------------------------------------------------------+----------+
| Id   | User        | Host                                 | db       | Command | Time  | State        | Info                                                                                                 | Progress |
+------+-------------+--------------------------------------+----------+---------+-------+--------------+------------------------------------------------------------------------------------------------------+----------+
|  126 | mdsr_public | 199.91.180.156:56504                 | airlines | Execute | 84364 | Sending data | SELECT
  carrier,
  COUNT(DISTINCT f.tailnum) AS unique_planes,
  AVG(2013 - p.year) AS mean_weighte |    0.000 |
|  307 | mdsr_public | 199.91.180.156:50161                 | airlines | Execute | 80639 | Sending data | SELECT
  carrier,
  COUNT(DISTINCT f.tailnum) AS unique_planes,
  AVG(2013 - p.year) AS mean_weighte |    0.000 |
|  311 | mdsr_public | 199.91.180.156:51956                 | airlines | Execute | 79590 | Sending data | SELECT
  carrier,
  COUNT(DISTINCT f.tailnum) AS unique_planes,
  AVG(2013 - p.year) AS mean_weighte |    0.000 |
| 1866 | mdsr_public | 199.91.180.29:63556                  | airlines | Execute | 22817 | Sending data | SELECT
  carrier,
  COUNT(DISTINCT f.tailnum) AS unique_planes,
  AVG(2013 - p.year) AS mean_weighte |    0.000 |
| 2162 | mdsr_public | 199.91.180.27:59367                  | airlines | Execute | 12138 | Sending data | SELECT
  carrier,
  COUNT(DISTINCT f.tailnum) AS unique_planes,
  AVG(2013 - p.year) AS mean_weighte |    0.000 |
| 2388 | mdsr_public | bb116-14-236-50.singnet.com.sg:56576 | airlines | Sleep   |   309 |              | NULL                                                                                                 |    0.000 |
| 2389 | mdsr_public | bb116-14-236-50.singnet.com.sg:56581 | airlines | Sleep   |   309 |              | NULL                                                                                                 |    0.000 |
| 2460 | mdsr_public | 190.158.28.81:7234                   | airlines | Sleep   |   111 |              | NULL                                                                                                 |    0.000 |
| 2461 | mdsr_public | 190.158.28.81:7237                   | airlines | Sleep   |   112 |              | NULL                                                                                                 |    0.000 |
| 2465 | mdsr_public | 190.158.28.81:16153                  | NULL     | Query   |     0 | starting     | SHOW PROCESSLIST                                                                                     |    0.000 |
+------+-------------+--------------------------------------+----------+---------+-------+--------------+------------------------------------------------------------------------------------------------------+----------+
10 rows in set (0.080 sec)

MariaDB [(none)]> KILL 126; KILL 307; KILL 311; KILL 1866; KILL 2162;
Query OK, 0 rows affected (0.082 sec)

Query OK, 0 rows affected (0.070 sec)

Query OK, 0 rows affected (0.076 sec)

Query OK, 0 rows affected (0.079 sec)

Query OK, 0 rows affected (0.069 sec)

MariaDB [(none)]> SHOW PROCESSLIST;
+------+-------------+--------------------------------------+----------+---------+------+----------+------------------+----------+
| Id   | User        | Host                                 | db       | Command | Time | State    | Info             | Progress |
+------+-------------+--------------------------------------+----------+---------+------+----------+------------------+----------+
| 2388 | mdsr_public | bb116-14-236-50.singnet.com.sg:56576 | airlines | Sleep   |  352 |          | NULL             |    0.000 |
| 2389 | mdsr_public | bb116-14-236-50.singnet.com.sg:56581 | airlines | Sleep   |  352 |          | NULL             |    0.000 |
| 2460 | mdsr_public | 190.158.28.81:7234                   | airlines | Sleep   |  153 |          | NULL             |    0.000 |
| 2461 | mdsr_public | 190.158.28.81:7237                   | airlines | Sleep   |  155 |          | NULL             |    0.000 |
| 2465 | mdsr_public | 190.158.28.81:16153                  | NULL     | Query   |    0 | starting | SHOW PROCESSLIST |    0.000 |
+------+-------------+--------------------------------------+----------+---------+------+----------+------------------+----------+
5 rows in set (0.086 sec)

MariaDB [(none)]> 

@nicholasjhorton The servers are identical, so other than a difference in usage, you shouldn't see any difference in behavior.

proback commented 4 months ago

It worked! Everything ran quickly and beautifully! Thanks so much!

As I get my 22 students back testing their code, any instructions you'd recommend to make sure we don't clog up the database again?

beanumber commented 4 months ago
  1. Ask them to cancel queries if they can (I guess this is moot if they are only using RStudio)
  2. Keep an eye on things as described above. The time listed is in seconds, so if you see anything in the state "Sending data" greater than 600 (10 minutes), just kill it.
proback commented 4 months ago

Uh oh. One of my students seems to have ground the database to a halt again. I can see her queries in PROCESSLIST, but I haven't yet been able to figure out how to KILL jobs from within RStudio. It looks like she forgot a WHERE statement which would have reduced the scope a bit, but I'm not sure if that's enough to cause the spin cycle. I'm thinking it might be best to abandon this assignment to preserve the database for others.

Just so I understand things better, is it typical that one person can grind a database to a halt, or is it possible to build in safeguards? Makes me think I need to learn and then teach methods for checking and killing spinning jobs at the very beginning of the SQL section...

beanumber commented 4 months ago

is it typical that one person can grind a database to a halt

Not on a properly resourced server. But this is not a properly resourced server.

is it possible to build in safeguards?

Probably, but I am hoping my career as a server administrator is over!

Makes me think I need to learn and then teach methods for checking and killing spinning jobs at the very beginning of the SQL section...

Probably a good idea. This is one reason why I usually teach this unit using a proper SQL editor like DBeaver, rather than RStudio. I teach them to code the way I do -- fiddle with the queries in DBeaver until you get them right, and only then do you copy-and-paste them into your Quarto file.

Also, I try to turn this frustration into a teachable moment. This may be the first time that they have ever run up against the limits of their hardware. If you can channel their frustration into motivation to actually care about query efficiency, you will have achieved something!