jakegross808 / pacn-veg-package

all pacn veg code
Creative Commons Zero v1.0 Universal

understory collect() takes forever, even with filter #16

Closed jakegross808 closed 2 years ago

jakegross808 commented 2 years ago

I wonder if it's a problem with the joins? Lines 368-373 in utils.R.

wright13 commented 2 years ago

Joins seem fine to me! I dug into this a bit further and it seems like talking to the database is the bottleneck. Any time you pull millions of rows of data from the database over the network, it's just going to be unavoidably slow. I did discover that reading the data in seems to be by far the most time-consuming step; R operates on the data very speedily after that. That leaves us with a few options:

  1. Read full dataset into R, then cache locally
    • Pros: minimize reading from db, works well whether data are stored in csv or in db
    • Cons: the initial read from the db is still slow. Eventually the data will be too large to work with in memory
  2. Create table connections and do filtering and summaries on the database side before reading data in (rough sketch below)
    • Pros: will still work when raw dataset is too large to fit into R's memory as long as filtered/summarized dataset is small enough
    • Cons: pulling data over the network from the database is never going to be fast, and there may be better large data solutions by the time we reach that point. Network connection required.
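
For reference, option 2 would look something like the sketch below. This is just an illustration: the table name is approximate and `con` stands in for whatever DBI connection the package already opens.

```r
# Rough sketch of option 2 -- table/column names are approximate and `con`
# stands in for whatever DBI connection the package already opens
library(dplyr)
library(dbplyr)

understory_tbl <- tbl(con, "Understory_Cover")  # lazy reference, nothing pulled yet

understory_tbl %>%
  filter(Unit_Code == "AMME") %>%   # translated to SQL and run on the server
  group_by(Event_ID) %>%
  summarize(n_points = n()) %>%     # summarized on the server too
  collect()                         # only the small result crosses the network
```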

I'm actually leaning towards option 1 now. Thoughts? It wouldn't be a huge ordeal to switch methods in the future.

wright13 commented 2 years ago

...although it does take a LONG time to load the full dataset from the database

jakegross808 commented 2 years ago

Did you get it to speed up at all by using filtered Understory data? I tried inserting this line: `filter(Unit_Code == "AMME") %>%`

But it still seems to be just as slow... That's what made me wonder if there is something else going on. Next I might try to filter down to just a few records before collecting. My hunch is it will still be slow....

jakegross808 commented 2 years ago

Maybe the issue is that dplyr has to right_join ALL records by event before it can filter by event...
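
Something like this is what I'm picturing, just as a sketch (the table names are guesses, not the real ones in utils.R): filter each lazy table down to one unit first so the database only joins the rows we actually need.

```r
# Hypothetical reordering (not the actual utils.R code): filter both lazy
# tables before the right_join so the join runs on a much smaller set
understory_small <- tbl(con, "Understory_Cover") %>%
  filter(Unit_Code == "AMME")

events_small <- tbl(con, "Events") %>%
  filter(Unit_Code == "AMME")

understory_small %>%
  right_join(events_small, by = "Event_ID") %>%
  collect()
```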

wright13 commented 2 years ago

Filtering on unit code prior to joining will probably help, but the query returns about 3 million rows so it'll take time for the database to perform the joins and it'll take time to move the data across the network. It still seems to be faster than fetching the full dataset, but yeah it's definitely slooooow. On a hunch I just tried reading understory cover and the high and low xref tables in first, then doing the joins in R, and it is significantly faster. The downside is that the full dataset does take up a lot of memory. I'm playing around with a couple options right now, will keep you posted...
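
For the record, what I tried is roughly this (table names are approximate, and the join direction is just illustrative):

```r
# Pull the three tables individually, then let dplyr do the joins in memory.
# Table names are approximate, and this assumes everything fits in RAM.
cover <- tbl(con, "Understory_Cover") %>% collect()
high  <- tbl(con, "xref_Understory_High") %>% collect()
low   <- tbl(con, "xref_Understory_Low") %>% collect()

understory <- cover %>%
  left_join(high, by = "Point_ID") %>%
  left_join(low, by = "Point_ID")
```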

wright13 commented 2 years ago

Okay, I'm back around to thinking that we should stick with the original plan of using dbplyr to do as much on the SQL side as possible before loading the (hopefully filtered and summarized) dataset into R. 🤦 I went down a lot of Google rabbit holes and realized that I really need to take a class on big data, but for the moment I think this is our best option. I did learn that one of the popular big data packages for R (sparklyr) also uses dbplyr, so if we ever get IT approval for Spark then that could be a good option in the future since it wouldn't involve many changes to the code.
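
Just to illustrate why the switch would be small: sparklyr plugs into the same dplyr verbs, so a pipeline against Spark would look almost identical to the dbplyr version. This is purely illustrative, since we don't have Spark or a csv export set up.

```r
# Purely illustrative -- assumes a local Spark install and a csv copy of the
# understory data, neither of which we currently have
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
understory_spk <- spark_read_csv(sc, name = "understory", path = "understory_cache.csv")

understory_spk %>%
  filter(Unit_Code == "AMME") %>%
  group_by(Event_ID) %>%
  summarize(n_points = n()) %>%
  collect()
```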

jakegross808 commented 2 years ago

Sounds good to me. We will still have something like a cached version of the collect() data to work with, right? So we won't need to hit the database and re-download every time a small change is made to an R Markdown, for instance?

wright13 commented 2 years ago

Yeah I'm currently trying to figure out how to make that work. R runs out of memory when trying to write the understory data to a local file, so I'm working on finding a way to do it in smaller pieces.
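
One approach I'm testing is to pull and write one park unit at a time, something like the sketch below (table, column, and file names are placeholders, not the final implementation):

```r
# Sketch of a chunked cache write -- table, column, and file names are
# placeholders, not the final implementation
library(dplyr)
library(readr)

cache_file <- "understory_cache.csv"
if (file.exists(cache_file)) file.remove(cache_file)  # start from a clean cache

units <- tbl(con, "Understory_Cover") %>%
  distinct(Unit_Code) %>%
  collect() %>%
  pull(Unit_Code)

for (u in units) {
  chunk <- tbl(con, "Understory_Cover") %>%
    filter(Unit_Code == u) %>%
    collect()
  # first chunk writes the header, later chunks append below it
  write_csv(chunk, cache_file, append = file.exists(cache_file))
}
```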

jakegross808 commented 2 years ago

Ok. Thank you! So glad to have your help with this! Another possibility, just throwing it out there: is it possible to have local files that only get updated with new changes from the database (so the whole thing doesn't need to update each time, just the new data), kind of like a sync? I have no idea if that's even possible.

wright13 commented 2 years ago

Oh that's an interesting thought. I can't think of a good way to do it in R without having to read all the data in anyway, since it'd have to check for updated (as opposed to new) rows of data. But one possible option would be to look into whether you can set up a SQL Server instance on your local machine that stays in sync with the main db. That way you wouldn't have to pull data over the network all the time, which is the really slow part. That said, that's not my area of expertise so I'm not sure if it'd even be possible in this particular scenario. I can ask around though and see if anyone else is doing something similar.

wright13 commented 2 years ago

Ok it sounds like setting up a local SQL Server database on your computer that syncs from the Seattle server may in fact be our best bet. Should we bring Kelly into this conversation? If we go this route, we should decide whether it still makes sense to provide a csv import/export option (I think yes?)

wright13 commented 2 years ago

....oh wowww I just realized that our queries were wrong, we should have been joining understory cover to the xref tables on Event_ID in addition to Point_ID. That makes the size of the data completely manageable. I can't believe it took me until now to catch that! 🤦
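
Concretely, that just means adding Event_ID to the join keys, e.g. with the placeholder tables from the earlier sketch:

```r
# Corrected join keys -- `cover`, `high`, and `low` are the placeholder tables
# from the earlier sketch, not the actual objects in utils.R
library(dplyr)

understory <- cover %>%
  left_join(high, by = c("Event_ID", "Point_ID")) %>%
  left_join(low,  by = c("Event_ID", "Point_ID"))
```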