moj-analytical-services / Rdbtools

Accessing Athena on the Analytical Platform

temporary tables not temporary #26

Closed ms130 closed 3 months ago

ms130 commented 11 months ago

Hi,

I'd like to create a temporary table, i.e. one that doesn't persist, from a data frame, however this does not currently seem possible out of the box. The closest thing to it I currently do is:

library(Rdbtools)
con <- connect_athena()
dbWriteTable(
    conn = con, 
    value = df, 
    name = "__temp__.temp_table", 
    overwrite = TRUE,
    s3.location = "s3://alpha-my-bucket/"
  )
dbDisconnect(con)

where s3://alpha-my-bucket/ is an S3 bucket I have access to. This adds the table inside a new folder in that bucket, so I would have to delete the table myself to remove it.

Could the function automatically reset s3.location to something like the following when __temp__ appears in name?

user_id <- paws::sts()$get_caller_identity()$UserId
s3_location <- paste("s3://mojap-athena-query-dump", user_id, sep = "/")

Or another solution that creates a temporary table?
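A minimal sketch of what that could look like inside Rdbtools. The helper name and the caller-identity string are hypothetical, not part of the current API:

```r
# Hypothetical helper: derive a per-user S3 staging location when the
# table is being written to the __temp__ schema. The function name and
# the user_id value below are illustrative only.
temp_s3_location <- function(name, user_id, default_location) {
  if (startsWith(name, "__temp__.")) {
    # A per-user prefix keeps one user's temp data out of another's reach
    paste("s3://mojap-athena-query-dump", user_id, sep = "/")
  } else {
    default_location
  }
}

temp_s3_location("__temp__.temp_table", "AROAEXAMPLE:ms130", "s3://alpha-my-bucket/")
# "s3://mojap-athena-query-dump/AROAEXAMPLE:ms130"
```

dbWriteTable could then call something like this internally instead of requiring an explicit s3.location for __temp__ tables.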

ms130 commented 10 months ago

@pjrh-moj any thoughts on this?

pjrh-moj commented 10 months ago

It would be nice to be able to ignore the s3.location path for temporary tables - it is a bit of a faff.

But I think we will end up with permission issues if we're not careful - it would only be safe for Rdbtools to choose a location automatically if the user has access to it but no one else does. Does the mojap-athena-query-dump bucket work like that?

If it does then I think your solution works really well - but we need to avoid accidentally giving people access to data they shouldn't have.

jlandsiedel commented 7 months ago

Hi, I was wondering whether anything else has happened since this issue was opened. I basically have the same goal as @ms130: I want to put a dataframe into a temporary table so I can join it with an existing database. I was hoping to use the dbplyr::memdb_frame() function, but the problem is that this creates the temporary table with RSQLite rather than noctua, so it is not within Athena, and I seem to be unable to change the destination for it. Maybe this will revive this thread - thanks for any thoughts or suggestions!

pjrh-moj commented 7 months ago

Hi @jlandsiedel,

I think you should be able to use dbWriteTable with an Athena connection object as Marco suggests above - the issue is that, for some reason, the new table you write will not be temporary. You can always manually remove the table after use. The only tricky thing is the permissions: you need a database and an s3 location that you can read from and write to.
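A rough sketch of that manual write-then-drop workflow. The connection calls are commented out because they need live Athena credentials; the table and bucket names are placeholders:

```r
# Sketch: write the data frame, use it for the join, then drop it yourself,
# since the table will not expire on its own. Only the SQL helper below is
# plain string work and runs without a connection.

drop_temp_table_sql <- function(name) {
  sprintf("DROP TABLE IF EXISTS %s", name)
}

drop_temp_table_sql("__temp__.join_helper")
# "DROP TABLE IF EXISTS __temp__.join_helper"

# With a live connection (not run here):
# con <- Rdbtools::connect_athena()
# DBI::dbWriteTable(conn = con, value = df, name = "__temp__.join_helper",
#                   overwrite = TRUE, s3.location = "s3://alpha-my-bucket/")
# ...join against __temp__.join_helper in your query...
# DBI::dbExecute(con, drop_temp_table_sql("__temp__.join_helper"))
# DBI::dbDisconnect(con)
```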

Sorry if I have missed an important detail! Or else, if you are getting errors can you post them here?

Thanks, Peter

ms130 commented 7 months ago

From what I remember, pydbtools does let you write non-persisting temporary tables to a secure location others can't access. So perhaps the same approach used there could also be applied here to select a suitable destination for temporary tables?

I'm not that familiar with the details of how Rdbtools/noctua works, but perhaps @pjrh-moj or @mratford would have some more insight on this?

pjrh-moj commented 7 months ago

It is a funny permission issue, but Mike might be able to shed more light. For instance, I can create temporary tables in a scratch area with the following:

ath <- Rdbtools::connect_athena()
Rdbtools::dbExecute(ath, "CREATE TABLE __temp__.temp_test AS SELECT * FROM db.table LIMIT 10")
temp_local <- Rdbtools::dbGetQuery(ath, "SELECT * FROM __temp__.temp_test")

This copies the syntax from pydbtools, and writes to a subfolder of s3://mojap-athena-query-dump as expected when creating a temporary table from within Athena - so the permissions are fine for that. But when I attempt to use dbWriteTable to write to the same place in s3://mojap-athena-query-dump, I get access denied.

@ms130 - is there anything Rdbtools is preventing you from doing completely, or is it just that you need to write more steps to get it done because there isn't a default temporary location that works?

ms130 commented 7 months ago

@ms130 - is there anything Rdbtools is preventing you from doing completely, or is it just that you need to write more steps to get it done because there isn't a default temporary location that works?

For me it's the latter. I wouldn't say that I'm prevented from doing something I need to.

mratford commented 7 months ago

There's a daily airflow task that cleans up temporary database tables that were created over a day ago - https://github.com/moj-analytical-services/airflow-clean-temp-dbs/blob/master/clean.py. This looks for any database that begins mojap_de_temp_, which is the temp database naming convention used by both pydbtools and Rdbtools.

I'm not immediately sure why dbWriteTable has that permissions problem, that would need some investigation, but in case it helps, pydbtools has a function that writes a dataframe to a temporary table - https://github.com/moj-analytical-services/pydbtools/blob/526c2eb4942143c6ae5da373a6c4893128e24e8a/pydbtools/_wrangler.py#L600-L620
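The naming convention the cleanup task relies on is just a prefix match. In R terms, roughly (the real task is Python, in airflow-clean-temp-dbs; the database names below are made up):

```r
# Sketch of the prefix convention the daily cleanup task looks for:
# any database whose name begins with "mojap_de_temp_" is fair game
# for deletion once it is over a day old.
is_temp_db <- function(db_name) {
  startsWith(db_name, "mojap_de_temp_")
}

dbs <- c("mojap_de_temp_ms130", "alpha_prod_db", "mojap_de_temp_jl")
dbs[is_temp_db(dbs)]
# c("mojap_de_temp_ms130", "mojap_de_temp_jl")
```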

jlandsiedel commented 7 months ago

Hi @pjrh-moj ,

Thanks, yes the dbWriteTable solution is working for me but it feels rather clunky.

The real issue is that dbplyr::memdb_frame, which in theory should create a temporary table, does not work in conjunction with noctua. It needs the R package RSQLite and will create the temporary table there without any issue. The problem then is that the tables are in different locations: one within Athena, and the temporary one in SQLite. In theory dplyr has a copy_to function which should allow bringing things into the same space, but this fails with an error for me.

I am not sure whether this would still fall within the scope of this thread / the scope of Rdbtools more generally though, so happy to go with the dbWriteTable solution.

jlandsiedel commented 7 months ago

Just to add this is the error I am getting:

dbplyr::memdb_frame(iris)
Error in (function (cond) :
  error in evaluating the argument 'drv' in selecting a method for function 'dbConnect': there is no package called ‘RSQLite’


I thought dbConnect was part of DBI/noctua, which is why I picked up this thread. I can of course install RSQLite, but that leads to the other issue described above.

pjrh-moj commented 7 months ago

Thanks, that's useful. On the face of it then it looks like pydbtools is doing the same thing - I have reached the limit of my AWS knowledge I am afraid...

pjrh-moj commented 7 months ago

However, there is one bit of convenience which I did build in to Rdbtools to make it possible to use the dplyr way of working, you just have to set the default schema and staging dir when you make the connection object. For instance this works:

df <- dplyr::tibble(a = c(1, 2, 3), b = c("alpha", "beta", "gamma"))

ath_con <- Rdbtools::connect_athena(staging_dir = "s3://bucket_with_write_permission/temp", schema_name = "__temp__")
df_db <- dplyr::copy_to(ath_con, df, "dplyr_test")

If you can wrangle the other data you need into the __temp__ schema then you can do all the dplyr joins etc. you want using the df_db object. Or else you can still access the data with the DBI methods too (e.g. Rdbtools::dbGetQuery(ath_con, "SELECT * FROM __temp__.dplyr_test")).

That still requires you to specify your own bucket for staging, but simplifies the rest of it too. Hopefully that gets us half way there, and I am a big fan of dbplyr so let me know if there are still gaps!
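For example, once the frame is in the __temp__ schema you can join against it from SQL, or stay in dplyr. All table and column names here are hypothetical, and the calls that need the live connection from the snippet above are commented out:

```r
# Building the join query is plain string work; running it needs a
# connection. "mydb.main_table" and the column names are made up.
temp_table <- "dplyr_test"
query <- sprintf(
  "SELECT t.*, d.b FROM mydb.main_table AS t
   JOIN __temp__.%s AS d ON t.id = d.a",
  temp_table
)

# With a live connection (not run here):
# joined <- Rdbtools::dbGetQuery(ath_con, query)
# or, staying in dplyr:
# main_tbl <- dplyr::tbl(ath_con, dbplyr::in_schema("mydb", "main_table"))
# joined   <- dplyr::inner_join(main_tbl, df_db, by = c("id" = "a"))
```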

jlandsiedel commented 7 months ago

Thanks so much @pjrh-moj . I had set up my Athena connection with __temp__ and the staging directory but clearly had gotten the syntax for copy_to wrong. I've used your example and this works perfectly for me now so that solves all the issues I had and will save me from using dbWriteTable.

pjrh-moj commented 7 months ago

Great! If that works for you, and after you've had a few days to iron out any further issues, then I would be really grateful if you could look back over the Rdbtools README, or function documentation, to make it clearer for anyone else who wants to do the same as you. I think what you are describing is a useful way of working, and I want to make sure it is clear to anyone coming to this how to do it.

Also, using the dplyr commands on Athena is one of the principal motivations for Rdbtools, so I'm keen to make it as easy as possible.

jlandsiedel commented 7 months ago

Hi Peter, Sure, will do. It would probably be good to have some sample code in the README that showcases how to convert local data frames to temporary tables using the different methods discussed in this thread but I'll take a closer look later on.

jlandsiedel commented 7 months ago

I'm afraid the code gods had other ideas than this going smoothly after all. I only briefly tested the copy_to earlier with the iris dataset and it worked fine, but I am now getting an error when using it on my own data. The traceback is so long it needs two screenshots (attached).

Any ideas what the error means?

pjrh-moj commented 7 months ago

It looks like it doesn't think cohort_data is the right data type or something - you've clearly already spotted that...

Do you get the same thing if you try to copy_to a new dataframe which is just the first 10 rows or something? It looks like it is batching up large datasets and maybe something is going wrong there?

As a wildcard option you could try adding the file.type = "parquet" as an extra argument to copy_to. I don't have a reason why this would be better, but Athena does seem to complain slightly less with parquet than csv in my experience.

jlandsiedel commented 7 months ago

Hi Peter, Apologies for my slow reply, I was in training most of last week and off on Friday. Strangely enough the copy_to seems to work for me today with the same data and no changes to the code. The only difference being (from what I can see) that I'm on a different branch of my repo. I'll test and investigate a bit more and let you know how things go.

pjrh-moj commented 3 months ago

I'm closing this because of a lack of activity, and the conversation moved on from the initial issue.

Do open a new issue if there is anything unresolved which you want to hold as an issue on this repo.