Closed: ms130 closed this 3 months ago
@pjrh-moj any thoughts on this?
It would be nice to be able to ignore the `s3.location` path for temporary tables - it is a bit of a faff.
But I think we will end up with permission issues if we're not careful - it would only be safe for `Rdbtools` to automatically choose a location if the user had access to it but no one else did. Does the mojap-athena-query-dump bucket work like that?
If it does then I think your solution works really well - but we need to avoid accidentally giving people access to data they shouldn't have.
Hi,
I was wondering whether anything else has happened since this issue was opened. I basically have the same goal as @ms130: putting a dataframe into a temporary table so that I can use it for a join with an existing database. I was hoping to use the `dbplyr::memdb_frame()` function, but the problem is that this creates the temporary table with `RSQLite`, not `noctua`, so it is not within Athena, and I seem to be unable to change the destination for it. Maybe this will revive this thread - thanks for any thoughts or suggestions!
Hi @jlandsiedel,
I think you should be able to use `dbWriteTable` with an Athena connection object as Marco suggests above - the issue is that, for some reason, the new table you write will not be temporary. You can always manually remove the table after use. The only tricky thing is the permissions: you need a database, and an S3 location you can read/write to, to point it at.
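A minimal sketch of that workaround, assuming `Rdbtools` re-exports the DBI generics as in the snippets later in this thread - the database, table, and bucket names below are placeholders, not real locations:

```r
# Sketch only - replace the database, table, and bucket names with ones
# you have permission to use.
ath <- Rdbtools::connect_athena()

# Write the dataframe as a (non-temporary) Athena table backed by S3
Rdbtools::dbWriteTable(
  ath, "my_database.my_table", my_dataframe,
  s3.location = "s3://alpha-my-bucket/my_table/"
)

# ... run your joins against my_database.my_table ...

# The table persists, so drop it yourself when you are done
Rdbtools::dbExecute(ath, "DROP TABLE my_database.my_table")
```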
Sorry if I have missed an important detail! Or else, if you are getting errors, can you post them here?
Thanks, Peter
From what I remember, `pydbtools` does let you write non-persisting temporary tables to a secure location others can't access. So perhaps the same approach used there could also be applied here to select a suitable destination for temporary tables?
I'm not that familiar with the details of how `Rdbtools`/`noctua` works, but perhaps @pjrh-moj or @mratford would have some more insight on this?
It is a funny permission issue, but Mike might be able to shed more light. For instance, I can create temporary tables in a scratch area with the following:

```r
ath <- Rdbtools::connect_athena()
Rdbtools::dbExecute(ath, "CREATE TABLE __temp__.temp_test AS SELECT * FROM db.table LIMIT 10")
temp_local <- Rdbtools::dbGetQuery(ath, "SELECT * FROM __temp__.temp_test")
```

This copies the syntax from `pydbtools`, and writes to a subfolder of `s3://mojap-athena-query-dump` as expected when creating a temporary table from within Athena - so the permissions are fine for that. But when I attempt to use `dbWriteTable` to write to the same place in `s3://mojap-athena-query-dump`, I get access denied.
@ms130 - is there anything `Rdbtools` is preventing you from doing completely, or is it just that you need more steps to get it done because there isn't a default temporary location that works?
> @ms130 - is there anything `Rdbtools` is preventing you from doing completely, or is it just that you need more steps to get it done because there isn't a default temporary location that works?
For me it's the latter. I wouldn't say that I'm prevented from doing something I need to.
There's a daily Airflow task that cleans up temporary database tables that were created over a day ago - https://github.com/moj-analytical-services/airflow-clean-temp-dbs/blob/master/clean.py. This looks for any database that begins `mojap_de_temp_`, which is the temp database naming convention used by both `pydbtools` and `Rdbtools`.
I'm not immediately sure why `dbWriteTable` has that permissions problem - that would need some investigation - but in case it helps, `pydbtools` has a function that writes a dataframe to a temporary table: https://github.com/moj-analytical-services/pydbtools/blob/526c2eb4942143c6ae5da373a6c4893128e24e8a/pydbtools/_wrangler.py#L600-L620
Hi @pjrh-moj,
Thanks, yes the `dbWriteTable` solution is working for me, but it feels rather clunky.
The real issue is that `dbplyr::memdb_frame`, which in theory should create a temporary table, does not work in conjunction with `noctua`. It needs the R package `RSQLite` and will create the temporary table there without any issue. The issue then is that the tables are in different locations: one within Athena, and the temporary one in SQLite. In theory `dplyr` has a `copy_to` function which should allow bringing things into the same space, but this fails with an error for me.
I am not sure whether this would still fall within the scope of this thread / the scope of `Rdbtools` more generally though, so happy to go with the `dbWriteTable` solution.
Just to add, this is the error I am getting:

```r
dbplyr::memdb_frame(iris)
#> Error in (function (cond) :
#>   error in evaluating the argument 'drv' in selecting a method for function
#>   'dbConnect': there is no package called ‘RSQLite’
```
I thought `dbConnect` was part of `DBI`/`noctua`, hence why I picked up this thread. I can of course install `RSQLite`, but that leads to the other issue described above.
Thanks, that's useful. On the face of it then it looks like `pydbtools` is doing the same thing - I have reached the limit of my AWS knowledge, I am afraid...
However, there is one bit of convenience which I did build into `Rdbtools` to make it possible to use the `dplyr` way of working: you just have to set the default schema and staging dir when you make the connection object. For instance, this works:
```r
df <- tibble(a = c(1, 2, 3), b = c("alpha", "beta", "gamma"))
ath_con <- Rdbtools::connect_athena(staging_dir = "s3://bucket_with_write_permission/temp", schema_name = "__temp__")
df_db <- dplyr::copy_to(ath_con, df, "dplyr_test")
```
If you can wrangle the other data you need into the `__temp__` schema then you can do all the `dplyr` joins etc. you want using the `df_db` object. Or else you can still access the data with the `DBI` methods too (e.g. `Rdbtools::dbGetQuery(ath_con, "SELECT * FROM __temp__.dplyr_test")`).
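To illustrate the `dplyr` way of working against the copied frame, a join with another table in the same schema might look something like this - the second table name and the join column are made up for illustration:

```r
library(dplyr)

# Lazy references - no data is pulled until collect()
copied <- tbl(ath_con, "dplyr_test")
other  <- tbl(ath_con, "some_other_temp_table")  # hypothetical table in __temp__

result <- other %>%
  inner_join(copied, by = "a") %>%  # dbplyr translates this to Athena SQL
  collect()                         # execute the query and pull results into R
```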
That still requires you to specify your own bucket for staging, but it simplifies the rest. Hopefully that gets us half way there, and I am a big fan of `dbplyr`, so let me know if there are still gaps!
Thanks so much @pjrh-moj. I had set up my Athena connection with `__temp__` and the staging directory, but had clearly gotten the syntax for `copy_to` wrong. I've used your example and this works perfectly for me now, so that solves all the issues I had and will save me from using `dbWriteTable`.
Great! If that works for you, and after you've had a few days to iron out any further issues, then I would be really grateful if you could look back over the `Rdbtools` README, or the function documentation, to make it clearer for anyone else who wants to do the same as you. I think what you are describing is a useful way of working, and I want to make sure it is clear to anyone coming to this how to do it.
Also, using the `dplyr` commands on Athena is one of the principal motivations for `Rdbtools`, so I'm keen to make it as easy as possible.
Hi Peter,
Sure, will do. It would probably be good to have some sample code in the README that showcases how to convert local data frames to temporary tables using the different methods discussed in this thread, but I'll take a closer look later on.
I'm afraid the code gods had other ideas than this going smoothly after all. I only briefly tested the `copy_to` earlier with the `iris` dataset and it worked fine, but I am now getting an error when using it on my own data. The traceback is so long it needs two screenshots.
Any ideas what the error means?
It looks like it doesn't think `cohort_data` is the right data type or something - you've clearly already spotted that...
Do you get the same thing if you try to `copy_to` a new dataframe which is just the first 10 rows or something? It looks like it is batching up large datasets, and maybe something is going wrong there?
As a wildcard option you could try adding `file.type = "parquet"` as an extra argument to `copy_to`. I don't have a reason why this would be better, but Athena does seem to complain slightly less with parquet than csv in my experience.
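Concretely, that wildcard option would just be the following - assuming the same connection object as in the earlier snippet, and that `copy_to` passes the extra argument through to `noctua`'s `dbWriteTable`:

```r
# Hypothetical retry of the failing copy, staging the data as parquet
# rather than the default file type
df_db <- dplyr::copy_to(ath_con, cohort_data, "cohort_data", file.type = "parquet")
```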
Hi Peter,
Apologies for my slow reply, I was in training most of last week and off on Friday. Strangely enough, the `copy_to` seems to work for me today with the same data and no changes to the code. The only difference being (from what I can see) that I'm on a different branch of my repo. I'll test and investigate a bit more and let you know how things go.
I'm closing this because of a lack of activity, and the conversation moved on from the initial issue.
Do open a new issue if there is anything unresolved which you want to hold as an issue on this repo.
Hi,
I'd like to create a temporary table, i.e. one that doesn't persist, from a data frame; however, this does not currently seem possible out of the box. The closest thing to it I currently do is:
where `s3://alpha-my-bucket/` is an S3 bucket I have access to. This adds the table inside a new folder in that bucket, so I would have to delete the table myself to remove it.
Could the function reset the `s3.location` to something like the following automatically, when `__temp__` appears in `name`?
Or another solution that creates a temporary table?