paws-r / paws

Paws, a package for Amazon Web Services in R
https://www.paws-r-sdk.com

is it possible to list delete markers with s3$list_object_versions? #791

Open FMKerckhof opened 6 days ago

FMKerckhof commented 6 days ago

Hi,

I am trying to remove delete markers from a list of objects (i.e. restore them to the last non-deleted version). I thought it would be possible by using s3$list_object_versions and then calling s3$delete_object with those VersionIds specified. Sadly, I cannot figure out how to get the delete markers returned alongside the object versions.

Is there any way to go around this with paws::s3() ?

I checked https://github.com/paws-r/paws/issues/471 but I don't see how it would help me.

Thanks in advance for any hints.

Kind regards,

FM

DyfanJones commented 6 days ago

Hi @FMKerckhof, do you have a code example of this issue to help me get a better idea of what you are trying to achieve? :)

FMKerckhof commented 6 days ago

Apologies @DyfanJones , indeed that was a poor reprex, it's just not easy with permissions for S3 etc.

I will try to outline my workflow below.

# load required packages -----
library(paws)

# set env vars ----
# here, I set my AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION and AWS_ACCESS_KEY_ID

# load objects data_frame ----

removed_objects <- read.csv("removed_objects.csv")

# removed_objects is a 1-column data frame that contains in every row a key of an object that was deleted before
#  I used aws.s3::delete_object() for this, as I am only recently transitioning to paws
#  the bucket that the objects were in is versioned
# the objects in the dataframe have several versions, and the most recent "version" is the delete marker

# create s3 connector ------
s3 <- paws::s3()

bucket <- "<my-versioned-bucket>"

objectversions <- s3$list_object_versions(Bucket = bucket,
                                  Prefix = "<specific-prefix>")

# what I had hoped from here is that objectversions would have the delete marker in the "latest" version - but
# when I run the code below, the delete markers are not shown, and none of the versions is latest:

sapply(objectversions$Versions, function(x) x$IsLatest)

# output: FALSE FALSE FALSE FALSE  ... (for the length of the list)

So, my questions are:

  1. Is it possible to return the VersionIds of object delete markers to programmatically remove them using s3$delete_object?
  2. Am I going about this the right way, or is there an easier way in paws s3 to remove all delete markers for a given prefix?

FMKerckhof commented 6 days ago

[Screenshot: versionsandmarkers]

FMKerckhof commented 6 days ago

The corresponding AWS CLI s3api command would be aws s3api list-object-versions --bucket DOC-EXAMPLE-BUCKET --prefix examplefolder/ --output json --query 'DeleteMarkers[?IsLatest==true].[Key, VersionId]' | jq -r '.[] | "--key " + "'\\\"'" + .[0] + "'\\\"'" + " --version-id " + .[1]' | xargs -L1 -t aws s3api delete-object --bucket DOC-EXAMPLE-BUCKET - from: https://repost.aws/knowledge-center/s3-undelete-configuration
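
For comparison, the filter in the CLI query above (`DeleteMarkers[?IsLatest==true]`) could be sketched in R roughly as follows. This is a minimal sketch, not a tested paws recipe: `latest_delete_markers` is a hypothetical helper, and `mock_resp` is a hand-written stand-in shaped like a `list_object_versions` response, containing only the fields used here.

```r
# Hypothetical helper: keep only delete markers that are the latest version
# of their key, and return them as Key/VersionId pairs.
latest_delete_markers <- function(resp) {
  markers <- Filter(function(m) isTRUE(m$IsLatest), resp$DeleteMarkers)
  lapply(markers, function(m) list(Key = m$Key, VersionId = m$VersionId))
}

# Hand-written stand-in for a list_object_versions() response (assumption).
mock_resp <- list(
  Versions = list(
    list(Key = "a.txt", VersionId = "v1", IsLatest = FALSE)
  ),
  DeleteMarkers = list(
    list(Key = "a.txt", VersionId = "d2", IsLatest = TRUE),
    list(Key = "b.txt", VersionId = "d1", IsLatest = FALSE)
  )
)

to_delete <- latest_delete_markers(mock_resp)
str(to_delete)
```

Each resulting pair could then be passed to `s3$delete_object(Bucket = bucket, Key = x$Key, VersionId = x$VersionId)`; deleting a delete marker by its version ID is what makes the previous version current again.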

DyfanJones commented 6 days ago

Oh I see, would the following example help at all :)

library(paws)

cfg = s3(config(credentials(profile = "paws")))

bucket = "paws-version"
key = "made-up"
resp <- cfg$put_object(
  Bucket=bucket,
  Key=key,
  Body=charToRaw("hello world")
)

Sys.sleep(2)
resp <- cfg$list_object_versions(
  Bucket = bucket, Prefix = key
)
resp$Versions |> tibblify::tibblify()
#> The spec contains 1 unspecified field:
#> • ChecksumAlgorithm
#> # A tibble: 7 × 10
#>   ETag             ChecksumAlgorithm  Size StorageClass Key   VersionId IsLatest
#>   <chr>            <list>            <dbl> <chr>        <chr> <chr>     <lgl>   
#> 1 "\"5eb63bbbe01e… <list [0]>           11 STANDARD     made… H_LHXW2x… TRUE    
#> 2 "\"5eb63bbbe01e… <list [0]>           11 STANDARD     made… 3qe20R36… FALSE   
#> 3 "\"5eb63bbbe01e… <list [0]>           11 STANDARD     made… GvooIndy… FALSE   
#> 4 "\"5eb63bbbe01e… <list [0]>           11 STANDARD     made… Rma4F3KT… FALSE   
#> 5 "\"5eb63bbbe01e… <list [0]>           11 STANDARD     made… DgZPxM1c… FALSE   
#> 6 "\"5eb63bbbe01e… <list [0]>           11 STANDARD     made… HDda3wDD… FALSE   
#> 7 "\"5eb63bbbe01e… <list [0]>           11 STANDARD     made… YPH2FjWf… FALSE   
#> # ℹ 3 more variables: LastModified <dttm>, Owner <tibble[,2]>,
#> #   RestoreStatus <tibble[,2]>

resp$DeleteMarkers |> tibblify::tibblify()
#> # A tibble: 2 × 5
#>   Owner$DisplayName $ID             Key   VersionId IsLatest LastModified       
#>   <chr>             <chr>           <chr> <chr>     <lgl>    <dttm>             
#> 1 dyfan.r.jones     b3f6bed018b0c5… made… QeNOp9Jf… FALSE    2024-07-01 17:44:46
#> 2 dyfan.r.jones     b3f6bed018b0c5… made… jcqpX.BP… FALSE    2024-07-01 17:43:29

Created on 2024-07-01 with reprex v2.1.0

Note: I only used tibblify as I was being lazy :P

DyfanJones commented 6 days ago

Let me know if I have answered your question correctly :)

library(paws)

cfg = s3(config(credentials(profile = "paws")))

bucket = "paws-version"
key = "made-up"
resp <- cfg$put_object(
  Bucket=bucket,
  Key=key,
  Body=charToRaw("hello world")
)

Sys.sleep(2)
resp <- cfg$list_object_versions(
  Bucket = bucket, Prefix = key
)
resp$Versions |> tibblify::tibblify()
#> The spec contains 1 unspecified field:
#> • ChecksumAlgorithm
#> # A tibble: 9 × 10
#>   ETag             ChecksumAlgorithm  Size StorageClass Key   VersionId IsLatest
#>   <chr>            <list>            <dbl> <chr>        <chr> <chr>     <lgl>   
#> 1 "\"5eb63bbbe01e… <list [0]>           11 STANDARD     made… 5gLM9Fin… TRUE    
#> 2 "\"5eb63bbbe01e… <list [0]>           11 STANDARD     made… kqZV41GT… FALSE   
#> 3 "\"5eb63bbbe01e… <list [0]>           11 STANDARD     made… H_LHXW2x… FALSE   
#> 4 "\"5eb63bbbe01e… <list [0]>           11 STANDARD     made… 3qe20R36… FALSE   
#> 5 "\"5eb63bbbe01e… <list [0]>           11 STANDARD     made… GvooIndy… FALSE   
#> 6 "\"5eb63bbbe01e… <list [0]>           11 STANDARD     made… Rma4F3KT… FALSE   
#> 7 "\"5eb63bbbe01e… <list [0]>           11 STANDARD     made… DgZPxM1c… FALSE   
#> 8 "\"5eb63bbbe01e… <list [0]>           11 STANDARD     made… HDda3wDD… FALSE   
#> 9 "\"5eb63bbbe01e… <list [0]>           11 STANDARD     made… YPH2FjWf… FALSE   
#> # ℹ 3 more variables: LastModified <dttm>, Owner <tibble[,2]>,
#> #   RestoreStatus <tibble[,2]>

resp$DeleteMarkers |> tibblify::tibblify()
#> # A tibble: 1 × 5
#>   Owner$DisplayName $ID             Key   VersionId IsLatest LastModified       
#>   <chr>             <chr>           <chr> <chr>     <lgl>    <dttm>             
#> 1 dyfan.r.jones     b3f6bed018b0c5… made… vALRyb2N… FALSE    2024-07-01 17:50:09

# delete every delete marker by its own Key and VersionId
cfg$delete_objects(
  Bucket = bucket,
  Delete = list(
    Objects = lapply(resp$DeleteMarkers, \(m) list(Key = m$Key, VersionId = m$VersionId)),
    Quiet = TRUE
  )
)
#> $Deleted
#> list()
#> 
#> $RequestCharged
#> character(0)
#> 
#> $Errors
#> list()

resp <- cfg$list_object_versions(
  Bucket = bucket, Prefix = key
)
resp$DeleteMarkers
#> list()

Created on 2024-07-01 with reprex v2.1.0
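
One caveat when scaling this up: the S3 DeleteObjects API accepts at most 1000 keys per request, so a longer list of delete markers has to be sent in batches. The following is a rough sketch of the batching step only; `batch_objects` is a hypothetical helper and the keys are made up for illustration.

```r
# Hypothetical helper: split Key/VersionId pairs into batches of at most
# `size` entries (S3 DeleteObjects accepts up to 1000 keys per request).
batch_objects <- function(objects, size = 1000L) {
  if (length(objects) == 0) return(list())
  groups <- ceiling(seq_along(objects) / size)
  unname(split(objects, groups))
}

# Made-up Key/VersionId pairs standing in for real delete markers.
objects <- lapply(
  1:2500,
  function(i) list(Key = paste0("key-", i), VersionId = paste0("v", i))
)

batches <- batch_objects(objects)
lengths(batches)
```

Each batch could then be passed as `Delete = list(Objects = batch, Quiet = TRUE)` in its own `delete_objects` call.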

FMKerckhof commented 6 days ago

Hi @DyfanJones : thank you for this feedback - I indeed missed the DeleteMarkers slot, since I wrongly assumed they would be listed in the Versions slot. Nonetheless, $DeleteMarkers returns an empty list() even though I have delete markers in my bucket (cf. the screenshot) - is there any reason why this could be the case? The response definitely is truncated (I have about 1000 keys to inspect and each has 3-4 versions, excluding the delete markers) - could that be the reason?

FMKerckhof commented 6 days ago

fyi - I am using paws 0.6.0 on R 4.4.0 for windows

DyfanJones commented 6 days ago

What version of paws.common do you have? And what region is your AWS S3 bucket in?

FMKerckhof commented 6 days ago

paws.common 0.7.3 and my bucket is in eu-central-1

DyfanJones commented 6 days ago

The response is truncated at 1000 entries because of a limit in the AWS API. You will need to use the paginate function to get all pages.

FMKerckhof commented 6 days ago

Thank you, I used the $NextKeyMarker and $NextVersionIdMarker to paginate through the keys and versions until $IsTruncated was FALSE. Still, all $DeleteMarkers returned an empty list() object.
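
The manual loop described here could look roughly like the sketch below. It is an illustration under assumptions, not the code that was actually run: `collect_delete_markers` is a hypothetical helper, `list_fn` stands in for `s3$list_object_versions` so the sketch runs without AWS access, and passing `NULL` for the markers on the first request is assumed to be equivalent to leaving them unset.

```r
# Hypothetical helper: follow NextKeyMarker / NextVersionIdMarker until
# IsTruncated is FALSE, collecting the delete markers from every page.
collect_delete_markers <- function(list_fn, bucket, prefix) {
  markers <- list()
  key_marker <- NULL
  version_marker <- NULL
  repeat {
    page <- list_fn(
      Bucket = bucket, Prefix = prefix,
      KeyMarker = key_marker, VersionIdMarker = version_marker
    )
    markers <- c(markers, page$DeleteMarkers)
    if (!isTRUE(page$IsTruncated)) break
    key_marker <- page$NextKeyMarker
    version_marker <- page$NextVersionIdMarker
  }
  markers
}

# Fake two-page listing used in place of a real S3 call (assumption).
fake_list <- function(Bucket, Prefix, KeyMarker = NULL, VersionIdMarker = NULL) {
  if (is.null(KeyMarker)) {
    list(
      IsTruncated = TRUE, NextKeyMarker = "a.txt", NextVersionIdMarker = "d1",
      DeleteMarkers = list(list(Key = "a.txt", VersionId = "d1", IsLatest = TRUE))
    )
  } else {
    list(
      IsTruncated = FALSE,
      DeleteMarkers = list(list(Key = "b.txt", VersionId = "d9", IsLatest = TRUE))
    )
  }
}

all_markers <- collect_delete_markers(fake_list, "my-bucket", "some/prefix/")
length(all_markers)
```

With a real client, `list_fn` would be `s3$list_object_versions`.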

DyfanJones commented 5 days ago

@FMKerckhof the same can be achieved by using the paginate function:

library(paws)
cfg <- s3()

bucket <- "your-bucket"
key <- "path/to/file.txt"

resp <- cfg$list_object_versions(
  Bucket = bucket, Prefix = key
) |> paginate()

The paginate function should get all pages for you and make it a little easier to work with multiple pages in AWS (https://www.paws-r-sdk.com/developer_guide/paginators/)

You can also stop pagination early, for example if you only want 1500 items with a page size of 500:

library(paws)
cfg <- s3()

bucket <- "your-bucket"
key <- "path/to/file.txt"

resp <- cfg$list_object_versions(
  Bucket = bucket, Prefix = key
) |> paginate(PageSize = 500, MaxItems = 1500)

DyfanJones commented 5 days ago

@FMKerckhof I believe I have found a bug that is causing DeleteMarkers to be returned as list() for you.

Please try.

remotes::install_github("dyfanjones/paws/paws.common", ref = "bug-transpose")

In short, the old method used the length of the first element of the list to fill in empty elements. However, this is flawed, as it didn't take into account that the first element could itself be empty:

# old:
transpose_old <- function (x) {
    if (any(found <- lengths(x) == 0)) {
        x[found] <- list(rep(list(), length.out = length(x[[1]])))
    }
    .mapply(list, x, NULL)
}

# new:
transpose_new <- function (x) {
    lens <- lengths(x)
    if (any(found <- lens == 0)) {
        x[found] <- list(rep(list(), length.out = max(lens)))
    }
    .mapply(list, x, NULL)
}

obj <- list(
    var1 = list(),
    var2 = c(1,2,3)
)

transpose_old(obj)
#> list()

transpose_new(obj)
#> [[1]]
#> [[1]]$var1
#> NULL
#> 
#> [[1]]$var2
#> [1] 1
#> 
#> 
#> [[2]]
#> [[2]]$var1
#> NULL
#> 
#> [[2]]$var2
#> [1] 2
#> 
#> 
#> [[3]]
#> [[3]]$var1
#> NULL
#> 
#> [[3]]$var2
#> [1] 3

Created on 2024-07-02 with reprex v2.1.0

DyfanJones commented 5 days ago

I will leave this open until I release paws.common 0.7.4 to CRAN.

FMKerckhof commented 4 days ago

Hi @DyfanJones : thank you for figuring this out. I installed your patch using remotes::install_github() and now indeed my DeleteMarkers are returned just fine. 👍 Looking forward to the next CRAN release of paws.common.