paws-r / paws

Paws, a package for Amazon Web Services in R
https://www.paws-r-sdk.com
Other
315 stars 37 forks source link

Infinite loop while paginating paws.management::cloudwatchlogs()$get_log_events() #721

Closed wlandau closed 9 months ago

wlandau commented 10 months ago

I am working on https://github.com/wlandau/crew.aws.batch/issues/2, and things are mostly coming along well. However, I notice that when I try to use pagination to get the full log of an AWS Batch job, the call to get_log_events() hangs. Turning on verbose logging with options(paws.log_level = 3), it appears that the function call churns in an infinite loop even when the log is very small. I am using paws.common 0.6.4 and paws.management 0.4.0 (both current CRAN). Here is my reprex:

library(paws.common)
library(paws.compute)
library(paws.management)
options(paws.log_level = 3)
region <- "us-east-2"
client_batch <- batch(region = region)
client_cloudwatchlogs <- cloudwatchlogs(region = region)
definition <- client_batch$register_job_definition(
  jobDefinitionName = "crew-aws-batch-test",
  type = "container",
  platformCapabilities = "EC2",
  timeout = list(attemptDurationSeconds = 600),
  containerProperties = list(
    image = "alpine:latest",
    resourceRequirements = list(
      memory = list(value = "128", type = "MEMORY"),
      cpus = list(value = "1", type = "VCPU")),
    logConfiguration = list(
      logDriver = "awslogs",
      options = list(
        "awslogs-group" = "/aws/batch/job",
        "awslogs-region" = "us-east-2",
        "awslogs-stream-prefix" = "crew-aws-batch-test")
    )
  )
)
job <- client_batch$submit_job(
  jobName = "crew-aws-batch-test-job",
  jobQueue = "crew-aws-batch-job-queue",
  jobDefinition = "crew-aws-batch-test",
  timeout = list(attemptDurationSeconds = 600),
  containerOverrides = list(
    command = list("echo", "done with container\ndone with job"),
    resourceRequirements = list(
      memory = list(value = "128", type = "MEMORY"),
      cpus = list(value = "1", type = "VCPU")
    )
  )
)
while (TRUE) {
  info <- client_batch$describe_jobs(jobs = job$jobId)
  status <- info$jobs[[1L]]$status
  message(paste("job status:", status))
  if (status %in% c("SUCCEEDED", "FAILED")) {
    break
  }
  Sys.sleep(5)
}
log_stream_name <- info$jobs[[1L]]$container$logStreamName
pages <- paws.common::paginate(
  client_cloudwatchlogs$get_log_events(
    logGroupName = "/aws/batch/job",
    logStreamName = log_stream_name,
    startFromHead = FALSE
  )
)

On my end, the R console output looks like this (although I censored the sensitive information). It never stops printing messages like these.

INFO [2023-12-08 15:06:23.988]: -> CONNECT logs.us-east-2.amazonaws.com:443 HTTP/1.1
-> Host: logs.us-east-2.amazonaws.com:443
-> User-Agent: libcurl/8.1.2 r-curl/5.1.0 httr/1.4.7
-> Proxy-Connection: Keep-Alive
-> 
INFO [2023-12-08 15:06:24.233]: <- HTTP/1.1 200 Connection established
INFO [2023-12-08 15:06:24.233]: <- 
INFO [2023-12-08 15:06:24.654]: -> POST / HTTP/1.1
-> Host: logs.us-east-2.amazonaws.com
-> Accept-Encoding: deflate, gzip
-> Accept: application/json, text/xml, application/xml, */*
-> User-Agent: paws/0.6.4 (R4.3.2; darwin20; aarch64)
-> X-Amz-Target: Logs_20140328.GetLogEvents
-> Content-Type: application/x-amz-json-1.1
-> Content-Length: 134
-> X-Amz-Security-Token: CENSORED
-> X-Amz-Date: 20231208T200623Z
-> Authorization: CENSORED
-> 
INFO [2023-12-08 15:06:24.655]: >> {"logGroupName":"/aws/batch/job","logStreamName":"crew-aws-batch-test/default/CENSORED","startFromHead":false}

INFO [2023-12-08 15:06:24.854]: <- HTTP/1.1 200 OK
INFO [2023-12-08 15:06:24.854]: <- x-amzn-RequestId: 368eb364-dc05-4e9f-b45b-ed9a6f453e88
INFO [2023-12-08 15:06:24.854]: <- Content-Type: application/x-amz-json-1.1
INFO [2023-12-08 15:06:24.854]: <- Content-Length: 489
INFO [2023-12-08 15:06:24.854]: <- Date: Fri, 08 Dec 2023 20:06:24 GMT
INFO [2023-12-08 15:06:24.854]: <- 
INFO [2023-12-08 15:06:24.867]: -> POST / HTTP/1.1
-> Host: logs.us-east-2.amazonaws.com
-> Accept-Encoding: deflate, gzip
-> Accept: application/json, text/xml, application/xml, */*
-> User-Agent: paws/0.6.4 (R4.3.2; darwin20; aarch64)
-> X-Amz-Target: Logs_20140328.GetLogEvents
-> Content-Type: application/x-amz-json-1.1
-> Content-Length: 209
-> X-Amz-Security-Token: CENSORED
-> X-Amz-Date: 20231208T200624Z
-> Authorization: CENSORED
-> 
INFO [2023-12-08 15:06:24.867]: >> {"logGroupName":"/aws/batch/job","logStreamName":"crew-aws-batch-test/default/CENSORED","nextToken":"f/CENSORED/s","startFromHead":false}

Session info:

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.1.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Indiana/Indianapolis
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] crew.aws.batch_0.0.0.9001 testthat_3.2.0            paws.management_0.4.0    
[4] paws.compute_0.4.0        paws.common_0.6.4        

loaded via a namespace (and not attached):
 [1] utf8_1.2.4        stringi_1.8.2     digest_0.6.33     magrittr_2.0.3    pkgload_1.3.3    
 [6] fastmap_1.1.1     jsonlite_1.8.8    rprojroot_2.0.4   processx_3.8.2    pkgbuild_1.4.2   
[11] sessioninfo_1.2.2 brio_1.1.3        urlchecker_1.0.1  ps_1.7.5          promises_1.2.1   
[16] httr_1.4.7        purrr_1.0.2       fansi_1.0.5       crew_0.7.0        cli_3.6.1        
[21] shiny_1.7.5.1     rlang_1.1.2       crayon_1.5.2      ellipsis_0.3.2    withr_2.5.2      
[26] remotes_2.4.2.1   cachem_1.0.8      devtools_2.4.5    tools_4.3.2       parallel_4.3.2   
[31] memoise_2.0.1     getip_0.1-3       nanonext_0.11.0   httpuv_1.6.11     curl_5.1.0       
[36] vctrs_0.6.5       R6_2.5.1          mime_0.12         lifecycle_1.0.4   stringr_1.5.1    
[41] fs_1.6.3          htmlwidgets_1.6.2 usethis_2.2.2     miniUI_0.1.1.1    desc_1.4.2       
[46] pkgconfig_2.0.3   callr_3.7.3       pillar_1.9.0      later_1.3.1       data.table_1.14.8
[51] glue_1.6.2        profvis_0.3.8     Rcpp_1.0.11       tibble_3.2.1      tidyselect_1.2.0 
[56] rstudioapi_0.15.0 xtable_1.8-4      htmltools_0.5.6.1 mirai_0.11.3      compiler_4.3.2   
[61] prettyunits_1.2.0
wlandau commented 10 months ago

By the way, is there a completely straightforward and reliable way to get the CloudWatch log group given a job ID, job name, or even job ARN? I can't find it in the output of describe_jobs() or describe_job_definitions(), so for the sake of https://github.com/wlandau/crew.aws.batch/issues/2, I am having users create their own job definitions and supply an explicit log group name with default "/aws/batch/job". Not ideal if someone's sys admins use a custom logging scheme which users don't know about, so I'd be interested in a better approach if there is one.

DyfanJones commented 10 months ago

Sorry about that @wlandau I will try and get this in for paws.common 0.6.5. Off the top of my head I am not sure but I will have a little look into it :)

DyfanJones commented 9 months ago

Hmmm it looks like get_log_events isn't a paginator in boto3 https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html#paginators

From digging a little more, there seems to be some issues raised on botocore and aws-sdk around this:

https://github.com/boto/botocore/issues/1474 https://github.com/aws/aws-sdk/issues/422

DyfanJones commented 9 months ago

Not sure what to do on this one as I would like to keep align with the other aws-sdks however it doesn't feel difficult to implement.

From my understanding it should be checking the primary_result_key for any results (similar to MaxItems). Initial idea for implementation: https://github.com/DyfanJones/paws/blob/paginator_empty_result/paws.common/R/paginate.R#L59-L85

DyfanJones commented 9 months ago

Here is a solution that aws js v3 uses. I believe we can implement something similar.

https://github.com/aws/aws-sdk-js-v3/pull/3524

DyfanJones commented 9 months ago

@wlandau I believe I have a solution to this based off the aws sdk* js v3:

remotes::install_github("dyfanjones/paws/paws.common", ref = "paginator_empty_result")

library(paws)

client = cloudwatchlogs(config(credentials(profile = "paws")))

pages <- paws.common::paginate(
  client$get_log_events(
    logGroupName = "/aws/sagemaker/NotebookInstances",
    logStreamName = "paws-demo/jupyter.log",
    startFromHead = TRUE
  ),
  StopOnSameToken = TRUE
)
length(pages)
#> [1] 2

Created on 2023-12-19 with reprex v2.0.2

Please try it out :)

DyfanJones commented 9 months ago

For completeness, here is the typescript smithy reference: https://github.com/smithy-lang/smithy-typescript/blob/main/packages/core/src/pagination/createPaginator.ts#L20-L58

export function createPaginator<
  PaginationConfigType extends PaginationConfiguration,
  InputType extends object,
  OutputType extends object
>(
  ClientCtor: any,
  CommandCtor: any,
  inputTokenName: string,
  outputTokenName: string,
  pageSizeTokenName?: string
): (config: PaginationConfigType, input: InputType, ...additionalArguments: any[]) => Paginator<OutputType> {
  return async function* paginateOperation(
    config: PaginationConfigType,
    input: InputType,
    ...additionalArguments: any[]
  ): Paginator<OutputType> {
    let token: any = config.startingToken || undefined;
    let hasNext = true;
    let page: OutputType;

    while (hasNext) {
      (input as any)[inputTokenName] = token;
      if (pageSizeTokenName) {
        (input as any)[pageSizeTokenName] = (input as any)[pageSizeTokenName] ?? config.pageSize;
      }
      if (config.client instanceof ClientCtor) {
        page = await makePagedClientRequest(CommandCtor, config.client, input, ...additionalArguments);
      } else {
        throw new Error(`Invalid client, expected instance of ${ClientCtor.name}`);
      }
      yield page;
      const prevToken = token;
      token = (page as any)[outputTokenName];
      hasNext = !!(token && (!config.stopOnSameToken || token !== prevToken));
    }
    // @ts-ignore
    return undefined;
  };
}
wlandau commented 9 months ago

Thanks for the patch, @DyfanJones! Unfortunately, I won’t be able to test until January when I get back from vacation. If #722 works on your end, please feel free to merge etc.

DyfanJones commented 9 months ago

No worries, I won't be pushing this to the cran before the new year. I am just going to go over the logic before finally merging :)

wlandau commented 9 months ago

Hi @DyfanJones, I hope you had a nice holiday.

I tested your work on this issue, both using https://github.com/paws-r/paws/issues/721#issue-2033258552 and https://github.com/wlandau/crew.aws.batch/blob/9714d49d32d94b1e044cd06f2fde0956f6ea556e/tests/monitor/test-jobs.R#L88-L130, and paws.common::paginate(StopOnSameToken = TRUE) works. Thanks for the patch!

By the way, I noticed the default argument of StopOnSameToken is FALSE. Which scenarios should it be TRUE, and which require StopOnSameToken = FASLE?

DyfanJones commented 9 months ago

That is great news 👍

The implementation mimics how the JavaScript AWS SDK V3 paginates with the default set as false or missing stopOnSameToken in the config, this also helps paws and keeps previous pagination behaviour for other services.

For knowing what service to set StopOnSameToken = TRUE, that is a tricky question (currently I am not sure). When checking the documentation for aws sdk js v3 (https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/Package/-aws-sdk-client-cloudwatch-logs/Variable/paginateGetLogEvents/) it only says the following.

For some APIs, such as CloudWatchLogs events, the next page token will always be present.When true, this config field will have the paginator stop when the token doesn't change instead of when it is not present.

It might be worth having this documentation in paws as well. Sadly this isn't a solid answer. If anyone knows a better answer other than documentation I am happy add it in 😄

For completeness ====

Mini example for testing out aws sdk js v3 paginator logic.

let tokens = ["a", "b", "c", "c", "d"];

export interface PaginationConfiguration {
    stopOnSameToken?: boolean;
};

export function createPaginator(config: PaginationConfiguration): void {

    let hasNext: boolean = true;
    let token: string = undefined;
    let i: number = 0;

    while (hasNext) {
        const prevToken = token;
        token = tokens[i];
        hasNext = !!(token && (!config.stopOnSameToken || token !== prevToken));
        console.log(hasNext)
        i++;
    };
};

const paginatorConfig = { stopOnSameToken: true };
// const paginatorConfig = { };
// const paginatorConfig = { stopOnSameToken: false };

createPaginator(paginatorConfig);

// paginatorConfig.stopOnSameToken = true
// true
// true
// true
// false

// paginatorConfig.stopOnSameToken = false or
// paginatorConfig = {}
// true
// true
// true
// true
// true
// false
DyfanJones commented 9 months ago

Note: paws.common 0.7.0 has been released to the cran