sillsdev / serval

A REST API for natural language processing services
MIT License

Build stays in active state after completing #115

Closed johnml1135 closed 1 year ago

johnml1135 commented 1 year ago

On qa.serval-api.org:

[
  {
    "id": "64f6360e6f044714a98d9ced",
    "url": "/api/v1/translation/engines/64f6360c6f044714a98d9cd9/builds/64f6360e6f044714a98d9ced",
    "revision": 3,
    "engine": {
      "id": "64f6360c6f044714a98d9cd9",
      "url": "/api/v1/translation/engines/64f6360c6f044714a98d9cd9"
    },
    "pretranslate": [
      {
        "corpus": {
          "id": "64f6360e6f044714a98d9cec",
          "url": "/api/v1/translation/engines/64f6360c6f044714a98d9cd9/corpora/64f6360e6f044714a98d9cec"
        }
      }
    ],
    "step": 10,
    "state": "Active"
  }
]

[image]

What is going on?

johnml1135 commented 1 year ago

And it failed, going by ... machine_jobs.hangfire.jobgraph: [image]

But serval.translation.builds still shows: [image]

And from the logs, the only thing of note before it completed was a token refresh: [image]

It appears that the token refreshed right before it failed. Now, it failed at the same time that the job ended (15:58:04 in ClearML but 19:58:04 in MongoDB), but it could be that, because of the token refresh, it got lost in some way. So there appear to be 3 issues:
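
A side note on the two timestamps above (not one of the three issues): 15:58:04 and 19:58:04 are exactly four hours apart, which would be consistent with one tool showing local time and the other showing UTC (MongoDB stores dates in UTC). The sketch below only checks that arithmetic. The UTC-4 offset for the ClearML display and the calendar date are assumptions made for illustration; neither is stated in this thread.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the ClearML time was displayed in a UTC-4 local zone.
ASSUMED_LOCAL_ZONE = timezone(timedelta(hours=-4))

# Placeholder date: the comment only gives times of day.
PLACEHOLDER_DATE = (2023, 1, 1)


def same_instant(local_hms, utc_hms, local_zone=ASSUMED_LOCAL_ZONE):
    """Return True if a local wall-clock time and a UTC time are the same moment."""
    local = datetime(*PLACEHOLDER_DATE, *local_hms, tzinfo=local_zone)
    utc = datetime(*PLACEHOLDER_DATE, *utc_hms, tzinfo=timezone.utc)
    # Timezone-aware datetimes compare by the instant they denote.
    return local == utc


print(same_instant((15, 58, 4), (19, 58, 4)))
```

If the offset assumption holds, the two records describe the same moment, which matches the reading that the failure coincided with the end of the job.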

johnml1135 commented 1 year ago

Actually, here is the hangfire error:

Amazon.S3.AmazonS3Exception: Access Denied
 ---> Amazon.Runtime.Internal.HttpErrorResponseException: Exception of type 'Amazon.Runtime.Internal.HttpErrorResponseException' was thrown.
   at Amazon.Runtime.HttpWebRequestMessage.GetResponseAsync(CancellationToken cancellationToken)
   at Amazon.Runtime.Internal.HttpHandler`1.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.RedirectHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.Unmarshaller.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.S3.Internal.AmazonS3ResponseHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext)
   --- End of inner exception stack trace ---
   at Amazon.Runtime.Internal.HttpErrorResponseExceptionHandler.HandleExceptionStream(IRequestContext requestContext, IWebResponseData httpErrorResponse, HttpErrorResponseException exception, Stream responseStream)
   at Amazon.Runtime.Internal.HttpErrorResponseExceptionHandler.HandleExceptionAsync(IExecutionContext executionContext, HttpErrorResponseException exception)
   at Amazon.Runtime.Internal.ExceptionHandler`1.HandleAsync(IExecutionContext executionContext, Exception exception)
   at Amazon.Runtime.Internal.ErrorHandler.ProcessExceptionAsync(IExecutionContext executionContext, Exception exception)
   at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.Signer.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.CredentialsRetriever.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.S3.Internal.AmazonS3ExceptionHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.ErrorCallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
   at Amazon.Runtime.Internal.MetricsHandler.InvokeAsync[T](IExecutionContext executionContext)
   at SIL.Machine.AspNetCore.Services.S3FileStorage.Rm(String path, Boolean recurse, CancellationToken cancellationToken) in /app/src/SIL.Machine.AspNetCore/Services/S3FileStorage.cs:line 94
   at SIL.Machine.AspNetCore.Services.SharedFileService.DeleteAsync(String path, CancellationToken cancellationToken) in /app/src/SIL.Machine.AspNetCore/Services/SharedFileService.cs:line 71
   at SIL.Machine.AspNetCore.Services.ClearMLNmtEngineBuildJob.RunAsync(String engineId, String buildId, IReadOnlyList`1 corpora, CancellationToken cancellationToken) in /app/src/SIL.Machine.AspNetCore/Services/ClearMLNmtEngineBuildJob.cs:line 217
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)

johnml1135 commented 1 year ago

[image]

ddaspit commented 1 year ago

It looks like it crashed in the exception handler of the job prior to updating the state and notifying Serval.
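
To illustrate the failure mode described here, below is a minimal sketch in Python. It is not the actual ClearMLNmtEngineBuildJob code; the function names and the `state` dictionary are hypothetical. It shows how a cleanup call that throws inside an exception handler can skip the state update that follows it, and one possible ordering that avoids this by recording the state first and guarding the cleanup.

```python
import logging

logger = logging.getLogger("build_job_sketch")


def run_build_fragile(train, delete_files, state):
    """Cleanup runs before the state update, so a cleanup error hides the outcome."""
    try:
        train()
        state["state"] = "Completed"
    except Exception:
        delete_files()              # may raise, e.g. on a storage permission error
        state["state"] = "Faulted"  # never reached if delete_files() raised
        raise


def run_build_guarded(train, delete_files, state):
    """Record the outcome first, then treat cleanup as best effort."""
    try:
        train()
        state["state"] = "Completed"
    except Exception:
        state["state"] = "Faulted"
        try:
            delete_files()
        except Exception:
            logger.exception("cleanup failed; build state was already recorded")
        raise  # re-raise the original training error


# Hypothetical stand-ins for a failed training run and a denied delete.
def failing_train():
    raise RuntimeError("training failed")


def denied_delete():
    raise PermissionError("Access Denied")


def final_state(runner):
    state = {"state": "Active"}
    try:
        runner(failing_train, denied_delete, state)
    except Exception:
        pass
    return state["state"]


print(final_state(run_build_fragile))  # Active
print(final_state(run_build_guarded))  # Faulted
```

In the fragile version the build is left in the "Active" state, which matches the symptom reported in this issue. Whether the real job should record the state before cleanup or simply swallow cleanup errors is a design choice this sketch does not settle.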

ddaspit commented 1 year ago

This is another reason why it would be useful to get rid of the NMT Hangfire job.