usnistgov / WIPP-backend

Web Image Processing Pipeline (WIPP) - backend

Invalid file chunk number errors #169

Open tejavegesna opened 3 years ago

tejavegesna commented 3 years ago

Summary

Uploaded file chunks are being rerouted to different pods, which gives us invalid chunk number errors. If we have autoscaling, does this mean that each of those 10 chunks gets sent to a different pod?

What is the current bug behavior?

When you upload files to WIPP, the file is cut up into 1MB chunks rather than sent as one continuous file stream. So for a 10MB file, 10 (1MB) chunks are sent to the backend. Nginx is routing different file chunks to different pods, and the logs attached show invalid flow chunk errors.
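For illustration, here is a minimal sketch of the chunked-upload pattern described above, assuming a flow.js-style protocol where every chunk is posted as its own HTTP request carrying its chunk number. The endpoint URL and query parameter names are placeholders rather than the exact WIPP API; the point is simply that each chunk is an independent request, so a round-robin Nginx/Service is free to send every chunk of the same file to a different pod.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class ChunkedUploadSketch {
    // 1 MB chunks, as described in the issue
    private static final int CHUNK_SIZE = 1024 * 1024;

    public static void main(String[] args) throws IOException, InterruptedException {
        Path file = Path.of(args[0]);
        byte[] data = Files.readAllBytes(file);
        int totalChunks = (int) Math.ceil((double) data.length / CHUNK_SIZE);

        HttpClient client = HttpClient.newHttpClient();
        for (int i = 0; i < totalChunks; i++) {
            int from = i * CHUNK_SIZE;
            int to = Math.min(from + CHUNK_SIZE, data.length);
            byte[] chunk = Arrays.copyOfRange(data, from, to);

            // Each chunk is an independent POST; the load balancer may pick a
            // different backend pod for every one of them.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://wipp-backend/api/files/upload"
                            + "?chunkNumber=" + (i + 1)
                            + "&totalChunks=" + totalChunks))
                    .POST(HttpRequest.BodyPublishers.ofByteArray(chunk))
                    .build();
            HttpResponse<String> resp =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("chunk " + (i + 1) + "/" + totalChunks
                    + " -> " + resp.statusCode());
        }
    }
}
```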

What is the expected correct behavior?

There should be no chunk errors when the file conversion happens.

Steps to reproduce

@Nicholas-Schaub Uploading 1500 images. 5 replicas running on Kubernetes, scaled using the Horizontal Pod Autoscaler (HPA): min pods 1, max pods 5, CPU requests: 1, CPU limits: 2.

Relevant screenshots and/or logs

pod1 was running initially, and pod2 & pod3 were started by the autoscaling activity. pod1.txt pod2.txt pod3.txt

Environment info

labshare/wipp-backend:3.0.0-generic

Possible fixes

Not So Sure

cc: @Nicholas-Schaub

MyleneSimon commented 3 years ago

Hi @tejavegesna, is it still happening after the 2 additional pods have been running for a while? And does it also happen when you hard-code the number of replicas (as opposed to autoscaling)?

Looking at the time stamps in the logs, I am wondering if it might be an issue with the readiness of the pod/app (since we don't have a readiness probe for wipp-backend, the pod might be marked ready before the app actually is). So some of the chunks would have been sent to pods 2 and 3 right when the autoscaling kicked in (and before they were actually ready to receive the chunks), which would then mess up the whole chunk registration and image conversion. Not saying this is the only issue here, but I just wanted to check that first if you get a chance to test it.
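For reference, a rough sketch of what an application-side readiness signal could look like, assuming the Spring Boot stack wipp-backend is built on; the /ready path, class name, and ApplicationReadyEvent wiring are illustrative only, and the matching Kubernetes readinessProbe (an httpGet on that path in the deployment spec) is not shown here. The idea is just that a newly scaled pod would not receive chunk requests until the application reports it is ready.

```java
import java.util.concurrent.atomic.AtomicBoolean;

import org.springframework.boot.context.event.ApplicationReadyEvent;
import org.springframework.context.event.EventListener;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ReadinessController {
    private final AtomicBoolean ready = new AtomicBoolean(false);

    // Flipped once Spring has fully started the application context.
    @EventListener(ApplicationReadyEvent.class)
    public void onApplicationReady() {
        ready.set(true);
    }

    // A Kubernetes readinessProbe hitting this endpoint would keep the pod out
    // of the Service endpoints (and thus out of Nginx routing) until it returns 200.
    @GetMapping("/ready")
    public ResponseEntity<String> ready() {
        return ready.get()
                ? ResponseEntity.ok("ready")
                : ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE).body("starting");
    }
}
```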

tejavegesna commented 3 years ago

@MyleneSimon We saw the issue when pods were autoscaled using the Horizontal Pod Autoscaler.

And yes, this happens even when the number of pods is static; we tried with 2 pods and no autoscaler involved.

MyleneSimon commented 3 years ago

@tejavegesna thanks for testing. I checked the backend chunk upload code and there is a ConcurrentMap there that I am afraid might not be playing well with the pod replication... But I will investigate a bit more to make sure this is the issue here. In the meantime, can you guys go back to 1 replica and do some scaling with the ome.converter.threads value?
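To make the concern concrete, here is a rough sketch (not the actual wipp-backend code; the class and method names are hypothetical) of why per-pod, in-memory chunk bookkeeping can break with more than one replica: each pod's ConcurrentMap only ever sees the chunks that Nginx routed to it, so the "all chunks received" condition may never hold on any single pod and chunk-number checks fail.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ChunkRegistry {
    // Keyed by upload id; this map lives only in this pod's JVM heap, so every
    // replica ends up with a different, partial view of the same upload.
    private final Map<String, Set<Integer>> receivedChunks = new ConcurrentHashMap<>();

    public void registerChunk(String uploadId, int chunkNumber, int totalChunks) {
        Set<Integer> chunks = receivedChunks.computeIfAbsent(uploadId,
                id -> ConcurrentHashMap.newKeySet());
        chunks.add(chunkNumber);

        // With N replicas behind round-robin routing, this condition is only
        // true on a pod if *all* chunks happen to land on that same pod.
        if (chunks.size() == totalChunks) {
            assembleFile(uploadId, totalChunks);
        }
    }

    private void assembleFile(String uploadId, int totalChunks) {
        // Reassembly / image conversion would be triggered here; with the state
        // split across pods it is either never triggered or runs with chunks missing.
    }
}
```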