kislyuk / aegea

Amazon Web Services Operator Interface
Apache License 2.0
68 stars 17 forks

kill processes using the filesystem before unmounting #34

Closed boris-dimitrov closed 6 years ago

boris-dimitrov commented 6 years ago

This appears to greatly improve the probability of a successful unmount, particularly when the processes are killed three different ways: via fuser, via lsof, and by killing all of the shell's children.
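For readers following along, here is a minimal sketch of how the three PID sources could be gathered and merged. It is not the code from this PR: the function names and the mount-point argument are hypothetical, and it assumes fuser (psmisc), lsof, and pgrep (procps) are installed in the container.

```shell
# Sketch only: all function names here are hypothetical, not taken from the PR.

# 1. PIDs using the filesystem mounted at $1, according to fuser.
#    fuser writes the PIDs to stdout and everything else to stderr.
pids_via_fuser() {
    fuser -m "$1" 2>/dev/null | tr -s ' ' '\n' | grep -E '^[0-9]+$'
}

# 2. The same question asked of lsof; -t prints bare PIDs, one per line.
pids_via_lsof() {
    lsof -t "$1" 2>/dev/null
}

# 3. Direct children of the process whose PID is $1.
pids_of_children() {
    pgrep -P "$1"
}

# Merge PID lists read from stdin: keep numeric lines only,
# drop duplicates, and sort numerically.
merge_pids() {
    grep -E '^[0-9]+$' | sort -n -u
}

# Print the union of all three sources for mount point $1.
# The list can include short-lived helpers from this very pipeline;
# they have exited by the time a caller has captured the output.
collect_pids() {
    { pids_via_fuser "$1"; pids_via_lsof "$1"; pids_of_children "$$"; } | merge_pids
}
```

A caller would capture the result first, for example `pids=$(collect_pids /mnt/example)` with a hypothetical mount point, and only then decide what to signal.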

boris-dimitrov commented 6 years ago

Here is one successful job running with this: https://us-west-2.console.aws.amazon.com/batch/home?region=us-west-2#/jobs/queue/arn:aws:batch:us-west-2:423543210473:job-queue~2Fidseq_production_low_pri_stg1/job/479c5c21-1e7f-4f53-a338-f861276c9d73?state=RUNNING

boris-dimitrov commented 6 years ago

joint work with @yunfangjuan

boris-dimitrov commented 6 years ago

Another great success: https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/batch/job;stream=aegea_batch/default/d6e61f32-43a8-4ba7-9636-281561a2e1f2

kislyuk commented 6 years ago

Thanks. Can you update this to use apt-get update; apt-get install -y, and also use apt-get to install httpie and any other dependencies of the shellcode that you can think of that are not in the standard Ubuntu Docker image?

boris-dimitrov commented 6 years ago

I am still tweaking this and having some trouble killing just the processes using the filesystem (without also killing the TRAP SIGEXIT handler).

Here are the processes using the mounted EBS filesystem, according to lsof:

COMMAND PID USER FD TYPE DEVICE   SIZE/OFF     NODE NAME
python  305 root 3w REG  202,80          0 15466504 /mnt/idseq/data/czbiohub-idseq-samples-production-samples-1-1-fastqs/results/log.26e03af0-3afa-4627-8e1b-8427ebd43359.txt
sh      320 root 3w REG  202,80          0 15466504 /mnt/idseq/data/czbiohub-idseq-samples-production-samples-1-1-fastqs/results/log.26e03af0-3afa-4627-8e1b-8427ebd43359.txt
aws     321 root 3w REG  202,80          0 15466504 /mnt/idseq/data/czbiohub-idseq-samples-production-samples-1-1-fastqs/results/log.26e03af0-3afa-4627-8e1b-8427ebd43359.txt
aws     321 root 6w REG  202,80 2391056384 15466505 /mnt/idseq/data/czbiohub-idseq-samples-production-samples-1-1-fastqs/fastqs/FREZtarsABDO05_DNA_S9_R1_001.fastq.gz.74f9ab8e

Do these make sense?
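As a reading aid, the listing above can be reduced to the distinct PIDs that hold files open on the volume. This is a small sketch with a hypothetical function name; it assumes the default lsof column layout shown above (PID in the second column) and paths without spaces.

```shell
# Sketch only: pids_from_lsof_table is a hypothetical helper.
# Reads a default-format lsof listing on stdin, skips the header row,
# and prints each distinct PID once, sorted numerically.
pids_from_lsof_table() {
    awk 'NR > 1 { print $2 }' | sort -n -u
}
```

Fed the four rows above, it yields three PIDs (305, 320, 321), because the aws process appears twice with two different open files.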

The problem is that when the SIGEXIT trap handler tries to kill these, it sometimes kills itself -- not immediately, but maybe 10-20 seconds later.
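The thread does not establish why the handler dies. One defensive measure, sketched here under the assumption that the handler's own shell or its parent can end up in the PID list, is to filter those PIDs out before signalling anything. The function names are hypothetical and this is not the code from the PR.

```shell
# Sketch only: hypothetical helpers, not taken from the PR.

# Read PIDs on stdin and print only those NOT given as arguments.
exclude_pids() {
    protected=" $* "
    while read -r pid; do
        [ -n "$pid" ] || continue
        case "$protected" in
            *" $pid "*) ;;                    # protected PID: skip it
            *) printf '%s\n' "$pid" ;;
        esac
    done
}

# Send TERM to every PID on stdin except this shell ($$) and its parent ($PPID).
# Failures from kill (e.g. the process already exited) are ignored.
kill_except_self() {
    exclude_pids "$$" "$PPID" | while read -r pid; do
        kill -TERM "$pid" 2>/dev/null || true
    done
}
```

Inside a trap handler, `$$` still expands to the PID of the shell that installed the trap, even within the pipeline's subshell, which is why it is usable as the "do not kill" marker here.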

Very odd.

Another thing that happens is that if the Docker container does not exit within 60 seconds of the job termination command, Amazon kills it anyway, and the volume does not get deleted.

So @yunfangjuan, we may have no choice but to get a little more aggressive with those delays.

boris-dimitrov commented 6 years ago

Okay, finally tweaked it so it works better. Shorter delays were actually important to be able to delete the volume before AWS gives up on the Docker container.

Here is the proof that it is working: https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/batch/job;stream=aegea_batch/default/ecffc32e-9489-4fff-8fe2-0b332c0f1207

It's a tight spot between avoiding the watchdog and the rate limiter.
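To make the timing constraint concrete, here is a sketch of a retry loop bounded by a wall-clock budget, so that cleanup gives up before the 60-second limit mentioned earlier in the thread. The function name, the 45-second budget, and the 2-second delay are illustrative assumptions, not values from the PR.

```shell
# Sketch only: retry_until_deadline is a hypothetical helper.
# Usage: retry_until_deadline BUDGET_SECONDS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or BUDGET_SECONDS have elapsed.
# Returns 0 on success, 1 if the budget ran out first.
retry_until_deadline() {
    budget=$1
    delay=$2
    shift 2
    deadline=$(( $(date +%s) + budget ))
    until "$@"; do
        # Stop retrying once the deadline has been reached.
        [ "$(date +%s)" -lt "$deadline" ] || return 1
        sleep "$delay"
    done
    return 0
}

# Example (assumed values and a hypothetical mount point):
#   retry_until_deadline 45 2 umount /mnt/example
```

A shorter delay gives more attempts within the budget, at the cost of issuing the retried command more often, which is the trade-off described above.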

boris-dimitrov commented 6 years ago

https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/batch/job;stream=aegea_batch/default/c71f4d5f-f649-4019-943b-9934a7fc2a97

boris-dimitrov commented 6 years ago

@kislyuk I believe this is now ready.