Closed boris-dimitrov closed 6 years ago
Joint work with @yunfangjuan.
Thanks. Can you update this to use apt-get update; apt-get install -y
and also use apt-get to install httpie and any other dependencies of the shell code that you can think of and that are not in the standard ubuntu Docker image?
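For illustration, a minimal sketch of what that install step could look like. Only httpie is named in the request; lsof and psmisc (the Ubuntu package that ships fuser) are guesses based on the tools mentioned elsewhere in this thread, and the DEPS and install_deps names are hypothetical.

```shell
# Packages to add on top of the stock ubuntu image.
# httpie comes from the review request; lsof and psmisc are assumptions.
DEPS="httpie lsof psmisc"

install_deps() {
    # Refresh the package index first; the stock image ships without one.
    # DEPS is left unquoted on purpose so it splits into separate package names.
    apt-get update && apt-get install -y $DEPS
}
```

Calling install_deps as root inside the container would perform the install. Chaining with && instead of ; stops the install from running against a stale index if the update fails.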
I am still tweaking this and having some trouble killing just those processes (without also killing the SIGEXIT trap handler).
Here are the processes using the mounted EBS filesystem, according to lsof.
COMMAND  PID USER  FD  TYPE DEVICE   SIZE/OFF     NODE NAME
python   305 root  3w  REG  202,80          0 15466504 /mnt/idseq/data/czbiohub-idseq-samples-production-samples-1-1-fastqs/results/log.26e03af0-3afa-4627-8e1b-8427ebd43359.txt
sh       320 root  3w  REG  202,80          0 15466504 /mnt/idseq/data/czbiohub-idseq-samples-production-samples-1-1-fastqs/results/log.26e03af0-3afa-4627-8e1b-8427ebd43359.txt
aws      321 root  3w  REG  202,80          0 15466504 /mnt/idseq/data/czbiohub-idseq-samples-production-samples-1-1-fastqs/results/log.26e03af0-3afa-4627-8e1b-8427ebd43359.txt
aws      321 root  6w  REG  202,80 2391056384 15466505 /mnt/idseq/data/czbiohub-idseq-samples-production-samples-1-1-fastqs/fastqs/FREZtarsABDO05_DNA_S9_R1_001.fastq.gz.74f9ab8e
Do these make sense?
The problem is that when the SIGEXIT trap handler tries to kill these, it sometimes kills itself: not immediately, but maybe 10-20 seconds later.
Very odd.
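One possible explanation, not confirmed in this thread, is that the handler's own shell shows up in the list of PIDs holding files on the volume (note the sh entry in the lsof output). A minimal sketch of filtering protected PIDs out of a kill list, assuming that is the cause; exclude_pids is a hypothetical helper and /mnt/idseq is assumed to be the mount point:

```shell
# exclude_pids KEEP...
# Reads candidate PIDs from stdin (one per line) and prints every PID
# that is not listed among the KEEP arguments.
exclude_pids() {
    while read -r pid; do
        skip=0
        for keep in "$@"; do
            if [ "$pid" = "$keep" ]; then
                skip=1
            fi
        done
        if [ "$skip" -eq 0 ]; then
            echo "$pid"
        fi
    done
    return 0
}

# Possible use inside the trap handler (not executed here):
#   lsof -t /mnt/idseq | exclude_pids "$$" | xargs -r kill
```

Here $$ expands to the PID of the shell running the handler, so that shell is dropped from the list before anything is signalled.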
Another thing that happens is that if the Docker container does not exit within 60 seconds of the job termination command, Amazon kills it anyway, and the volume does not get deleted.
So @yunfangjuan, we may have no choice but to get a little more aggressive with those delays.
Okay, finally tweaked it so it works better. Shorter delays were actually important to be able to delete the volume before AWS gives up on the Docker container.
Here is the proof that it is working: https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/batch/job;stream=aegea_batch/default/ecffc32e-9489-4fff-8fe2-0b332c0f1207
It's a tight spot between avoiding the watchdog and the rate limiter.
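To make the timing constraint concrete, here is a rough sketch of a retry loop that stops before a fixed time budget runs out. The 60-second limit comes from the comment above; the 50-second budget, the 2-second delay, and the retry_until and delete_volume names are illustrative assumptions, not the values or code used in this PR.

```shell
# retry_until BUDGET_SECS DELAY_SECS CMD [ARGS...]
# Runs CMD until it succeeds, sleeping DELAY_SECS between attempts.
# Gives up (returns 1) once another sleep would reach BUDGET_SECS.
retry_until() {
    budget=$1
    delay=$2
    shift 2
    start=$(date +%s)
    while :; do
        if "$@"; then
            return 0
        fi
        now=$(date +%s)
        if [ $((now - start + delay)) -ge "$budget" ]; then
            return 1
        fi
        sleep "$delay"
    done
}

# Possible use (delete_volume is a hypothetical function):
#   retry_until 50 2 delete_volume
```

A longer delay means fewer API calls and less risk of rate limiting, while a shorter delay leaves more attempts inside the budget, which is the tradeoff described above.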
@kislyuk I believe this is now ready.
Killing the processes that hold the volume open appears to greatly improve the probability of success, particularly when done three different ways: via fuser, via lsof, and by killing all of the shell's children.
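A rough sketch of how the three methods might be combined. This is not the code from this PR: the /mnt/idseq mount point, the exact flags, and the run and kill_volume_users names are assumptions. It also makes no attempt to protect the handler's own shell, so anything that must survive should not hold files open on the volume.

```shell
# Assumed mount point of the EBS volume.
MOUNT=${MOUNT:-/mnt/idseq}

# Print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        # A method that finds nothing to kill is not treated as an error.
        "$@" || true
    fi
}

kill_volume_users() {
    # 1. fuser: -m selects every process using the filesystem, -k kills
    #    them (SIGKILL unless another signal is given).
    run fuser -k -m "$MOUNT"
    # 2. lsof: -t prints bare PIDs, which xargs passes to kill (SIGTERM).
    run sh -c 'lsof -t "$1" | xargs -r kill' sh "$MOUNT"
    # 3. children: signal every direct child of this shell.
    run pkill -P "$$"
}
```

Running DRY_RUN=1 kill_volume_users prints the three commands without executing them, which is a safer way to check the setup than running it against a live mount.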