paws-r / paws

Paws, a package for Amazon Web Services in R
https://www.paws-r-sdk.com

R-focused interprocess communication with jobs in Batch #330

Open wlandau opened 3 years ago

wlandau commented 3 years ago

I see that paws is capable of submitting jobs to Batch. Can it also send and retrieve in-memory R objects? I am interested in potentially building on top of paws for https://github.com/wlandau/targets/issues/152 via https://github.com/HenrikBengtsson/future/issues/423 and https://github.com/mschubert/clustermq/issues/208. For the latter, just the ability to communicate between the master process and Batch workers over ZeroMQ sockets should be enough. cc @HenrikBengtsson, @mschubert.
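To make the ZeroMQ part concrete, the round trip I have in mind is roughly the following, sketched with rzmq (the address and port are placeholders; clustermq manages sockets like these internally):

# On a Batch worker: serve one task over a REP socket.
library(rzmq)
context <- init.context()
socket <- init.socket(context, "ZMQ_REP")
bind.socket(socket, "tcp://*:5555")
task <- receive.socket(socket)   # an arbitrary R object from the master
send.socket(socket, eval(task))  # reply with another R object

# In the master process: connect a REQ socket and send work.
library(rzmq)
context <- init.context()
socket <- init.socket(context, "ZMQ_REQ")
connect.socket(socket, "tcp://<worker-ip>:5555")  # placeholder address
send.socket(socket, quote(1 + 1))
receive.socket(socket)  # 2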

wlandau commented 3 years ago

For reference, my use case is very similar to https://gist.github.com/DavisVaughan/865d95cf0101c24df27b37f4047dd2e5.

davidkretch commented 3 years ago

Oh wow, that would be super cool. Unfortunately (as far as I know) Paws itself can't send any in-memory objects to a Batch-run instance. All of Paws' operations correspond to what AWS exposes through its APIs, and for Batch those are all about defining, describing, or starting jobs, not about communicating directly with the instances that get started.
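You can see that surface area on the client object itself, since every operation is just a function in the list that paws returns:

# List the Batch operations paws exposes; all of them manage job
# definitions, queues, compute environments, and jobs -- none of them
# talk to a running container.
batch <- paws::batch()
grep("job", names(batch), value = TRUE)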

Based on one of the links, if you're going to connect via SSH, it looks like you could get the names or IPs of the Batch instances by an unfortunately roundabout path: 1) get the job's container instance ARN from Batch's DescribeJobs, 2) get the container instance's EC2 instance ID from ECS's DescribeContainerInstances, 3) get the EC2 instance's public IP or DNS name from EC2's DescribeInstances.

All these have corresponding functions in Paws' batch, ecs, and ec2 clients, e.g. batch$describe_jobs.

The other pattern I have seen for communicating with Batch is to send input data to an S3 bucket, have the Batch job's container read it in, and have it write results back out to S3 as well. But that may not work so well for your use cases. That pattern is based on this blog post: Fetch and Run Batch Job. I made a basic R implementation of it here. There might be better ways by this point.
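As a rough sketch of that hand-off (the bucket name and keys below are made up, and the job's IAM role would need read/write access to the bucket):

s3 <- paws::s3()

# Main process: serialize an R object and upload it for the job to read.
input <- list(fun = "mean", args = list(x = 1:10))
s3$put_object(
  Bucket = "my-batch-bucket",
  Key = "jobs/job-1/input.rds",
  Body = serialize(input, NULL)
)

# Inside the Batch container: download the input, compute, upload the result.
obj <- s3$get_object(Bucket = "my-batch-bucket", Key = "jobs/job-1/input.rds")
input <- unserialize(obj$Body)
result <- do.call(input$fun, input$args)
s3$put_object(
  Bucket = "my-batch-bucket",
  Key = "jobs/job-1/output.rds",
  Body = serialize(result, NULL)
)

# Back in the main process: retrieve and deserialize the result.
out <- s3$get_object(Bucket = "my-batch-bucket", Key = "jobs/job-1/output.rds")
unserialize(out$Body)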

I can't say I have a lot of time or much experience with this particular application, but I'd also love to help if I can. I'll look this weekend to see if I can find any more recent or better approaches.

wlandau commented 3 years ago

Glad you see as much value in this as I do! Thanks for those ideas, and I would love help exploring the possibilities. I understand if this is out of scope for paws itself.

So far, I have not had much luck with DescribeJobs. aws batch describe-jobs --jobs my_job just returns an empty list even when the job is running. Could just be a temporary setback.

The fetch-and-run approach definitely seems to better align with what Batch is designed to do. I suppose we may end up circling back to it. Still hoping for some kind of pub/sub interaction because it is theoretically simpler and faster and should help targets completely decouple cloud storage from cloud compute.

davidkretch commented 3 years ago

Hello, sorry for the delay. I haven't tried connecting to a Batch job instance via SSH yet, but I have put together this example code for getting the instance IP address:

# Submit a job.
batch <- paws::batch()
job <- batch$submit_job(...) # Fill in job parameters...

# Get the job's info.
job_info <- batch$describe_jobs(jobs = list(job$jobId))

# Get the ARN for the ECS cluster that is running the Batch job's container.
job_queue_arn <- job_info$jobs[[1]]$jobQueue
job_queue <- batch$describe_job_queues(jobQueues = list(job_queue_arn))
compute_environment_arn <- job_queue$jobQueues[[1]]$computeEnvironmentOrder[[1]]$computeEnvironment
compute_environment <- batch$describe_compute_environments(
  computeEnvironments = list(compute_environment_arn)
)
cluster_arn <- compute_environment$computeEnvironments[[1]]$ecsClusterArn

# Get the ARN for the specific container instance running the job.
container_instance_arn <- job_info$jobs[[1]]$container$containerInstanceArn

# Get the EC2 instance that the job is running on.
ecs <- paws::ecs()
container_instance <- ecs$describe_container_instances(
  cluster = cluster_arn,
  containerInstances = list(container_instance_arn)
)
ec2 <- paws::ec2()
ec2_instance_id <- container_instance$containerInstances[[1]]$ec2InstanceId
ec2_instance <- ec2$describe_instances(InstanceIds = list(ec2_instance_id))

print(ec2_instance$Reservations[[1]]$Instances[[1]])
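From there, assuming the instance is in a public subnet with SSH open in its security group, the address to connect to would be:

# Hypothetical follow-on: get the address for an SSH connection.
ip <- ec2_instance$Reservations[[1]]$Instances[[1]]$PublicIpAddress
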
davidkretch commented 3 years ago

In addition to EC2 as you proposed, I will look into the pros/cons of Elastic Container Service and Fargate. I haven't actually used either of these. I will look more into your example links to see how they might work together.

wlandau commented 3 years ago

Wow, thank you so much! This is super generous of you, and these last couple posts have already helped enormously. Can't wait to try out https://github.com/paws-r/paws/issues/330#issuecomment-711440381, and Fargate looks promising for data analysis pipelines.

davidkretch commented 3 years ago

Hey, sorry for the delay. I am going to look into using Fargate a little this weekend. I saw on another issue that you're planning to use Batch and don't need to talk directly to the Batch workers. Is that now working?

wlandau commented 3 years ago

> Hey, sorry for the delay. I am going to look into using Fargate a little this weekend.

It's okay, I got distracted by other things too. I appreciate your willingness to follow up.

> I saw on another issue that you're planning to use Batch and don't need to talk directly to the Batch workers. Is that now working?

Right now it's just an idea. @mschubert explained that clustermq workers reverse-tunnel back to the main process, which makes me think it might be possible to use AWS Batch just like SLURM etc. I think that would be a fantastic direction for persistent workers, though if the implementation depends on me, it will be some time before I am familiar enough with the Batch command line and clustermq internals. And even if we're successful, it only covers persistent workers, which can be more wasteful than transient workers in some use cases.
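For reference, this is how clustermq gets pointed at a scheduler today (SLURM shown; a hypothetical Batch backend would presumably slot in the same way):

# Existing clustermq usage with a traditional scheduler.
options(
  clustermq.scheduler = "slurm",
  clustermq.template = "slurm.tmpl"  # submit-script template
)
library(clustermq)
Q(function(x) x * 2, x = 1:4, n_jobs = 2)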