ubarsc / rios

A raster processing layer on top of GDAL
https://www.rioshome.org
GNU General Public License v3.0
14 stars 7 forks source link

Implement parallel processing with AWS Batch #61

Closed gillins closed 1 year ago

gillins commented 1 year ago

And create a directory structure to support adding other AWS services in future.

A likely second candidate service would be AWS ECS where the cluster scales up and down according to load so we don't have to specifically start and stop jobs, although the details of doing this is beyond my current knowledge.

ping: @petescarth @tonykgill

gillins commented 1 year ago

Is there interest in supporting instance termination and restart for jobs that are submitted via SPOT pricing? We'd have to detect this and resubmit any partially completed jobs.

neilflood commented 1 year ago

OK, I have walked through this a couple of times, and I don't see much that seriously bothers me. Well done.

I do have some questions, largely trivial.

gillins commented 1 year ago

Thanks for looking at this @neilflood. Responses to your points below:

neilflood commented 1 year ago

Ah, very good, thank you. That all makes sense.

re: the bucket name. So, does that mean it is one shared bucket for all people using this facility in RIOS? If so, I think I am a bit troubled by that. Partly because of potential for filename clashes, but also because it does not seem like good security.

gillins commented 1 year ago

re: s3 bucket. Good call, they do have to be unique globally. But AWS permissions are enforced so if I create a bucket another user wouldn't be able to create it again or access it. I've appended the AccountId to the bucket name to make it more unique - there still could be issues with users sharing the same AccountId but that's their problem! I've also made the stack name and region env vars for now and documented them. If all ok I'll do some more testing and merge.

gillins commented 1 year ago

I should have also said - AFAICT all the other resources created by this CloudFormation just need to be unique for the account - the URL to the resource (SQS/ECR etc) has the account id already embedded.

neilflood commented 1 year ago

re: bucket. Yes, that seems much more like what I would wish. It should not be a shared thing, but should be created by them, with locked-down permissions.

I don't suppose AWS has an equivalent of Python's tempfile.mkdtemp()? That is really the sort of thing I would prefer.

Just did some Googling, and found this in the CloudFormation doc pages

BucketName
A name for the bucket. If you don't specify a name, AWS CloudFormation generates a 
unique ID and uses that ID for the bucket name.

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-s3-bucket.html

I don't really understand how then get the unique name back from it, but if you can understand this, might that be the right thing to do?

re: other resources. Excellent, that is what you want, I think.

gillins commented 1 year ago

I don't suppose AWS has an equivalent of Python's tempfile.mkdtemp()?

Not that I'm aware of, no.

Just did some Googling, and found this in the CloudFormation doc pages

The name would be saved to the outputs of the CloudFormation stack like currently and the getStackOutputs() function would be able to return them.

But, I was worried if we did this the user would have no idea when they are looking at their s3 buckets that this one was created by RIOS hence why I was keen to have rios (or whatever they override ServiceName with) in the name. So they could accidentally delete the bucket without knowing it is part of the stack... At least if rios is in the name that might help.

neilflood commented 1 year ago

Ah, OK, fair point.

Anyway, I think that your compromise with the AccountID included in the name is probably good enough.

I think I am happy with it, so when you feel it is ready, go ahead and merge.

neilflood commented 1 year ago

Just one more small concern. How is this aws stuff visibly documented? Where does the README.md file show up? I can see that the information is there, but not at all sure I understand where one would be able to read it. Would that be better put in a file so that it comes up on ReadTheDocs?

I have no idea what is best, but it wasn't obvious how/where the current doc will come out.

gillins commented 1 year ago

Hmm yes good point. The README.md just shows up in github which is less than ideal since we usually point everybody to readthedocs. I think I was thinking about keeping the docs close to the code, or something. I'll have a go at fixing this up tomorrow.

gillins commented 1 year ago

OK have moved the doco into the docstring for batch.py. Does this seem alright @neilflood ? I can confirm that everything still seems to run.

neilflood commented 1 year ago

Thanks @gillins, that seems better. Should be as discoverable as anything else, now. No doubt there are things to improve, but that will come with time.

Is the line about Setting up your main script supposed to be a heading?

Also, it looks like there are some more ToC issues on these files. I must have missed this in rios_parallel_jobmanager.rst, and it also seems to be required for the new rios_parallel_aws_batch.rst. Could you do those in this PR? Just add :tocdepth: 2 at the top of those .rst files, see rios_applier.rst for an example.

gillins commented 1 year ago

Yes no doubt there will be more to tidy on this in time. I think this is all fixed now.

neilflood commented 1 year ago

Looks good :-)