pytroll / trollflow2

Next generation Trollflow. Trollflow is for batch-processing satellite data using Satpy
https://trollflow2.readthedocs.org/
GNU General Public License v3.0

Add a plugin to upload the generated images to S3 #158

Closed pnuu closed 1 year ago

pnuu commented 2 years ago

This PR adds a plugin that uploads the generated files to S3. ~Optionally~ Also deletes the local files after the transfer.

codecov[bot] commented 2 years ago

Codecov Report

Merging #158 (b8f6aed) into main (764661a) will decrease coverage by 0.35%. The diff coverage is 98.19%.

@@            Coverage Diff             @@
##             main     #158      +/-   ##
==========================================
- Coverage   95.70%   95.36%   -0.35%     
==========================================
  Files          11       13       +2     
  Lines        2491     2652     +161     
==========================================
+ Hits         2384     2529     +145     
- Misses        107      123      +16     
Flag        Coverage Δ
unittests   95.36% <98.19%> (-0.35%) :arrow_down:


Impacted Files                         Coverage Δ
trollflow2/launcher.py                 87.22% <80.00%> (-0.62%) :arrow_down:
trollflow2/plugins/__init__.py         93.07% <100.00%> (+0.09%) :arrow_up:
trollflow2/plugins/s3.py               100.00% <100.00%> (ø)
trollflow2/tests/test_launcher.py      98.31% <100.00%> (+0.18%) :arrow_up:
trollflow2/tests/test_s3_plugins.py    100.00% <100.00%> (ø)
trollflow2/tests/test_trollflow2.py    98.38% <100.00%> (-1.13%) :arrow_down:
trollflow2/tests/utils.py              74.46% <100.00%> (+9.76%) :arrow_up:


mraspaud commented 2 years ago

@pnuu thanks for the work in this PR. I thought a little about this, and I think, though, that what is being added here is in fact a dispatching mechanism, for which we have a process in trollmoves: https://github.com/pytroll/trollmoves/blob/main/trollmoves/dispatcher.py I think the benefit of the dispatcher is that no time is lost dispatching/uploading the file inside the trollflow run...

pnuu commented 2 years ago

Ok, I've tested this PR with proper reading and saving, and it works.

The only remaining issue is that the trollflow2.launcher.check_results function logs ERROR messages because the files are no longer available after the S3 upload and subsequent deletion. Any thoughts on what to do about this @mraspaud?

pnuu commented 2 years ago

Still some work needed in the messaging: the published messages have URIs like this: /tmp/s3:/bucket-name/20220908_1050_Meteosat-10_EPSG3035_wv73.tif. So the temporary save path is there, and also a slash is missing after the scheme.
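
One way a URI like this can arise (an assumption, not something confirmed in this thread) is that the S3 URI gets passed through local path helpers: joining it to the temporary directory prepends /tmp, and path normalization collapses the double slash after the scheme. A minimal sketch of the symptom and of building the URI from the bucket and the base name instead. Both `mangled_uri` and `published_uri` are hypothetical helpers for illustration, not trollflow2 functions:

```python
import posixpath


def mangled_uri(tmp_dir, uri):
    """Reproduce the symptom by treating an S3 URI as a local path."""
    # normpath collapses "s3://" to "s3:/"; join then prepends the temp dir.
    return posixpath.join(tmp_dir, posixpath.normpath(uri))


def published_uri(local_path, bucket):
    """Build the S3 URI from the bucket and the file's base name only."""
    return "s3://" + bucket + "/" + posixpath.basename(local_path)


name = "20220908_1050_Meteosat-10_EPSG3035_wv73.tif"
print(mangled_uri("/tmp", "s3://bucket-name/" + name))
print(published_uri("/tmp/" + name, "bucket-name"))
```

The point of the sketch is only that URIs with a scheme should be assembled as strings (or with a URL-aware helper) rather than with filesystem path functions.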

pnuu commented 2 years ago

With the latest commit the messages are now working. Still need to figure out what to do with trollflow2.launcher.check_results.

pnuu commented 2 years ago

> @pnuu thanks for the work in this PR. I thought a little about this, and I think, though, that what is being added here is in fact a dispatching mechanism, for which we have a process in trollmoves: https://github.com/pytroll/trollmoves/blob/main/trollmoves/dispatcher.py I think the benefit of the dispatcher is that no time is lost dispatching/uploading the file inside the trollflow run...

Hadn't noticed this comment. The dispatcher would work when running on a (virtual) server with multiple tasks, but not when running in a container where there's supposed to be only a single process running and the saving is done inside the container. Using a shared volume might work, but that would also add a new container and set of configuration files.

pnuu commented 2 years ago

This now uses trollmoves.movers.S3Mover instead of direct s3fs calls.

pnuu commented 2 years ago

It seems that the upload is done to s3://<bucket name>/<image name>/<image name>...

pnuu commented 2 years ago

Ok, it was a usage error that was revealed when uploading to real S3. The filename was interpreted as a prefix ("directory") by botocore. Fixed with https://github.com/pytroll/trollflow2/pull/158/commits/5d53219409ae5e5f0c5b13429ff83001de33941e
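
The duplicated name can be illustrated without any S3 access. The sketch below is not the code from the linked commit and does not use the trollmoves or botocore APIs; `destination_key` is a hypothetical helper that only shows how a destination that already contains the file name ends up as `<image name>/<image name>` when it is treated as a prefix:

```python
import posixpath


def destination_key(destination, filename, treat_as_prefix):
    """Return the object key an upload would end up using (illustrative only)."""
    if treat_as_prefix:
        # Destination is taken as a "directory": the file name is appended.
        return posixpath.join(destination, filename)
    # Destination is taken as the full object key.
    return destination


print(destination_key("bucket-name/image.tif", "image.tif", True))
print(destination_key("bucket-name/image.tif", "image.tif", False))
```

So either the destination should be the bare prefix, or the uploader must be told that the destination is already the complete key.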

pnuu commented 2 years ago

> Ok, I've tested this PR with proper reading and saving, and it works.
>
> The only remaining issue is that the trollflow2.launcher.check_results function logs ERROR messages because the files are no longer available after the S3 upload and subsequent deletion. Any thoughts on what to do about this @mraspaud?

I'm now thinking that maybe the S3 uploading and local file deletion should be separate plugins :thinking: The other options seem more complicated.

pnuu commented 2 years ago

Never mind, that won't work either because check_results() is called after all the plugins. The filenames are put into a queue in the save_datasets plugin, so updating the filenames to their S3 variants and modifying check_results() to do the check over S3 won't work directly either.

One option would be to add a remote_filesystem=None kwarg, i.e. check_results(..., remote_filesystem=None), and with minor refactoring handle each protocol as the need arises. For S3 that would be something like

from s3fs import S3FileSystem

def _check_s3_file(saved_file):
    # Return the object's size in bytes.
    s3 = S3FileSystem()
    return s3.stat(saved_file)['size']

which would raise FileNotFoundError for a missing file, an exception that is already handled.
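
To make the kwarg idea concrete, here is a minimal sketch of the dispatch, under the assumption that the remote filesystem is any fsspec-style object whose `stat()` returns a dict with a `size` entry. `file_size`, `_check_local_file`, `_check_remote_file` and `FakeRemoteFS` are hypothetical names used for illustration and are not part of trollflow2.launcher:

```python
import os


def _check_local_file(saved_file):
    """Return the size of a local file; raises FileNotFoundError if missing."""
    return os.stat(saved_file).st_size


def _check_remote_file(saved_file, remote_filesystem):
    """Return the size of a remote object via an fsspec-style stat()."""
    return remote_filesystem.stat(saved_file)["size"]


def file_size(saved_file, remote_filesystem=None):
    """Pick the checker based on whether a remote filesystem was given."""
    if remote_filesystem is None:
        return _check_local_file(saved_file)
    return _check_remote_file(saved_file, remote_filesystem)


class FakeRemoteFS:
    """Stand-in for a filesystem object such as s3fs.S3FileSystem()."""

    def __init__(self, sizes):
        self._sizes = sizes

    def stat(self, path):
        if path not in self._sizes:
            raise FileNotFoundError(path)
        return {"size": self._sizes[path]}
```

With this shape the caller keeps a single `except FileNotFoundError` branch for both local and remote files, and only the object passed as `remote_filesystem` changes per protocol.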

@mraspaud any other ideas?

mraspaud commented 2 years ago

I thought a bit more about this, to try to keep things clean and separated. Could we have a custom class for taking care of files? For example, something that mimics a Python file object and uploads the file on close?
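
A rough sketch of what such a class could look like, assuming the actual transfer is delegated to a caller-supplied function. `UploadOnClose` and its `upload` callable are hypothetical and are not the implementation that ended up in this PR:

```python
import os
import tempfile


class UploadOnClose:
    """Write to a local temporary file and hand it to `upload` on close."""

    def __init__(self, destination, upload, remove_local=True):
        self.destination = destination
        self._upload = upload  # callable(local_path, destination)
        self._remove_local = remove_local
        self._file = tempfile.NamedTemporaryFile(delete=False)
        self.name = self._file.name

    def write(self, data):
        return self._file.write(data)

    def close(self):
        if self._file.closed:
            return  # Already closed and uploaded.
        self._file.close()
        self._upload(self.name, self.destination)
        # Only reached if the upload did not raise, so a failed upload
        # leaves the local copy in place.
        if self._remove_local:
            os.remove(self.name)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
```

Used as `with UploadOnClose("s3://bucket-name/image.tif", upload) as fid: fid.write(data)`, the writer code stays unaware of S3, and the upload step can be swapped out for any callable that accepts a local path and a destination.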

pnuu commented 2 years ago

I think this should work for what I need. I'll add a test for this tomorrow unless there's something more urgent happening.

pnuu commented 2 years ago

I think this is now closer to ready, and all comments should have been handled one way or another.

pnuu commented 1 year ago

WTH did that feature-s3-upload-plugin come from?! It's now in 18 random places in test_trollflow2.py :man_facepalming:

pnuu commented 1 year ago

I ran a test in OpenShift with the latest commits and #159 (merged locally) and everything is still working.