pytroll / pytroll-collectors

Collector modules for Pytroll
GNU General Public License v3.0
3 stars 19 forks source link

Check for existing segments on S3 storage #117

Closed pnuu closed 1 year ago

pnuu commented 2 years ago

This PR adds a possibility to check for existing file segments on S3 storage when segment gatherer is started. This is already possible with local file systems by defining check_existing_files_after_start = True in the configuration file.

Not sure where to explain how the S3 configuration (end point, credential) is done.

pnuu commented 2 years ago

This is what I have in Trollmoves S3Mover docstring:

    All the connection configurations and such are done using the `fsspec` configuration system:

    https://filesystem-spec.readthedocs.io/en/latest/features.html#configuration

    An example configuration could be for example placed in `~/.config/fsspec/s3.json`::

        {
            "s3": {
                "client_kwargs": {"endpoint_url": "https://s3.server.foo.com"},
                "secret": "VERYBIGSECRET",
                "key": "ACCESSKEY"
            }
        }
codecov[bot] commented 2 years ago

Codecov Report

Merging #117 (b0b7698) into main (62a1d71) will decrease coverage by 0.07%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #117      +/-   ##
==========================================
- Coverage   90.52%   90.46%   -0.07%     
==========================================
  Files          22       25       +3     
  Lines        3577     3848     +271     
==========================================
+ Hits         3238     3481     +243     
- Misses        339      367      +28     
Impacted Files Coverage Δ
pytroll_collectors/segments.py 92.22% <100.00%> (+0.10%) :arrow_up:
pytroll_collectors/tests/test_segments.py 100.00% <100.00%> (+0.72%) :arrow_up:
pytroll_collectors/__init__.py 100.00% <0.00%> (ø)
pytroll_collectors/geographic_gatherer.py 88.14% <0.00%> (ø)
pytroll_collectors/s3stalker.py 73.01% <0.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

pnuu commented 2 years ago

I created a new issue on documenting S3 configuration. It depends on completion of #110 and #114.

pnuu commented 2 years ago

It seems that when checking for the segments, the scheme part is not included when adding the segment to the message dataset. So they would be bucket-name/file1.seg instead of s3://bucket-name/file1.seg. I'll have a look at this tomorrow.

pnuu commented 2 years ago

Hngh, I have somehow messed up parts of the PR and don't know where some of the code came from..

pnuu commented 2 years ago

Ok, the results from s3fs.S3FileSystem.glob() don't include 's3://' in them. Now the collection works after a restart with usable URIs and I consider this completed (until review, at least :grin:).

As for the mysteriously appeared code fragment (https://github.com/pytroll/pytroll-collectors/pull/117/commits/53a8f74a038ff3297ae31814372c13134f7f2567#diff-76bd104754afbf31db6d4cf2bfa4f2df0dcb2fd57bbf7b9ca62b67be7cbeea8dL383), I have no idea where it came from. I just revered to the original (the version in main) for simplicity.

pnuu commented 2 years ago

Finally remembered to document the S3 configuration. @mraspaud ok to merge?

pnuu commented 2 years ago

With https://github.com/pytroll/pytroll-collectors/pull/117/commits/9af66a82b9aef392d5a4c3d8e2ba0636a0515a3c all glob() calls are done using fsspec.

adybbroe commented 1 year ago

@mraspaud Do you have time to look at this one again?

pnuu commented 1 year ago

@mraspaud one final look, please?