pepijn-devries / CopernicusMarine

Subset and download marine data from EU Copernicus Marine Service Information. Import data on the oceans' physical and biogeochemical state from Copernicus into R without the need for external software.
https://pepijn-devries.github.io/CopernicusMarine/
GNU General Public License v3.0

cms_list_stac_files does not work for single-file datasets #36

Open raymondben opened 7 months ago

raymondben commented 7 months ago

Thanks for the great work with this package. I found a small problem with datasets that consist of only a single file, e.g. the MDT dataset:

> cms_list_stac_files(product = "SEALEVEL_GLO_PHY_MDT_008_063")
# A tibble: 0 × 0

This happens because the Copernicus API returns the actual file, not its bucket, when you query the STAC properties:

> cms_stac_properties(product = "SEALEVEL_GLO_PHY_MDT_008_063")$href
[1] "https://s3.waw3-1.cloudferro.com/mdl-native-07/native/SEALEVEL_GLO_PHY_MDT_008_063/cnes_obs-sl_glo_phy-mdt_my_0.125deg_P20Y_202012/mdt_hybrid_cnes_cls18_cmems2020_global.nc"

cms_list_stac_files tries to issue a list-bucket request to this URL, which of course doesn't work.

I have a workaround for my own needs, but it would be good to fix. I have not provided a PR because I don't know the best solution. You could perhaps detect that the href ends with an actual filename and throw that part away. But reliably detecting filenames might not be straightforward: you could match known file extensions, or perhaps even just paths that end with "." followed by two or three more characters, but either approach seems fragile.
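A minimal sketch of the extension heuristic described above, in base R. The helper name `strip_trailing_file` and the regular expression are assumptions made for illustration; they are not part of CopernicusMarine. The second call uses a made-up URL to show the kind of false positive that makes this approach fragile.

```r
# Hypothetical helper (not part of the package): if the last path segment
# ends in "." plus 2 to 4 alphanumeric characters, treat it as a file name
# and drop it; otherwise return the href unchanged.
strip_trailing_file <- function(href) {
  looks_like_file <- grepl("/[^/]+\\.[A-Za-z0-9]{2,4}$", href)
  ifelse(looks_like_file, sub("/[^/]+$", "", href), href)
}

# The href reported in this issue: the trailing ".nc" file is removed.
href <- paste0(
  "https://s3.waw3-1.cloudferro.com/mdl-native-07/native/",
  "SEALEVEL_GLO_PHY_MDT_008_063/",
  "cnes_obs-sl_glo_phy-mdt_my_0.125deg_P20Y_202012/",
  "mdt_hybrid_cnes_cls18_cmems2020_global.nc"
)
stripped <- strip_trailing_file(href)

# Made-up directory name that ends in ".25": it is mistaken for a file.
false_positive <- strip_trailing_file("https://example.org/bucket/layer_v1.25")
```

Applying the helper to an href that already points at the layer directory from this issue leaves it unchanged, because the only "." in that segment is followed by more than four characters.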

I don't think you can rely on href having a predictable structure (e.g. https://host/*-native-*/native/DATASET_ID/LAYER/FILE) because I am guessing that there could be additional subdirectories in between LAYER and FILE. (But if that's not the case, then this might work. Just throw away anything after the 7th element in https://github.com/pepijn-devries/CopernicusMarine/blob/master/R/cms_list_stac_files.r#L12).
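The "keep the first 7 elements" idea can be sketched in base R as follows. This assumes the href is split on "/" (so that the empty string between the two slashes of "https://" counts as an element) and that the layout really is host/bucket/native/DATASET_ID/LAYER/FILE; the variable names are illustrative only and this is not the package's code.

```r
# The href reported in this issue.
href <- paste0(
  "https://s3.waw3-1.cloudferro.com/mdl-native-07/native/",
  "SEALEVEL_GLO_PHY_MDT_008_063/",
  "cnes_obs-sl_glo_phy-mdt_my_0.125deg_P20Y_202012/",
  "mdt_hybrid_cnes_cls18_cmems2020_global.nc"
)

# Splitting on "/" gives: "https:", "", host, bucket, "native",
# dataset id, layer, file name.
parts <- strsplit(href, "/", fixed = TRUE)[[1]]

# Keep at most the first 7 elements, i.e. everything up to the layer.
keep <- parts[seq_len(min(7, length(parts)))]
truncated <- paste(keep, collapse = "/")
```

If there were extra subdirectories between the layer and the file, this would silently cut them off as well, which is the caveat raised above.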

You definitely cannot rely on the actual URL in the href. For the example above, you can see that it's pointing to the file mdt_hybrid_cnes_cls18_cmems2020_global.nc. But that file doesn't actually exist, and when you do a list-bucket query on the bucket, it turns out that the file is called something else. That seems like an error from Copernicus, but nonetheless I think you still have to go through the list-bucket step.
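As a rough illustration of going through the listing step rather than trusting the file name, the sketch below builds a listing URL from the href without making any network request. It assumes path-style addressing (host/bucket), the standard S3 ListObjectsV2 query parameters `list-type=2` and `prefix`, and the 7-element layout discussed above; the function name `list_url_from_href` is hypothetical, and the package may build its request differently.

```r
# Hypothetical helper: turn a STAC href into an S3-style listing URL
# for the layer "directory" that contains the file.
list_url_from_href <- function(href, n_keep = 7) {
  parts <- strsplit(href, "/", fixed = TRUE)[[1]]
  parts <- parts[seq_len(min(n_keep, length(parts)))]
  # Elements 1 to 4 are "https:", "", host and bucket (path-style addressing).
  endpoint <- paste(parts[1:4], collapse = "/")
  # The remaining elements form the key prefix inside the bucket.
  prefix <- paste0(paste(parts[-(1:4)], collapse = "/"), "/")
  paste0(endpoint, "?list-type=2&prefix=",
         utils::URLencode(prefix, reserved = TRUE))
}

href <- paste0(
  "https://s3.waw3-1.cloudferro.com/mdl-native-07/native/",
  "SEALEVEL_GLO_PHY_MDT_008_063/",
  "cnes_obs-sl_glo_phy-mdt_my_0.125deg_P20Y_202012/",
  "mdt_hybrid_cnes_cls18_cmems2020_global.nc"
)
list_url <- list_url_from_href(href)
```

The file names would then come from the listing response, not from the last element of the href.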

raymondben commented 7 months ago

(Also, minor suggestion that I stumbled across while debugging this: you don't need https://github.com/pepijn-devries/CopernicusMarine/blob/master/R/cms_list_stac_files.r#L7. Just put a .data$ prefix on assets in line 12.)

pepijn-devries commented 7 months ago

Hi @raymondben,

Thank you for the detailed report. This is a great help in improving the package. I will study your case and think about how best to handle the case where STAC responds with just the file, instead of a bucket. Your suggestions are really helpful.

> (Also, minor suggestion that I stumbled across while debugging this: you don't need https://github.com/pepijn-devries/CopernicusMarine/blob/master/R/cms_list_stac_files.r#L7. Just put a .data$ prefix on assets in line 12.)

This is also a good point. I think that assets <- NULL is a relic from an earlier version where I didn't import rlang's pronoun .data. Your suggestion would make the code easier to read. I will update this.
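For readers unfamiliar with the two idioms, here is a small sketch of the difference. It uses a made-up data frame and dplyr::filter as a stand-in; it is not the code from cms_list_stac_files.r. The dummy `assets <- NULL` binding exists only so that R CMD check does not report an undefined global variable, while the `.data` pronoun makes the column reference explicit.

```r
# Made-up stand-in for a table with an `assets` column.
stac <- data.frame(
  id = c("a", "b", "c"),
  assets = c("x.nc", "y.txt", "z.nc"),
  stringsAsFactors = FALSE
)

# Older idiom: a dummy local binding keeps R CMD check quiet;
# inside filter() the column still takes precedence over it.
with_dummy <- function(tbl) {
  assets <- NULL
  dplyr::filter(tbl, endsWith(assets, ".nc"))
}

# With the .data pronoun no dummy binding is needed.
# (Inside a package, `.data` would be imported from rlang.)
with_pronoun <- function(tbl) {
  dplyr::filter(tbl, endsWith(.data$assets, ".nc"))
}

res_dummy <- with_dummy(stac)
res_pronoun <- with_pronoun(stac)
```

Both functions select the same rows; the second simply drops the extra line.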

I'll leave this issue open until I have decided on a definitive solution.

Cheers,

Pepijn