Closed simonff closed 6 years ago
Having multiple providers is for me about giving proper credit to all parties that put effort into a dataset. That's more a "political" thing than always useful for a user, but is also a part of provenance, I think.
That said, I think this is part of a much broader discussion about how we organize provenance information, so #179 and #226 are closely related issues. Let's imagine we have a standard for describing applied processes (#179), provider information (this issue) and source information (#226). How would we like to have this organized? We probably want to encapsulate them somehow. So maybe let's think about that, design it as we want it to be, and just leave the missing things out so we can add them later to core or via an extension. In an ideal world we would probably just link back to the source STAC catalog, but that is unlikely to be possible. ;-)
Just a quick example to get discussions started, probably not very well-thought out:
{
  "name": "Sentinel-2A",
  "provider": [
    {
      "name": "ESA",
      "url": "http://www.esa.int",
      "processing": {
        ... Depends on the standard provenance / processing standard
      },
      "source": {
        "scheme": "S3",
        "id": "a-bucket-id",
        "region": "us-east"
      },
      ... Anything else? Processing level for example?
    },
    {
      "name": "Google",
      "url": "http://www.google.com",
      "processing": {},
      "source": {
        "scheme": "GCS",
        "id": "another-bucket-id"
      },
      ... Anything else? Processing level for example?
    }
  ],
  ... More dataset fields
}
This could also include the host now, which would be either the last or first entry, depending on what we decide upon.
Regarding examples: Don't you already have examples in your GEE catalog? For example 1, 2 and 3 - and I just looked at a few of them.
https://developers.google.com/earth-engine/datasets/catalog/LANDSAT_LC08_C01_T1_TOA (provider=USGS/Google) represents light processing (applying TOA) done by Google, and adding ourselves as an extra provider was the least bad option given that we don't show a processing chain.
https://developers.google.com/earth-engine/datasets/catalog/JRC_GSW1_0_MonthlyRecurrence (provider EC JRC / Google) was created as a single product in a collaboration between Google and JRC. On one hand, we want such products to be findable by querying for each organization individually; on the other hand, there's only a single end product with a single homepage (http://global-surface-water.appspot.com/). So adding two independent Provider objects is not right either. We could handle this by adding multiple organization names within a single Provider, to more accurately model the real situation.
https://developers.google.com/earth-engine/datasets/catalog/FIRMS (provider NASA / LANCE / EOSDIS) is even more confusing. This is still a single end product with the homepage https://earthdata.nasa.gov/earth-observation-data/near-real-time/firms. We should not have mentioned EOSDIS at all, I think - it's an infrastructure system within NASA. LANCE is sort of part of NASA, so listing them as two separate providers does not make sense either - there are no two separate NASA and LANCE pages for FIRMS. NASA / LANCE should be read like EU / France.
In your processing example, I don't think Google should be listed as a provider, because then we'll be a provider for everything hosted in EE. OK, for the Sentinel-2 copy on GCS that we maintain, we provide some light processing (unpacking zip archives), so the GCS copy is not a 100% mirror and mentioning this transformation as a processing step is warranted. However, we should probably distinguish data processing (e.g., transforming TOA to SR, that is, top-of-atmosphere imagery to surface reflectance) from metadata processing (unzipping files, converting formats, etc.). In the latter case the data should, generally speaking, be intact, and I'd prefer to clearly differentiate these two situations.
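To make the distinction above concrete, here is a minimal Python sketch. The `ProcessingKind` enum, the step dictionaries and the `data_is_intact` helper are all hypothetical names invented for illustration; nothing like them is defined in the spec or in any proposed extension.

```python
from enum import Enum


class ProcessingKind(Enum):
    """Hypothetical classification of a processing step."""
    DATA = "data"          # the data values change, e.g. TOA -> SR
    METADATA = "metadata"  # packaging only, e.g. unzipping, format conversion


# Hypothetical step records for the two situations described in the thread.
gcs_sentinel2_steps = [
    {"description": "unpack zip archives", "kind": ProcessingKind.METADATA},
]
toa_to_sr_steps = [
    {"description": "transform TOA to SR", "kind": ProcessingKind.DATA},
]


def data_is_intact(steps):
    """Return True if no step in the chain touched the data itself."""
    return all(step["kind"] is ProcessingKind.METADATA for step in steps)
```

Under these assumptions, a catalog could tell a near-mirror (only metadata steps) apart from a derived product (at least one data step) without inspecting the step descriptions.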
You have interesting constellations in your catalog ;-) - thanks for explaining those.
> In your processing example, I don't think Google should be listed as a provider
In my thoughts I had the host removed, of course, and therefore Google would be in there as hosting provider. That Google did not process the data is made clear by setting processing to {} (i.e. no further processing).
> then we'll be a provider for everything hosted in EE.
Yes, that's intentional. You'd also be listed in all datasets as host in the current specification (or you'd ignore the host property, but you could also just not add yourself to the providers in this case.)
> and I'd prefer to clearly differentiate these two situations.
That's hopefully possible at some point with the provenance extension, until then we are somewhat unclear regarding that.
Before we get into processing, I still would like to propose changing Datasets from multiple Provider objects with a single name+link for each to a single Provider with a single link and possibly multiple names (to handle situations like JRC / Google above). Compare with scientific papers that may have multiple authors but a single DOI and title. There should be only one definitive link explaining what the dataset is, I think.
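As a rough illustration of this proposal (not an agreed schema), the single-Provider shape could be sketched in Python as below. The `Provider` class, its `names` and `url` fields, and the `datasets_by_organization` helper are assumptions made up for this example; the dataset ID and homepage are taken from the JRC / Google case above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Provider:
    """Hypothetical single provider: several names, one definitive link."""
    names: List[str]
    url: str


# The JRC / Google collaboration: two organizations, one homepage.
catalog: Dict[str, Provider] = {
    "JRC_GSW1_0_MonthlyRecurrence": Provider(
        names=["EC JRC", "Google"],
        url="http://global-surface-water.appspot.com/",
    ),
}


def datasets_by_organization(catalog: Dict[str, Provider], organization: str) -> List[str]:
    """Return IDs of datasets that credit the given organization."""
    return [
        dataset_id
        for dataset_id, provider in catalog.items()
        if organization in provider.names
    ]
```

With this shape the dataset stays findable under each organization's name, while there is still exactly one link explaining what the dataset is.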
I don't want to propose something for "processing" here, that's just a rough idea about how this could be handled with a processing extension in the future.
By removing the array for providers I'd be missing an option to at least provide credit to the provider capturing the images/data (e.g. ESA, NASA, Planet, ...).
And as described in the current spec there is one definitive link explaining the dataset, it's the last one as that is the provider that processed the data last.
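For comparison, the "last entry is definitive" convention could be sketched as follows. This is only an illustration: the `name` and `url` keys follow the example earlier in this thread rather than a released spec, and `definitive_provider` and `credited_names` are hypothetical helpers.

```python
from typing import Dict, List, Optional

# Providers in processing order, as in the example earlier in the thread.
providers: List[Dict[str, str]] = [
    {"name": "ESA", "url": "http://www.esa.int"},
    {"name": "Google", "url": "http://www.google.com"},
]


def definitive_provider(providers: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return the last provider, i.e. the one that processed the data last."""
    return providers[-1] if providers else None


def credited_names(providers: List[Dict[str, str]]) -> List[str]:
    """Return every provider name, so the capturing party is still credited."""
    return [provider["name"] for provider in providers]
```

The convention only works if catalog authors keep the list in processing order, which is the assumption the rest of this discussion questions.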
Seems we need some more opinions on this one, and better sooner than later; otherwise we'll have a release out with either one of these options that we may no longer agree on.
I don't think providing credit to a data provider with a link to their home page is especially meaningful. If someone credits ESA, it's fairly obvious that the homepage will be https://www.esa.int/
In my mind, the main reason for adding the Provider field is to highlight THE definitive pointer explaining what the data actually is in exhaustive detail. Not being able to easily find the explanation for the data is a huge problem for dataset usability. If we give the option of having multiple provider links, people will misuse it and will not always put the main provider last.
I think giving the ability to list multiple provider names is enough for providing credit in most cases. If catalog owners feel that generic links to homepages are important, these links can go into the generic 'links' dictionary.
For the 500+ datasets in Earth Engine, I have not seen a single case where there needed to be two provider links of equal weight.
Now that we removed the host and further worked on the multi provider definition after the last telco, I think this issue needs to be discussed again after we get feedback from a 0.6.0 release.
We have more and more settled on the multi-provider field and it seems people use it, so I'd say this is the current state for now?!
I suggest shifting to a single object to reduce complexity. (In the internal EE catalog, I started with providers as a list and found that this is not necessary in 99.9% of cases, and the remaining 0.1% is better handled by provenance.) I'm open to revisiting this, but I'd like to see more examples first.