mixed marine/non marine IPTs

pieterprovoost commented 4 years ago

In the past we have promoted the use of the line "marine, harvested by OBIS" in the additionalMetadata EML element to indicate which datasets should be harvested from an IPT. Because dataset tagging was very incomplete, this rule has not been implemented in the new OBIS backend. There are currently two options for partial harvesting of a specific IPT:

all datasets are harvested except the ones entered manually in a list in the OBIS database
only specific datasets, again entered manually in a list in the OBIS database, are harvested
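
As a rough illustration of these two modes, here is a minimal Python sketch. The function name, the mode labels and the dataset identifiers are hypothetical and are not taken from the OBIS backend; the manually maintained list is assumed to be a plain set of dataset identifiers.

```python
from typing import Iterable, List, Set


def select_datasets(all_ids: Iterable[str], listed: Set[str], mode: str) -> List[str]:
    """Pick the datasets to harvest from one IPT.

    mode "exclude": harvest everything except the manually listed datasets.
    mode "include": harvest only the manually listed datasets.
    Both mode names are illustrative assumptions.
    """
    if mode == "exclude":
        return [d for d in all_ids if d not in listed]
    if mode == "include":
        return [d for d in all_ids if d in listed]
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical dataset identifiers on a single IPT.
ipt_datasets = ["reef_survey", "forest_birds", "plankton_tows"]
manual_list = {"forest_birds"}

harvest_all_but_listed = select_datasets(ipt_datasets, manual_list, "exclude")
harvest_only_listed = select_datasets(ipt_datasets, manual_list, "include")
```

In both modes the list has to be curated by hand in the OBIS database, which is the maintenance cost the proposal below tries to avoid.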

Because I think "marine, harvested by OBIS" is not a very clean or generalizable solution, I suggest using keywordSet instead, with the IUCN Habitats Classification Scheme as the vocabulary. There is already a matching vocabulary on the GBIF server: http://rs.gbif.org/vocabulary/iucn/habitat.xml

For IPTs marked as such, OBIS would then check the keywords in the EML to see if any marine habitats are present before harvesting a dataset.
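
A check like this could be sketched in Python as follows. This is only a sketch under assumptions: the sample EML is hand-written for illustration, the thesaurus is recognised by the substring "iucn", and a keyword counts as marine when it starts with "Marine" (as in the IUCN habitat name "Marine Neritic"). The exact labels in the GBIF vocabulary file may differ, so the matching rule would need to be adapted.

```python
import xml.etree.ElementTree as ET

# Hand-written, minimal EML fragment used only for illustration.
SAMPLE_EML = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1">
  <dataset>
    <keywordSet>
      <keyword>Marine Neritic</keyword>
      <keywordThesaurus>IUCN Habitats Classification Scheme</keywordThesaurus>
    </keywordSet>
  </dataset>
</eml:eml>"""


def local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def has_marine_habitat(eml_xml: str) -> bool:
    """Return True if any IUCN keywordSet contains a marine habitat keyword.

    Assumptions: the thesaurus text mentions "IUCN", and marine habitat
    keywords start with "Marine".
    """
    root = ET.fromstring(eml_xml)
    for elem in root.iter():
        if local_name(elem.tag) != "keywordSet":
            continue
        thesaurus = ""
        keywords = []
        for child in elem:
            text = (child.text or "").strip()
            if local_name(child.tag) == "keywordThesaurus":
                thesaurus = text
            elif local_name(child.tag) == "keyword":
                keywords.append(text)
        if "iucn" in thesaurus.lower() and any(
            k.lower().startswith("marine") for k in keywords
        ):
            return True
    return False


should_harvest = has_marine_habitat(SAMPLE_EML)
```

A harvester would run this on each dataset's EML and skip the datasets for which it returns False.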

In IPT it could look like this:

Datasets containing freshwater/terrestrial as well as marine records are a related issue, but I would like to discuss that in a different thread.

albenson-usgs commented 4 years ago

On the one hand, this definitely seems like a more elegant and machine-readable option for implementation. On the other hand, as a node manager with 91 datasets in my IPT that I would need to apply this to, many of them datasets for which I would probably not receive feedback from data providers, this seems both daunting to implement and prone to the same issues as "marine, harvested by OBIS", but made more complicated by the fact that there are decisions to make in selecting the correct vocabulary terms. At least with "marine, harvested by OBIS" it's straightforward and doesn't require data provider input.

skybristol commented 4 years ago

There is a certain elegance to the proposed method of incorporating marine environment keywords from a vocabulary in the metadata and using that to drive high level decision making on inclusion in the OBIS index. In general, I think that would be good metadata to have in OBIS datasets anyway. However, I do see the practical implications that @albenson-usgs describes, given that what this really means is an additional property to be figured out in the data.

I do agree that some more robust and generalizable solution needs to be figured out and specified. As we've discussed, OBIS eventually needs to work in other sources of data beyond IPT-DwCA, and we'll need to deal with other kinds of cases that don't fit this model at all. I assume that at the highest level, you have essentially a registry of IPT sources that are tied to the OBIS Nodes that operate them. (I tried to find this under metadata.obis.org, but the "IPT source registry" doesn't seem to be hooked in there.) This is your first line of "go harvest this stuff," and I would question whether or not it can be left at just that level.

Under the new harvesting/data processing pipeline backend, what is the relative cost in terms of resource usage and any necessary human decision making when a dataset or part of a dataset is encountered where records are "rejected" from the final OBIS index based on the records being not applicable for OBIS? If that cost is negligible because the process can basically run itself at this point or it can run asynchronously with whatever resources are allocated, even if rejecting individual records means a given dataset may take longer to complete processing, then maybe this is not that big a deal.

If it is a high system processing cost, then I think I would still take the approach of first "blacklisting" specific datasets in the registry config as an OBIS admin function, e.g., include everything at a deliberately registered IPT source by default, and put datasets on an ignore list if they are found to be irrelevant to OBIS purposes. If it's not a high processing cost, then all or part of the datasets encountered at a registered IPT get dropped in processing as not containing information that OBIS wants.

diodon commented 4 years ago

Meanwhile, I would suggest instructing the node managers to start including the specific vocabulary for the habitats (IUCN looks like the right stuff), and producing some detailed documentation on how to do it. Also, keep the "marine, harvested by OBIS" text in the additional metadata field.

pieterprovoost commented 4 years ago

@skybristol The IPT (and other) feeds are listed under the nodes here: https://api.obis.org/node and I just added an endpoint for the blacklist: https://api.obis.org/dataset/blacklist

There is currently no human decision-making cost for mixed datasets: the system will just reject taxa that cannot be matched with WoRMS. The resource cost is manageable as most datasets have a large marine component. The only issue here is that the non-matching and non-marine names show up on the portal as a potential quality issue.

For mixed feeds, the resource cost very much depends on the nature of the IPT. I think we will increasingly want to harvest just a few datasets off an IPT which predominantly has terrestrial or freshwater datasets, and processing everything could be costly in terms of computing resources. So in those cases there would be a human curation cost, and I wonder if the ten minutes spent on getting these datasets into the OBIS registry would not be better spent adding (very useful) habitat metadata to the EML.

JoBeja commented 4 years ago

Hi all, the EurOBIS team is actually dealing with a dataset that fits this issue, so this is very timely! I think we will need to find a compromise. My take is:

For previously published datasets, no changes will be made unless said datasets are updated
For datasets that are new or updated:
If we can contact the data originator, an agreement on the choice of keywords to use should be found and they should be used in the IPT
If we can't contact the data originator and the choice of keywords is obvious (e.g. it is described in the supporting metadata or documentation or can be easily matched to the IUCN keywords), we should use them
If we can't contact the data originator and the choice of keywords is not obvious (e.g. not enough information, no clear match to the IUCN keywords), we should continue with the "marine, harvested by OBIS" as is current practice.
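
The rules above can be written down as a small decision function. This Python sketch is only a restatement of the list for clarity; the function name, parameter names and returned labels are hypothetical.

```python
def keyword_strategy(
    is_new_or_updated: bool,
    can_contact_originator: bool,
    keywords_obvious: bool,
) -> str:
    """Return the proposed handling for one dataset, following the list above."""
    if not is_new_or_updated:
        # Previously published and not updated: leave as is.
        return "no change"
    if can_contact_originator:
        return "agree IUCN keywords with the data originator"
    if keywords_obvious:
        return "apply the obvious IUCN keywords"
    return 'keep "marine, harvested by OBIS"'


# A new dataset with no reachable originator and no clear keyword match.
example = keyword_strategy(True, False, False)
```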

For our dataset the last option applies as we can't contact the data originator and there isn't enough information for us to choose what the keywords should be. Is this a good approach?

Antonarctica commented 4 years ago

For the Antarctic community, the recommended keywords are those of the Global Change Master Directory (GCMD).

https://gcmd.nasa.gov/search/Keywords.do#keywords

davewatts3 commented 4 years ago

Back in the good old days (DiGIR), I was hosting marine and terrestrial data on the same server. I had a dataset with both types, so I split the records into a marine set and a non-marine set based on the collecting location. I then informed OBIS which ones to harvest from. Bit kludgy, but it was the easiest at the time, with no other logical way to mark up the data. The IUCN keywords look like a nice possibility.

iobis / obis-issues

mixed marine/non marine IPTs #167