tdwg / dwc

Darwin Core standard for sharing of information about biological diversity.
https://dwc.tdwg.org
Creative Commons Attribution 4.0 International

Controlled vocabulary for the occurrenceStatus property #342

Open baskaufs opened 3 years ago

baskaufs commented 3 years ago

The term occurrenceStatus is undergoing revision in #339 (Note: term also mentioned in #238 but that issue is not directly related to this one). As part of that revision, we should officially create the controlled vocabulary that goes with the term since it is very straightforward and the CV would have only two terms. The design pattern will follow that used in the existing Darwin Core controlled vocabularies.

The controlled vocabulary would consist of the following two terms, which use as controlled value terms the strings currently listed at https://rs.gbif.org/core/dwc_occurrence_2020-07-15.xml and as examples in the current term metadata: http://rs.tdwg.org/dwc/terms/occurrenceStatus:

English language label: present; controlled value string: present; definition: a Taxon is present at a Location

English language label: absent; controlled value string: absent; definition: a Taxon is absent at a Location

When the terms are generated, they will be assigned IRIs according to the existing design patterns for controlled vocabularies in Darwin Core. The list of terms page for the controlled vocabulary will contain instructions that the controlled value strings MUST be used as values for dwc:occurrenceStatus and that term IRIs MUST be used as values for dwciri:occurrenceStatus.
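
A minimal sketch of how a data consumer might apply the rule above, assuming the two-term vocabulary exactly as proposed. The IRIs are placeholders under example.org, because the real term IRIs are not given in this issue, and the function names are hypothetical.

```python
# Hypothetical placeholder IRIs: the real term IRIs are not given in this issue.
OCCURRENCE_STATUS_IRIS = {
    "present": "http://example.org/occurrenceStatus/present",
    "absent": "http://example.org/occurrenceStatus/absent",
}


def is_valid_occurrence_status(value: str) -> bool:
    """Return True if value is one of the controlled value strings (exact match)."""
    return value in OCCURRENCE_STATUS_IRIS


def to_dwciri(value: str) -> str:
    """Map a dwc:occurrenceStatus string to the IRI that would be used
    as the value of dwciri:occurrenceStatus."""
    if not is_valid_occurrence_status(value):
        raise ValueError(f"not a controlled value string: {value!r}")
    return OCCURRENCE_STATUS_IRIS[value]
```

Under this reading, a value such as "Present" or "doubtful" would be rejected for dwc:occurrenceStatus, which is the point of contention later in the thread.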

tucotuco commented 3 years ago

Might it be better to be more explicit in the definitions, something like 'an Organism was present at the designated place and time', to avoid the misconception that the term can be applied to a Taxon?

tucotuco commented 3 years ago

Or more in keeping with the proposed term definition, 'an Organism was present within the bounded place and time'.

nielsklazenga commented 3 years ago

For me, there are at least two terms missing that were in the previous GBIF Occurrence Status Vocabulary:

I would also like to make a case for extinct, which the (old) GBIF vocabulary had as a synonym of absent. extinct is secondarily absent: the taxon has been present in the area once. extinct is also in threat status, but there it cannot be used with introduced occurrences.

deepreef commented 3 years ago

doubtful...literature reports that do not cite specimens or cite specimens that cannot be located

I would not automatically brand unvouchered reports of occurrences as "doubtful". I would guess that in the vast majority of cases, they represent legitimate occurrence instances of organisms that the reporters would have identified as the indicated taxon. I think there may need to be some other indication that there is no explicit evidence (specimen, image, etc.), which doesn't explicitly cast doubt on the Occurrence, but captures the apparent absence of evidence. Perhaps something like "unvouchered" or "unverified"? If I had to choose between "present" and "doubtful" for voucher-less published occurrence reports, I would choose "present".

I would also like to make a case for extinct, which the (old) GBIF vocabulary had as a synonym of absent. extinct is secondarily absent: the taxon has been present in the area once. extinct is also in threat status, but there it cannot be used with introduced occurrences.

I would reserve the word extinct to mean "from planet Earth". For our systems, we use the term extirpated to indicate that a taxon once occurred in an area (naturally or introduced) and has since been eliminated from that area (but not entirely from planet Earth).

Having said that, I'm queasy about framing instances of "Taxon-at-Location" as Occurrence instances. I would rather see instances of Occurrence be more strictly defined as "Organism-at-Event". I think we will ultimately need a new class to represent properties of instances of "Taxon-at-Location"; although I realize that people also use Occurrence for this purpose (a practice that is encouraged by terms such as establishmentMeans, degreeOfEstablishment, pathway, and occurrenceStatus being organized within the Occurrence class).

I don't think we're ready for this leap just yet, but perhaps next year we can think about a new class along the lines of TaxonLocation within which we can organize these (and other -- e.g., threatStatus?) terms.

nielsklazenga commented 3 years ago

I would not automatically brand unvouchered reports of occurrences as "doubtful".

My bad. Me neither. The occurrence status for these reports themselves would be present. I was (or meant to be) talking about statements in literature that specifically say the occurrence of a taxon in a given area is doubtful. I see now that I totally did not make that clear. These doubtful statements are probably not equally common in all taxonomic groups, but they are quite common in the taxonomic group I work on (bryophytes). There is quite a bit of sleuthing involved in making these statements; not just a data entry job.

I would reserve the word extinct to mean "from planet Earth". For our systems, we use the term extirpated to indicate that a taxon once occurred in an area (naturally or introduced) and has since been eliminated from that area (but not entirely from planet Earth).

I am fine with extirpated instead of extinct.

Having said that, I'm queasy about framing instances of "Taxon-at-Location" as Occurrence instances. I would rather see instances of Occurrence be more strictly defined as "Organism-at-Event". I think we will ultimately need a new class to represent properties of instances of "Taxon-at-Location"; although I realize that people also use Occurrence for this purpose (a practice that is encouraged by terms such as establishmentMeans, degreeOfEstablishment, pathway, and occurrenceStatus being organized within the Occurrence class).

I would like that too, to separate the primary occurrence data from the derived data. We would still use the same terms (occurrenceStatus etc.) though, as is done in the GBIF Species Distribution Extension.

deepreef commented 3 years ago

So... having thought about it a bit more, I think an alternate (and perhaps superior) approach to the "Taxon-at-Location" issue is to keep it as Occurrence (i.e., Organism-at-Event), but expand the potential scope of Organism to include something like "population". When that class was being discussed, we talked about how far up the aggregate scale was best to go. Clearly it needed to go beyond "single organism" both to accommodate colonial things (corals, ant/bee colonies, etc.), and also "packs"/"pods" (e.g., wolves, whales). However, although we did discuss extending out as far as "Populations", people got queasy about that (myself included) because at some point a "population" blends into a Taxon.

However, after having worked with this more directly (especially for the Plants of Hawaii project), I am increasingly seeing the value of using "population" as an allowable scope for Organism, so long as it remains taxonomically homogeneous. Basically, it's a way of recognizing aggregate sets of Organisms starting at "one" and increasing all the way up to (and including) the most granular flavor of "Taxon". This is both logically consistent, and allows us to represent "Taxon-at-Location" instances as Occurrences, with the Organism part being some unit of population, and the Event part capturing both space and time properties.

More thought required.

nielsklazenga commented 3 years ago

@deepreef, I share your (and other people's) uneasiness, but I think the definition of Organism is already there:

Organism A particular organism or defined group of organisms considered to be taxonomically homogeneous.

Taxon A group of organisms (sensu http://purl.obolibrary.org/obo/OBI_0100026) considered by taxonomists to form a homogeneous unit

My uneasiness, by the way, already starts when the definition of Organism starts to deviate from what I would consider an organism, so with the pack of wolves or the pod of dolphins. I do not think there is a danger of an Organism ever becoming a Taxon.

deepreef commented 3 years ago

Thanks @nielsklazenga :

but I think the definition of Organism is already there:

Yes, and that definition was crafted that way to leave the door open for including "populations" (which can certainly fall within the realm of a "defined group of organisms considered to be taxonomically homogeneous").

So... the definition does not need to change -- just the Comments (expand the parenthetical to "(such as packs, clones, colonies, and populations)"), and the Examples need another sentence, "A defined population of organisms belonging to the same taxon."

I do not think there is a danger of an Organism ever becoming a Taxon.

I agree. That was one of the reasons the original scope stopped short of explicitly including populations; but I think that is the lesser concern than the fact that DwC currently has no other way to represent a "population" -- but clearly that is a unit of biological instance many of us would like to assign properties to, and subsequently share. If we allow the extent of a population to reach (but not exceed) "every organism on earth belonging to a specified Taxon", then that becomes a fairly clear demarcation between where an Organism ends, and a Taxon begins.

peterdesmet commented 3 years ago

doubtful is a property that also came up for biologging data. E.g. an outlier occurrence on a track. But I think it's better to leave the occurrenceStatus vocab to present and absent and handle outliers another way (e.g. remove them, give them a high coordinateUncertaintyInMeters, ...). Ping @tdwg/biologging

nielsklazenga commented 3 years ago

I agree that this use case is not a use case for occurrenceStatus:doubtful, but maybe for a high coordinateUncertaintyInMeters, but that does not mean that there are no valid use cases for occurrenceStatus:doubtful. coordinateUncertaintyInMeters only makes sense if there is a georeference, which is not always the case. Also, in almost every case where I would want to use doubtful, it is the Identification of the organism that is in question, not its Location, so enlarging the area (by giving a higher coordinateUncertaintyInMeters) is not going to make the occurrenceStatus any less doubtful.

doubtful is in a vocabulary that has been used for occurrenceStatus for a long time, so you cannot really talk about "leaving" the vocab to present and absent, as that is a change of current practice. If there is going to be a normative vocabulary on occurrenceStatus, it needs to include all the terms that are needed to get a reasonably nuanced representation in all use cases of occurrenceStatus. Unless these terms do not belong in occurrenceStatus and are better accommodated some place else, but nobody has made that argument for doubtful (or excluded, or extirpated).

sarahcd commented 3 years ago

From the machine observation side, I am not too familiar with use cases in Darwin Core but would be hesitant to use coordinateUncertaintyInMeters and assume a consistent result: different sensors have different ways of reporting the quality of location estimates, and it is common for there to be no quality estimate at all, and for very large outliers to be identified by the sensor as being a good quality location. How would this differentiate location uncertainty provided by a sensor from uncertainty introduced through some later processing method?

Probably, 'outliers' would be ideally removed prior to putting into Darwin Core. But keep in mind what a user will define as an outlier will depend on context, commonly something like, "is a location an outlier that is in the right area but probably 1-5 km off?", which depends on the scale of the analysis they are thinking about.

deepreef commented 3 years ago

So... the distinction here is about what, exactly, is "doubtful". If we accept that an instance of Occurrence represents the intersection of an Organism and an Event, then the most proximal interpretation of an occurrenceStatus value of "doubtful" would be "it is doubtful that this Organism occurred at this Event".

Simply increasing the value of coordinateUncertaintyInMeters only addresses one aspect of the potential doubt: the "where" part. In other words, a secondary property of the relevant Event is Location, so assigning a large value for coordinateUncertaintyInMeters for that associated Location suggests that the Organism did exist, but that the location where it was stated to have existed at the time indicated in the Event is doubtful.

But this doesn't address the "when" part of the Event. In some cases, the "when" might be the property that is called into doubt (e.g., a record for a passenger pigeon in North Carolina in 1976, when the actual date was probably 1876; or the record for a Pacific Golden Plover in Hawaii in July; the latter more plausible than the former, but still doubtful).

And, of course there is the "what" part of the Organism. The Organism might have actually occurred at the time and location of the Event, but the identification of it is doubtful. Indeed, this is often the basis for assertions of "doubtful" as applied to Occurrences -- meaning that no one doubts that the Organism existed, and that it occurred at the Event, but it was simply misidentified to the wrong taxon.

In other contexts, the very existence of the Organism might be in doubt.

Another problem with the coordinateUncertaintyInMeters solution, alluded to by @sarahcd, is that this parameter is one step removed from the Occurrence itself. For example, suppose I conduct a fish transect at 2m depth in Kaneohe Bay, Hawaii, where I have good GPS-derived coordinates of the Location. I record 25 Occurrences for the 25 different species I reportedly observed. One of those species is otherwise only known to occur in Antarctica at depths of >1000m; thus the occurrence of an Organism of that taxon at that location is highly doubtful. I can't just change the coordinateUncertaintyInMeters for the Location tied to that Event, because it would be incorrect for the other 24 Occurrence instances tied to that Event. I have to create a new Location instance, and then create a new Event instance to link that new Location to the Organism to capture the "doubtfulness" of the Occurrence.
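
The transect example can be sketched roughly as follows. This is an illustration only: the class and field names are hypothetical (Darwin Core does not prescribe a data model), and the date, species labels, and uncertainty values are made up.

```python
from dataclasses import dataclass


@dataclass
class Location:
    locality: str
    coordinate_uncertainty_in_meters: float


@dataclass
class Event:
    location: Location
    event_date: str


@dataclass
class Occurrence:
    scientific_name: str
    event: Event


# One transect Event at one Location, shared by every Occurrence recorded on it.
# The date and the uncertainty value are made-up illustration values.
bay = Location("Kaneohe Bay, Hawaii", coordinate_uncertainty_in_meters=5.0)
transect = Event(location=bay, event_date="2020-01-01")
occurrences = [Occurrence(f"species {i}", transect) for i in range(1, 26)]

# Inflating the uncertainty to flag the one doubtful record...
doubtful = occurrences[24]
doubtful.event.location.coordinate_uncertainty_in_meters = 1_000_000.0

# ...changes it for every Occurrence, because they all point at the same Location.
affected = sum(
    o.event.location.coordinate_uncertainty_in_meters == 1_000_000.0
    for o in occurrences
)
```

Here `affected` counts all 25 records rather than one, which is why the doubtful record would need its own Location and Event instances under this approach.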

For these (and many other) reasons, I think it might be best to capture properties of "doubt" in some other way (e.g., via MeasurementOrFact?) I realize that the term occurrenceStatus implies a suite of possible statuses, but this discussion emerged from the proposal to alter the definition of this term from:

"A statement about the presence or absence of a Taxon at a Location."

to:

"A statement about the presence or absence of an Organism within a bounded place and time."

(see #339)

In other words, to more properly match this property to an Occurrence sensu stricto (i.e., Organism-at-Event, instead of Taxon-at-Location). The part of the definition that is not proposed to be changed is the "about the presence or absence" part. Thus, it seems to me that a controlled vocabulary for this term should be restricted to "present" and "absent". I could see an argument for adding "uncertain", but I fear that trying to add "doubtful" might overly complicate the purpose of this term and its controlled values.

albenson-usgs commented 3 years ago

doubtful: "The taxon is scored as being present in the area but there is some doubt over the evidence."

For me, this seems to be conflating data and annotations in the same term. Is there ever an instance where a data provider/collector would put "doubtful" in that field? If not, then I think it needs to be a separate term.

My perspective is similar on "extinct" or "extirpated". If the organism(s) were there at that time and place then the occurrenceStatus = "present". If a researcher goes out to a place where there was an organism there before but does not find it occurrenceStatus = "absent". I don't see how you would have an occurrence that has an occurrenceStatus = "extirpated" unless you find the last one that ever existed. Extinct/extirpated is a result from having multiple occurrenceStatus = present and then several at a later time that are absent.
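
The idea that extirpation is derived from a series of primary records, rather than recorded on any single one, might be sketched like this. The rule is a deliberately simplified assumption (at least one "present" record, and the most recent record is "absent"); the function name and the "unknown" fallback are hypothetical.

```python
from typing import List, Tuple

# Each record is (year, occurrenceStatus) for one taxon in one area.
Record = Tuple[int, str]


def summarize_status(records: List[Record]) -> str:
    """Derive an area-level summary from primary present/absent records.

    Hypothetical rule, for illustration only: the taxon is 'extirpated' if
    it has at least one 'present' record and the most recent record is 'absent'.
    """
    if not records:
        return "unknown"  # hypothetical fallback, not a proposed vocabulary term
    ordered = sorted(records)  # tuples sort by year first
    latest_status = ordered[-1][1]
    ever_present = any(status == "present" for _, status in ordered)
    if latest_status == "absent" and ever_present:
        return "extirpated"
    return latest_status
```

Every input record carries only "present" or "absent"; "extirpated" appears solely as the output of the aggregation.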

nielsklazenga commented 3 years ago

For me, this seems to be conflating data and annotations in the same term. Is there ever an instance where a data provider/collector would put "doubtful" in that field? If not, then I think it needs to be a separate term.

I think you can use the same term in data and annotations on that data without conflating them, but that definition is not great. I just took it from the GBIF vocabulary. doubtful just means that there is doubt about the presence of the Taxon (or Organism) in an area. The doubtful is not in an annotation on the original occurrence record, but in a new occurrence record, most likely for a different (more inclusive) Location than the present record. What is the difference between an absence record based on a survey and one based on over a hundred years of collections and literature and knowledge of a taxon? And yes, there are plenty of instances where data providers would put 'doubtful' in that field. I would not argue so hard if we did not really need it.

There is no real distinction between an Organism-at-Event occurrence and a Taxon-at-Location occurrence, as an Organism belongs to a Taxon (and an Organism can comprise anything from an individual organism to an entire taxon) and an Event is at a Location. The difference is only in the size of the area and the duration of the Event, a matter of scale more than anything else.

My perspective is similar on "extinct" or "extirpated". If the organism(s) were there at that time and place then the occurrenceStatus = "present". If a researcher goes out to a place where there was an organism there before but does not find it occurrenceStatus = "absent". I don't see how you would have an occurrence that has an occurrenceStatus = "extirpated" unless you find the last one that ever existed. Extinct/extirpated is a result from having multiple occurrenceStatus = present and then several at a later time that are absent.

That is all true of course, but extirpated is more informative, especially when the previous present records are not readily at hand. And if the duration of the Event is long enough (or the area small enough) an Organism can be extirpated during an Event. It is also possible to have PreservedSpecimens from Locations where the Taxon no longer occurs, for which dots still show up on maps and then an extirpated assertion makes more sense than an absent assertion. Not everybody may want to use these terms and they certainly do not apply in all situations/use cases, but we need them for floristic and faunistic data and they are occurrenceStatus, also under the changed definition of #339. And I do not understand what the problem is. TDWG vocabularies are SKOS vocabularies (I think), so there could just be statements like extirpated skos:broader absent and excluded skos:broader absent. Just makes for a better vocabulary (and better data).
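
A minimal sketch of what the skos:broader statements above would allow, assuming the hierarchy is exactly the two links mentioned (extirpated and excluded under absent). The dictionary and function names are hypothetical, and no broader term is assumed for doubtful because none was suggested.

```python
# skos:broader links as suggested above: narrower term -> broader term.
# 'doubtful' is deliberately left out because no broader term was proposed for it.
BROADER = {
    "extirpated": "absent",
    "excluded": "absent",
}

TOP_LEVEL = {"present", "absent"}


def roll_up(value: str) -> str:
    """Follow skos:broader links until a top-level term is reached."""
    seen = set()
    while value not in TOP_LEVEL:
        if value in seen or value not in BROADER:
            raise ValueError(f"no path to present/absent from {value!r}")
        seen.add(value)
        value = BROADER[value]
    return value
```

A consumer that only understands present and absent could call `roll_up` on each value and still interpret records that use the more specific terms.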

deepreef commented 3 years ago

For me, this seems to be conflating data and annotations in the same term.

I agree with @albenson-usgs here. I didn't articulate it well in my previous post, but it seems to me the present/absent dichotomy represents primary data, as in: "I/we record this Organism at this Event" (present); or "I/we specifically looked for any Organism that I/we would consider to be a member of this Taxon at this Event, but we failed to find one" (absent). However, a value of "doubtful" seems to me to only be applicable as an annotation on an earlier assertion of Occurrence, as in: "An Organism of this Taxon has been recorded at this Event by someone else, but we doubt its veracity."

I don't see how you would have an occurrence that has an occurrenceStatus = "extirpated" unless you find the last one that ever existed. Extinct/extirpated is a result from having multiple occurrenceStatus = present and then several at a later time that are absent.

and from @nielsklazenga :

What is the difference between an absence record based on a survey and one based on over a hundred years of collections and literature and knowledge of a taxon?

I think there can be a distinction, and I think this only works if we wish to document information about Taxon-at-Location instances through Occurrences by extending Organism to include "population", and including a time range context. There are two different statements:

I believe it's possible to clearly make both statements through Occurrence instances, because both involve Location + Time [=Event] + Organism + taxonomic Identification. The main difference is that the former would presumably score the Organism as a single individual and provide a precise Location and time for the Event, and the latter would score the Organism as a population, with broader/less precise values of Location and time.

Incidentally, the recording of "absent" is a little odd to capture for the first type in the list above, because it implies the existence of an Organism with an asserted taxonomic Identification that was absent from the Event. We deal with this in our implementation by allowing the creation of "virtual" Organism instances -- referring to the abstract idea of an Organism, rather than a particular Organism (we established this in part to accommodate recording explicit "absences").

All that said, I am still uncomfortable using occurrenceStatus to capture what seems to me to be, as @albenson-usgs noted, a conflation of data and annotations.

There is no real distinction between an Organism-at-Event occurrence and a Taxon-at-Location occurrence, as an Organism belongs to a Taxon (and an Organism can comprise anything from an individual organism to an entire taxon) and an Event is at a Location. The difference is only in the size of the area and the duration of the Event, a matter of scale more than anything else.

Well... I think there is a difference between Organism-at-Event and Taxon-at-Location in that the latter has no time component. I would agree that there is (potentially) no difference between Organism-at-Event and Taxon-at-Event (allowing for sufficiently broad ranges of time) -- especially if we agree that:

an Organism can comprise anything from an individual organism to an entire taxon

[And we do seem to agree about this!]

nielsklazenga commented 3 years ago

All that said, I am still uncomfortable using occurrenceStatus to capture what seems to me to be, as @albenson-usgs noted, a conflation of data and annotations.

That might be so, but it is used that way now, for example in the GBIF Species Distribution Vocabulary, and is completely in accordance with the definition. I am completely fine with it and do not see how this is annotation (on what?) and not data that is collected in a different way.

I do not think we really disagree on any of this, but if people want to make this distinction between low-level and high-level occurrences (I would never throw them together in the same data set, but I do think we can use some of the same terms for them), and want a term for use with only primary occurrence data that only allows the values present and absent, that is a new term. As it stands now, the proposed vocabulary (this issue) does not fit the proposed definition (#339) for occurrenceStatus and violates the stability requirement in the Vocabulary Maintenance Specification.

nielsklazenga commented 3 years ago

Sorry, could not let it go...

Well... I think there is a difference between Organism-at-Event and Taxon-at-Location in that the latter has no time component. I would agree that there is (potentially) no difference between Organism-at-Event and Taxon-at-Event

Occurrences do have a time component by definition; it is not brought by the Event. Darwin Core Events are actions by humans or machines and only coincide with Occurrences. If there is a difference between Organism-at-Event and Taxon-at-Location, it is not in the nature of the Occurrence, but in the nature of the Observation.

deepreef commented 3 years ago

Occurrences do have a time component by definition; it is not brought by the Event. Darwin Core Events are actions by humans or machines and only coincide with Occurrences. If there is a difference between Organism-at-Event and Taxon-at-Location, it is not in the nature of the Occurrence, but in the nature of the Observation.

I disagree. There are no date or time properties associated with the Occurrence class in DwC -- these properties are inherited from the associated dwc:Event (eventDate, eventTime, startDayOfYear, endDayOfYear, year, month, day, verbatimEventDate). This is why I refer to an Occurrence as the intersection of an Organism and an Event. The taxonomic identification of the Occurrence is inherited from the Organism; the time and Location of the Occurrence are inherited from the Event.

Obviously, DwC is not an ontology, so the organization of terms within classes does not mandate that the terms only be used in connection with the associated classes. However, a big part of these various discussions is about inferring an ontology (of sorts) or semantic data model from the clustering of DwC terms within the DwC classes. And in that context, Occurrence instances inherit their location and time properties from an associated Event, just as they inherit their taxonomic identification properties from an associated Organism.

nielsklazenga commented 3 years ago

I do not understand where you get these notions from. These are the definitions:

Occurrence An existence of an Organism (sensu http://rs.tdwg.org/dwc/terms/Organism) at a particular place at a particular time.

Event An action that occurs at some location during some time. Example: A specimen collection process. A camera trap image capture. A marine trawl.

Looks to me Occurrence has a time component (by definition) and is independent of an Event. The only interface between an Occurrence and an Event is that they happen at the same place and the same time, so coincidence.

Obviously, DwC is not an ontology, so the organization of terms within classes does not mandate that the terms only be used in connection with the associated classes. However, a big part of these various discussions is about inferring an ontology (of sorts) or semantic data model from the clustering of DwC terms within the DwC classes. And in that context, Occurrence instances inherit their location and time properties from an associated Event, just as they inherit their taxonomic identification properties from an associated Organism.

Darwin Core is a vocabulary, so I think it would be best if we would keep the data models out of it and stick to the definitions. The only thing that matters for a term to be included in a controlled vocabulary is (1) whether it is used within the domain the standard applies to (all the terms that I suggested are) and (2) whether it fits the definition of the term the vocabulary is for (they all do). Everything else is irrelevant. If there is going to be a normative vocabulary (I would prefer it to be informative), it has to be inclusive.

That being said, I think you are confusing occurrences and observations and you have an interesting idea of inheritance.

albenson-usgs commented 3 years ago

I have to agree with @deepreef here. I disagree that the definitions of the terms are the only thing that matters. If you look at the terms that are included in the Occurrence Class you won't find any that cover time or place. To me Darwin Core is less of a vocabulary and more of a way to model data (e.g. scientific names can be column headers because all scientific names need to fall under scientificName). Actually modeling the data using Darwin Core is what allows for the integration of multiple datasets together. To me that's the whole point.

deepreef commented 3 years ago

Hi @nielsklazenga :

I do not understand where you get these notions from.

Well... I guess from many, many conversations at TDWG meetings and other gatherings and emails and other online discussions going back to the very earliest days of Darwin Core. As I said, nobody misinterprets DwC as an Ontology, but the discussions surrounding its creation and evolution are definitely driven by a need for ontological interpretations of the terms. That's why the Classes were added originally. Admittedly, some have explored this more explicitly than others -- for example Roger Hyam did some of the early work in this direction, and @baskaufs and Cam Webb developed Darwin-SW. More generally, coming to a common consensus of the core "objects" in biodiversity informatics informs the need to establish new terms and refine existing terms in ways that better allow content holders to share data in a more consistent and "actionable" way.

So yes, in one sense, DwC is just a "bag of terms" that have clear definitions, which are only "organized" in classes. However, each xxxxID term in DwC implies a conceptual object to which specific properties apply, and the more consistent data providers are with establishing parity/cardinality between DwC terms as properties applied to corresponding conceptual objects represented by xxxxxID terms, the more effective our ability will be in sharing data with each other in a way that empowers aggregate analysis.

So, when I read the definition of dwc:Occurrence as "An existence of an Organism at a particular place at a particular time", and Event as "An action that occurs at some location during some time", I interpret: Occurrence = Organism+Event; where the "what" part (Organism) is explicitly referenced in the definition of Occurrence, and the "where/when" part (Event) is implicitly referenced in the definition of Occurrence.

If I'm not alone in this interpretation, then perhaps we should consider modifying the definition of Occurrence to be something more explicit for both Occurrence and Event, like:

"An existence of an Organism (sensu http://rs.tdwg.org/dwc/terms/Organism) at a particular Event (sensu http://rs.tdwg.org/dwc/terms/Event)."

Would you (or anyone else) object to such a refinement of the definition of dwc:Occurrence?

nielsklazenga commented 3 years ago

Would you (or anyone else) object to such a refinement of the definition of dwc:Occurrence?

Not offhand, as long as there is an alternative in Darwin Core for the use cases that will no longer be accommodated. I think it would make much more sense to have dwc:Occurrence as a superclass (or abstract class) that has the properties that are shared between more specific ObservedOccurrence and TaxonAreaOccurrence classes. Also observing that Occurrences s.str. can only be present; as soon as you allow absent, you are in the domain of Occurrences s.l.

The argument here though is not about which data containers we should have. It is about whether 'doubtful', 'excluded' and 'extirpated' are in scope for a vocabulary for occurrenceStatus, or rather whether a vocabulary with only 'absent' and 'present' is sufficient. The scope of dwc:Occurrence as a data carrying object is irrelevant here. In fact, occurrenceStatus is going to be problematic with really strictly defined Occurrences (as data objects), as occurrences are by definition 'present' and 'absent' is going to be the most problematic, as "absent" occurrences are not occurrences.

nielsklazenga commented 3 years ago

@tucotuco in #339:

Does the proposal to have a controlled vocabulary for occurrenceStatus hold merit? 'Yes' 'Yes'

Should there be additional values in the controlled vocabulary? Controversial Comment: So far there is no consensus on the inclusion of any terms other than present and absent.

This is turning the world upside-down a bit. If you enforce a vocabulary on a term that did not have one before, the initial vocabulary needs to include all the terms that have been used (appropriately, in accordance with the definition) before and you'd need community consensus on whether you can leave a term out. You cannot just start with an empty vocabulary and require community consensus on whether a new controlled term should be included or not, as it becomes about people accepting or rejecting other people's use cases, rather than about the meaning of terms.

I think the purpose of vocabularies should be mainly to enable interpretation of the values that people deliver, rather than cramping people's style. The latter seems more a job for application profiles. It is mostly this part of the proposal that bothers me:

The list of terms page for the controlled vocabulary will contain instructions that the controlled value strings MUST be used as values for dwc:occurrenceStatus and that term IRIs MUST be used as values for dwciri:occurrenceStatus.

... but maybe I am interpreting this incorrectly and it means that the controlled terms can only be used with occurrenceStatus and nowhere else (a bit difficult for 'present' and 'absent'), rather than that these are the only terms that can be used with occurrenceStatus. @baskaufs, could you explain?

Is the proposed definition of present satisfactory? 'No' Comment: The definition will have to be in accord with the final accepted definition of occurrenceStatus. A generic definition that will almost certainly work regardless of other concerns is something like the following: the target was detected within the given bounds of place and time

Is the proposed definition of absent satisfactory? 'No' Comment: The definition will have to be in accord with the final accepted definition of occurrenceStatus. A generic definition that will almost certainly work regardless of other concerns is something like the following: the target was not detected within the given bounds of place and time

These definitions are not what 'present' and 'absent' mean. Moreover, they place 'present' and 'absent' firmly out of scope of a vocabulary on occurrenceStatus, as they are not about the occurrence, but the observation. If people really want to have a term only about whether an occurrence has been detected in a particular Event or not, I'd suggest that requires a new property.

nielsklazenga commented 3 years ago

I have to agree with @deepreef here. I disagree that the definitions of the terms are the only thing that matters. If you look at the terms that are included in the Occurrence Class you won't find any that cover time or place. To me Darwin Core is less of a vocabulary and more of a way to model data (e.g. scientific names can be column headers because all scientific names need to fall under scientificName). Actually modeling the data using Darwin Core is what allows for the integration of multiple datasets together. To me that's the whole point.

You should not model data in the context of a standard for a domain, as you will always be talking in circles. The data that you are modelling will always fit and the data that you are not modelling hardly ever will. You should model the domain (the entire domain). An instance of dwc:Occurrence itself is not an occurrence, but an instance of dwc:PreservedSpecimen or dwc:HumanObservation etc. (or a specimen or an observation). An occurrence, as in what the word means, is an extension of a dwc:Occurrence. That is why you can have absent dwc:Occurrences. Occurrences (as in what the word means) are by definition present.

Taxon distribution elements, or Taxon-at-Locations, have the same extension, so occurrenceStatus applies (or can be applied) to those as well, as is current usage and the intention of Darwin Core. I could argue that terms like establishmentMeans and degreeOfEstablishment actually better fit the distribution element records than the event-based occurrence records that are instances of dwc:Occurrence.

I think it is definition and usage that matter. Definitions are of course influenced by models that people have in their heads, so I am not saying they are not important. It is only the definitions that are ratified though, so when we are talking about appropriate usage of a term, definition is indeed the only thing that matters. I am only talking about the 'occurrence' in occurrenceStatus, by the way, not the definition of dwc:Occurrence. It was @deepreef himself who suggested we could use dwc:Occurrence for taxon distribution elements.

All the extra controlled terms I have suggested, by the way, can be easily mapped to present or absent (only to absent actually), if the different types of data sets ever intersect (not all data sets should be merged). They just have some more information content and have been used in the domain – and as occurrenceStatus – for years and I think the purpose of Darwin Core is not so much to restrict the terms that can be used, but to make sense of the terms that are used within the domain. They also give the reason why we include the record in the data, as we do not tend to record every absence. The absent in the sense of not detected would also fit as one of those refinements of absent in the sense of not present (with present being existing in a Location at a given time), but people chose not to do so and that is fine.
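To make the mapping concrete, here is a minimal Python sketch of how the richer values could be collapsed onto present/absent when data sets do need to be merged. The `BROADER` table and the `collapse_status` function are hypothetical names, not anything defined by Darwin Core, and mapping `doubtful` to `absent` simply follows the claim above; others might reasonably treat it differently.

```python
# Hypothetical mapping of richer occurrenceStatus values onto the
# two-term vocabulary; none of these names are defined by Darwin Core.
BROADER = {
    "present": "present",
    "absent": "absent",
    "doubtful": "absent",    # assumption, following the comment above
    "excluded": "absent",
    "extirpated": "absent",
}


def collapse_status(value: str) -> str:
    """Return 'present' or 'absent' for a known occurrenceStatus value."""
    key = value.strip().lower()
    if key not in BROADER:
        raise ValueError(f"unrecognised occurrenceStatus: {value!r}")
    return BROADER[key]


# Made-up records, only to show the collapse applied during a merge.
records = [
    {"scientificName": "Taxon A", "occurrenceStatus": "extirpated"},
    {"scientificName": "Taxon B", "occurrenceStatus": "present"},
]
merged = [
    {**r, "occurrenceStatus": collapse_status(r["occurrenceStatus"])}
    for r in records
]
print(merged)
```

The original value is lost in `merged`, so a real pipeline would probably keep it in a separate field; that choice is left open here.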

deepreef commented 3 years ago

It was @deepreef himself who suggested we could use dwc:Occurrence for taxon distribution elements.

I assume what you mean here is my assertion that we can capture information about Taxon-at-Location in the context of Organism-at-Event (i.e., how I characterize dwc:Occurrence) if we have a sufficiently broad scope for Organism (i.e., extending up to at least "population" and perhaps all the way to and including Taxon in the sense framed by @tucotuco when he wrote "it should be perfectly fine to have an Organism that includes every member of a Taxon at any scale, including the whole planet"); and if we understand that the "at-Location" part is bounded by time, which (again, in my view) translates to "at-Event".

I still believe that to be true (even more so now than before). My broader point is that while the DwC standard is not an ontology, it behooves us to establish definitions for terms (classes and properties) that (again, as suggested by @tucotuco), shift DwC closer to a place where it can be represented more explicitly as a semantic framework.

nielsklazenga commented 3 years ago

@deepreef, this is what I was referring to:

So... having thought about it a bit more, I think an alternate (and perhaps superior) approach to the "Taxon-at-Location" issue is to keep it as Occurrence (i.e., Organism-at-Event), but expand the potential scope of Organism to include something like "population".

I still believe that to be true (even more so now than before). My broader point is that while the DwC standard is not an ontology, it behooves us to establish definitions for terms (classes and properties) that (again, as suggested by @tucotuco), shift DwC closer to a place where it can be represented more explicitly as a semantic framework.

I took what @tucotuco was saying as that terms should be broadly applicable.

deepreef commented 3 years ago

@nielsklazenga : yes, that's what I mean by "my assertion that we can capture information about Taxon-at-Location in the context of Organism-at-Event"

I think we're close to full agreement here in terms of what Occurrences can represent. The main differences seem to be how strictly we treat the DwC Classes as semantic objects, vs. organizational conveniences.

nielsklazenga commented 3 years ago

So here is a straw man for a more comprehensive vocabulary for occurrenceStatus:

controlled value string: present definition: The Organism is present at the Location at the given time English language label: present alternate labels: extant

controlled value string: absent definition: The Organism is not present at the Location at the given time English language label: absent narrower: excluded, extirpated

~controlled value string: endemic definition: Only present within the confines of the given Location English language label: endemic broader: present usage note: 'endemic' can also be taken to mean naturally occurring within an area. This term is only to be used in the meaning of not occurring outside the given Location and only in the context of distribution of taxa.~

controlled value string: doubtful definition: Presence of the Organism at the Location at the given time is doubtful English language label: doubtful usage note: doubtful can be used when there are only unverifiable and/or historical reports of the taxon (or a taxon with the same scientificName) in the Location. The nature of the doubt can be in the identity of the taxon reported, i.e. the report could be based on a misidentification or a different taxon concept; the location of the reported taxon, e.g. because a specimen might have been mislabeled; whether the taxon is still present at the Location, i.e. with very old reports; or a combination of the above.

controlled value string: excluded definition: The Organism is considered absent from the Location, despite earlier reports of a Taxon with the same name from the Location English language label: excluded broader: absent usage note: A taxon is excluded from a Location when all verifiable reports of the taxon in the Location have been dismissed.

controlled value string: extirpated definition: The Organism is known to have been resident at the Location, but has disappeared and is absent from the Location at the given time English language label: extirpated alternate labels: extinct broader: absent usage note: A taxon is considered extirpated in a Location when it is known to have been living in that Location, but is no longer there. 'extinct' is often used in this meaning, but here 'extinct' is taken to mean that the taxon no longer exists.

EDIT: changed definitions as per @qgroom's suggestion [2021-05-12]
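For anyone who wants to experiment with the straw man, the entries above could be held in a small structure like the following Python sketch. The `Term` class and the `resolve` and `narrower` helpers are hypothetical names made up for illustration; the struck-out `endemic` entry is left out, and the alternate labels are taken as listed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Term:
    value: str                         # controlled value string
    definition: str
    broader: Optional[str] = None      # controlled value string of the broader term
    alt_labels: Tuple[str, ...] = ()   # alternate labels


# Straw-man entries as listed above ('endemic' omitted, since it is struck out).
TERMS = [
    Term("present",
         "The Organism is present at the Location at the given time",
         alt_labels=("extant",)),
    Term("absent",
         "The Organism is not present at the Location at the given time"),
    Term("doubtful",
         "Presence of the Organism at the Location at the given time is doubtful"),
    Term("excluded",
         "The Organism is considered absent from the Location, despite earlier "
         "reports of a Taxon with the same name from the Location",
         broader="absent"),
    Term("extirpated",
         "The Organism is known to have been resident at the Location, but has "
         "disappeared and is absent from the Location at the given time",
         broader="absent", alt_labels=("extinct",)),
]

VOCAB = {t.value: t for t in TERMS}
# Every label (controlled value string or alternate label) points at its term.
LABELS = {label: t.value for t in TERMS for label in (t.value, *t.alt_labels)}


def resolve(label: str) -> Term:
    """Look a term up by controlled value string or alternate label."""
    key = label.strip().lower()
    if key not in LABELS:
        raise KeyError(f"no term with label {label!r}")
    return VOCAB[LABELS[key]]


def narrower(value: str) -> list:
    """Controlled value strings of the terms whose broader term is `value`."""
    return sorted(t.value for t in TERMS if t.broader == value)


print(resolve("extant").value)   # present
print(narrower("absent"))        # ['excluded', 'extirpated']
```

Keeping `broader` as a plain string rather than an object reference mirrors the way the entries are written out above, and leaves `doubtful` without a broader term, as in the straw man.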

qgroom commented 3 years ago

@nielsklazenga All of your definitions need to refer to a specific time and place, but particularly absent. doubtful can apply to one period and not another in the same location. extirpated can be reversed, even multiple times. endemic is particularly difficult, because it additionally relies on establishmentMeans and could be in conflict with it.

I see the need for doubtful, but it always begs the question of what was doubtful. For example, I can see it being used to indicate that a bird was doubtfully a resident, even if it was present.

nielsklazenga commented 3 years ago

@qgroom, happy to make those changes. I did this quite quickly. I thought I could get away with not repeating the location and time in the definitions by referring to present, which has it, but I will take your advice. I had been thinking about it, but wasn't sure.

We might just leave endemic out? It is also a difficult term for me, because of its different meanings and because some of our botanists want to use 'endemic' and 'extinct' in the same record.

Regarding doubtful, it is always difficult with words that can be used in different contexts. Should we do something like:

Presence of the Organism in the Location at the given time is doubtful

, i.e. defining the term rather than the word? And then do the same for the other terms?

qgroom commented 3 years ago

Presence of the Organism in the Location at the given time is doubtful

, i.e. defining the term rather than the word? And then do the same for the other terms?

Yes, seems better

tucotuco commented 3 years ago

This proposal has been labelled as controversial. If no evidence of consensus can be reached by the end of the 30-day minimum review period, the proposal will be deferred for later consideration. If there is evidence that a consensus can be reached, the review period will be extended for an additional 30 days from the time apparent consensus is established (everyone participating in the discussion expresses their satisfaction with the proposed solution).

tucotuco commented 3 years ago

This proposal has been labeled as 'Controversial' and in need of a task group for resolution. It is no longer part of an active public review.