Open mgaynor1 opened 1 year ago
Hi @mgaynor1,
In order to do date range searches (collected between two dates) we convert "things that look like some kind of date" to an actual date.
What would you suggest the month and date be set to when none are provided?
As a botanist, I heavily use the eventDate to identify duplicate records. These could be two herbarium specimens taken from one individual, each deposited at a different herbarium, which then send their records for ingestion on different days. Due to this ingestion process, the date is now meaningless and cannot be used to identify true duplicates.
Any study that has used the month and day provided by the datecollected column could be inferring biological meaning where it doesn't exist - at this time, I would caution the use of iDigBio data for any phenology studies unless researchers are only using the data.dwc columns. iDigBio's search feature should not come before the quality of this data and I urge you to prioritize correcting this. Obtaining data within a certain date range is meaningless when the date was made up during the ingestion of the data and has no biological meaning.
I suggest that you all stop creating data that has no meaning. Do not put a month and day when none is provided. You are not converting a "kind of date" to an "actual date", it is converted to a fake date.
Maybe just follow GBIF and convert things to ISO 8601 (2004) (YYYY-MM-DD, YYYY-MM, or YYYY) - see GBIF's description here and a great recent blog post on dates.
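To make the suggestion concrete, here is a minimal sketch of formatting a date at whatever precision was actually provided, so nothing is invented. The function name `to_iso_partial` is hypothetical and is not part of iDigBio or GBIF code; it only assumes the year, month, and day arrive as separate optional integers.

```python
from datetime import date
from typing import Optional


def to_iso_partial(year: Optional[int],
                   month: Optional[int] = None,
                   day: Optional[int] = None) -> Optional[str]:
    """Format a date as YYYY, YYYY-MM, or YYYY-MM-DD without inventing parts.

    Hypothetical helper: missing components are simply left out of the string.
    """
    if year is None:
        return None
    if month is None:
        return f"{year:04d}"
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    if day is None:
        return f"{year:04d}-{month:02d}"
    # date() validates the day against the month (e.g. rejects June 31).
    return date(year, month, day).isoformat()


print(to_iso_partial(1960))          # year only
print(to_iso_partial(1960, 6))       # year and month
print(to_iso_partial(1960, 6, 15))   # full date
```

A year-only record stays "1960" instead of becoming a specific day, which keeps the precision visible to anyone downstream.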
Also - the minimum date of a collection shouldn't be 1700 but somewhere closer to 1550 (https://doi.org/10.2307/2421492).
To identify duplicate records you may have more success using the as-published data fields rather than interpreted fields. The interpreted fields are subject to change over time.
A stronger way to say this is datecollected is not an appropriate field for that use case. You already found the better fields:
data.dwc:day or data.dwc:month ... data.dwc:year
I appreciate the links to the GBIF examples which are good suggestions for data providers on sharing / publishing dates. I do not see a discussion of how this affects data access, display, and discovery in the GBIF system on these fields, though.
For example, in GBIF's web UI, if a data record contains only a year such as '1960', which month in the histogram contains that record?
In the iDigBio case, I'd be more concerned about whether this hypothetical '1960' record would ever show up in any search that included a month such as '1960-06'.
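One way to reason about that search question is to treat a partial date as the interval it could cover and ask whether it overlaps the query. This is only a sketch under that assumption; `bounds`, `overlaps`, and `contained` are hypothetical names, and neither iDigBio nor GBIF is confirmed to work this way.

```python
from datetime import date, timedelta
from typing import Optional, Tuple

Interval = Tuple[date, date]  # half-open: [start, end)


def bounds(year: int, month: Optional[int] = None,
           day: Optional[int] = None) -> Interval:
    """Return the half-open interval covered by a possibly partial date."""
    if month is None:
        return date(year, 1, 1), date(year + 1, 1, 1)
    if day is None:
        end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        return date(year, month, 1), end
    start = date(year, month, day)
    return start, start + timedelta(days=1)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the record *might* fall inside the query range."""
    return a[0] < b[1] and b[0] < a[1]


def contained(a: Interval, b: Interval) -> bool:
    """True if the record *certainly* falls inside the query range."""
    return b[0] <= a[0] and a[1] <= b[1]


record = bounds(1960)      # record that only says '1960'
query = bounds(1960, 6)    # search for '1960-06'
print(overlaps(record, query), contained(record, query))
```

Under this model the '1960' record is a possible match for '1960-06' but not a certain one, so a search could expose both behaviours instead of picking a fake day.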
Also - the minimum date of a collection shouldn't be 1700 but somewhere closer to 1550 (https://doi.org/10.2307/2421492).
Interesting! I confirm there are now some records in GBIF older than year 1700. We can use that "new" earlier date from now on, thanks for pointing that out.
Hey Dan - Take a step back and look at that screen capture from GBIF, notice a large number of records from January? They are assigning YEAR-01-01 when month and day are not provided. This is a standardized approach and means researchers can take out all 01-01 records if they need dates.
You are right, I found a workaround. But, do most data users know about the "data.dwc:" fields? Should iDigBio really return fake dates to users just to streamline a search feature? The answer to both of these is no.
Taking the current date of ingestion and assigning it as the month/day of specimen collection is wild and needs to be fixed.
If you have to assign months and days for functionality - assign January 1st to all records with missing months and days. Here are additional GBIF pages where they discuss searchability/sharing.
Finally - where does iDigBio document these interpretations so users can make informed decisions?
For future readers, the specific GBIF implementation discussion: https://discourse.gbif.org/t/gbif-api-supporting-ranges-in-occurrence-eventdate/3804
None of the solutions to inventing missing data are ideal and all of them have trade-offs.
Converting "1960" to "1960-01-01" is also creating a fake date. I understand your opinion that this type of fake date is preferred to the current one.
Previous domain experts on the project determined that artificially inflating how much collecting activity was happening on the first of every month, and the first of every year, was the poorer choice of the various potential solutions. Some managers or PIs had strong opinions on various aesthetics of this issue. For some research activities the current method was preferred (the introduction of these fake dates is statistically distributed across the entire date range rather than always falling into the first date bucket).
I do not have answers to all of your questions but I will make sure the Cyberinfrastructure team is aware of this github issue.
José escalated this issue to the project PIs.
I don't have a complete response but did want to share the details I have so far:
'datecollected' is not a Darwin Core field. No providers are publishing that field. The GBIF materials are talking about 'dwc:eventDate'. In iDigBio, 'datecollected' is a computed / aggregate field and used to build a consistent-looking date on the Label View in the portal, and allow searches over date ranges.
'datecollected' is not a suitable field to use when looking for duplicate records. The computation of that field could vary over time (for example, if the PIs want to make a change). In most cases, the source fields which are accessed under the "data" subkey such as 'data.dwc:eventDate' would make a better choice, in the same way one would be looking at the source fields for genus and species information ('data.dwc:genus') rather than 'scientificname' which is subject to taxonomic interpretations. There are a number of date-related fields in Darwin Core, but providers sometimes only publish data into some of them.
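As a rough illustration of matching on source fields rather than on 'datecollected', the sketch below groups records by as-published values. The record layout (a "data" subkey with raw terms next to "indexTerms"), the sample records, and the choice of 'dwc:recordedBy' and 'dwc:recordNumber' as extra matching keys are assumptions made for the example, not a recommendation from the project.

```python
from collections import defaultdict

# Hypothetical records: same specimen gathering, ingested on different days.
records = [
    {"uuid": "a",
     "data": {"dwc:recordedBy": "J. Smith", "dwc:recordNumber": "101",
              "dwc:eventDate": "1960"},
     "indexTerms": {"datecollected": "1960-04-01"}},
    {"uuid": "b",
     "data": {"dwc:recordedBy": "j. smith ", "dwc:recordNumber": "101",
              "dwc:eventDate": "1960"},
     "indexTerms": {"datecollected": "1960-09-17"}},
    {"uuid": "c",
     "data": {"dwc:recordedBy": "A. Jones", "dwc:recordNumber": "7",
              "dwc:eventDate": "1988-05"},
     "indexTerms": {"datecollected": "1988-05-01"}},
]


def duplicate_groups(recs, keys=("dwc:recordedBy", "dwc:recordNumber",
                                 "dwc:eventDate")):
    """Group record uuids whose as-published fields match (case/space-insensitive)."""
    groups = defaultdict(list)
    for rec in recs:
        raw = rec.get("data", {})
        key = tuple(" ".join(str(raw.get(k) or "").split()).lower() for k in keys)
        if all(key):  # skip records missing any of the matching fields
            groups[key].append(rec["uuid"])
    return [ids for ids in groups.values() if len(ids) > 1]


print(duplicate_groups(records))
```

Records "a" and "b" group together here even though their interpreted 'datecollected' values differ.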
In iDigBio, 'eventdate' is currently stored as a string and contains the data as-published in 'dwc:eventDate'.
Some example data values in 'eventdate':
"1931-05-18/1931-05-25" "1967-00-00" "1967" "1988-05" "1920-05-21/24" "2008-01-12"
Whether and how we could exactly implement GBIF's solution would need additional nontrivial research by the team.
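For a sense of what that research would need to cover, here is a small sketch that classifies the example 'eventdate' strings above by the precision they actually carry. The function name `precision` is hypothetical, and treating "00" as "not provided" is an assumption about how publishers use that placeholder.

```python
import re

# YYYY, YYYY-MM, or YYYY-MM-DD (placeholders such as 00 are allowed through).
_SINGLE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")


def precision(value: str) -> str:
    """Classify an as-published eventDate string by the precision it carries."""
    value = value.strip()
    if "/" in value:
        return "interval"
    match = _SINGLE.match(value)
    if not match:
        return "unparsed"
    _, month, day = match.groups()
    if month in (None, "00"):   # assumption: 00 means "not provided"
        return "year"
    if day in (None, "00"):
        return "month"
    return "day"


examples = ["1931-05-18/1931-05-25", "1967-00-00", "1967",
            "1988-05", "1920-05-21/24", "2008-01-12"]
for value in examples:
    print(value, "->", precision(value))
```

This does not resolve abbreviated intervals such as "1920-05-21/24"; it only shows that the precision can be read off the string before any month or day is filled in.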
- The question of "where does the content of this iDigBio field come from?" is a longstanding gap of knowledge. In the future this could be addressed by additional documentation (possibly adding a column to https://github.com/iDigBio/idigbio-search-api/wiki/Index-Fields) but this is also complicated because the Portal currently sometimes looks in multiple available fields to display as much data as possible and doesn't necessarily map 1:1 to the search API. Working on making these aspects consistent would be a mini-project unto itself.
There are instances where the Darwin Core values take the approach of assigning the first day of the period for the dwc:month and dwc:day fields. This is likely why the GBIF monthly histograms have excessive January counts. Here's an example: https://portal.idigbio.org/portal/records/4125d2a8-1bc1-4744-86be-549ac814b579
Would a data quality flag specifying the eventDate interval size in days, or just a non-specific eventDate flag, be useful in excluding these records from analysis requiring a date?
Best practice for eventDate includes a time. We currently default to midnight. This is the same problem but with hours. Perhaps we could have a projected time data quality flag as well.
Hi there. In the example you provided, iDigBio is only sampling the starting date in the interval and is not assigning 01-01 artificially.
Intervals can be found in the DarwinCore eventDate field and should be found there.
This discussion focuses on datecollected, which is a field modified by iDigBio. eventDate is not modified by iDigBio and thus can be used in any analysis requiring a date. I do not think we should modify eventDate.
Adding a flag when datecollected includes an artificially generated date may be helpful for determining if someone should ever use date to search for records (if all datecollected are artificial, no they should not use datecollected to search for dates). As you mentioned, time is included, so all datecollected will likely have some artificial aspect, thus we would need multiple flags to account for this (ex. "year, month, day, and time are artificial" vs "time is artificial").
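A flag like that could be derived from which source components were empty. The sketch below is only an illustration: `artificial_parts_flag` is a hypothetical name, and it assumes each component is passed as None or an empty string when the provider did not publish it.

```python
def artificial_parts_flag(year, month, day, time) -> str:
    """Describe which parts of datecollected had to be filled in artificially.

    Returns an empty string when every part was published by the provider.
    """
    names = ("year", "month", "day", "time")
    missing = [name for name, value in zip(names, (year, month, day, time))
               if value in (None, "")]
    if not missing:
        return ""
    if len(missing) == 1:
        return f"{missing[0]} is artificial"
    if len(missing) == 2:
        return f"{missing[0]} and {missing[1]} are artificial"
    return f"{', '.join(missing[:-1])}, and {missing[-1]} are artificial"


print(artificial_parts_flag(1960, None, None, None))
print(artificial_parts_flag(1960, 6, 15, None))
```

Plain text keeps the flag readable without a lookup key, at the cost of being harder to filter on than a fixed vocabulary.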
Here is a great paper that discusses dates in research: https://doi.org/10.1111/1365-2435.14173
How about a date specificity value that is based on days? An eventDate specifying the year 1912 would generate a date specificity value of 366. A year and month such as 1902-06 would generate a date specificity value of 30. An ISO 8601 2019 standard interval of 2007-03-01T13:00:00Z/2008-05-11T15:30:00Z would generate a date specificity value of 437.104166667. Then you could use the value to exclude records based on your needs.
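The arithmetic behind those three numbers can be sketched as follows. `specificity_days` is a hypothetical name, the input is assumed to be a clean ISO 8601 string, and single timestamps with a time component are deliberately left out of this sketch.

```python
import calendar
from datetime import datetime


def specificity_days(event_date: str) -> float:
    """Width, in days, of the period an ISO 8601 eventDate string covers."""
    if "/" in event_date:
        start_s, end_s = event_date.split("/")
        # fromisoformat() does not accept a trailing 'Z' before Python 3.11.
        start = datetime.fromisoformat(start_s.replace("Z", "+00:00"))
        end = datetime.fromisoformat(end_s.replace("Z", "+00:00"))
        return (end - start).total_seconds() / 86400
    if "T" in event_date:
        raise ValueError("single timestamps are not handled in this sketch")
    parts = event_date.split("-")
    year = int(parts[0])
    if len(parts) == 1:
        return 366.0 if calendar.isleap(year) else 365.0
    if len(parts) == 2:
        return float(calendar.monthrange(year, int(parts[1]))[1])
    return 1.0


print(specificity_days("1912"))
print(specificity_days("1902-06"))
print(specificity_days("2007-03-01T13:00:00Z/2008-05-11T15:30:00Z"))
```

The 366 for 1912 holds because 1912 is a leap year; a non-leap year would give 365, and a month gives 28 to 31 depending on the month.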
We are only discussing datecollected here.
The purpose of datecollected is to provide a searchable/ingestible value from eventDate. Users are interested in ranges of dates; they likely want collections between years or in certain months, not based on day of the year. So no, please do not make datecollected based on day of the year. Just to outline your logic for anyone else reading:
- If eventDate only has a year, you are assigning datecollected as the days in a leap year.
- If eventDate only has a year and month, you are assigning datecollected the days in a month.
- If eventDate is an interval, you are assigning datecollected as the days between the two dates of the interval, which can be very misleading, as not all intervals are more than a year and may just look like a day in the year.
I strongly recommend you do not do this.
I was suggesting adding a data quality field in addition to the current fields, not as a replacement for datecollected.
In all responses, please be specific about which field you are discussing.
An additional field based on your logic above would not be interpretable to users due to the intervals varying in length. If you want to flag with numbers, make a key and flag with numbers. However, I do not recommend number flags unless they are heavily documented and are stable values (ex. eventDate with only year = 366, eventDate with year and month but no day = 30, eventDate as interval = 500, eventDate with year & month & day = 0). I wouldn't do this as it would likely not be interpretable to users; instead, use explanatory text fields as flags (if we add flags).
This would not be a boolean value but a float similar to coordinate uncertainty values we currently have. A date such as 2007-03-01 would have a specificity value of 1.0 because it's an entire day from gte 00:00 to lt 00:00 of the subsequent day. A date with time 2007-03-01T13:00:00Z would be a one hour interval and have a date specificity of 1/24 of a day as it might be interpreted as during the hour of 1PM.
Maybe something you can propose to TDWG? I do not see how this helps with this issue and would encourage us not to create more fields without documenting what exists in the current fields.
The issue discussion here should shift back to the datecollected fields having randomly assigned dates.
When the data.dwc:day or data.dwc:month is missing, but a data.dwc:year is provided, the datecollected column is assigned the current month and day.
This error comes from this: https://github.com/iDigBio/idb-backend/blob/3c9551c5032e28b80a9a50ad39ea44306cd69080/idb/helpers/conversions.py#L544-L606
Here is the line causing this issue: https://github.com/iDigBio/idb-backend/blob/3c9551c5032e28b80a9a50ad39ea44306cd69080/idb/helpers/conversions.py#L599
This is really easy to recreate in Python as well:
Out[1]: datetime.date(2010, 4, 1)
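For anyone who wants to reproduce this without depending on today's date, here is a hedged sketch. It assumes the linked line goes through `dateutil.parser.parse`, which fills any component missing from the string from its `default` argument (the current date at midnight when no default is passed). The pinned date and the helper name `parse_with_provenance` are made up for the example.

```python
from datetime import datetime

from dateutil import parser

# Pin "today" so the result is reproducible; normally dateutil uses the real date.
pretend_today = datetime(2024, 4, 1)
filled = parser.parse("2010", default=pretend_today)
print(filled.date())  # month and day are copied from pretend_today


def parse_with_provenance(text: str):
    """Parse twice with different defaults to see which parts came from the text.

    Hypothetical helper: a component that changes with the default was not
    present in the input string.
    """
    first = parser.parse(text, default=datetime(2000, 1, 1))
    second = parser.parse(text, default=datetime(2001, 2, 2))
    supplied = {name: getattr(first, name) == getattr(second, name)
                for name in ("year", "month", "day")}
    return first, supplied


print(parse_with_provenance("2010")[1])
print(parse_with_provenance("2010-06-15")[1])
```

Parsing twice is a cheap way to detect borrowed components without touching dateutil internals; the same information could feed a flag or decide whether to emit a partial date instead.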