iDigBio / idb-backend

iDigBio server and backend code for data ingestion, media processing, record indexing, and data API.
GNU General Public License v3.0

datecollected assigned as current month and day #229

Open mgaynor1 opened 1 year ago

mgaynor1 commented 1 year ago

When the data.dwc:day or data.dwc:month is missing, but a data.dwc:year is provided, the datecollected column is assigned the current month and day.

This error comes from this: https://github.com/iDigBio/idb-backend/blob/3c9551c5032e28b80a9a50ad39ea44306cd69080/idb/helpers/conversions.py#L544-L606

Here is the line causing this issue: https://github.com/iDigBio/idb-backend/blob/3c9551c5032e28b80a9a50ad39ea44306cd69080/idb/helpers/conversions.py#L599

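This is really easy to recreate in Python as well:

import dateutil.parser
import datetime

year = "2010"
# Only a year is given, so dateutil fills the month and day from today's date.
dateutil.parser.parse(year).date()

# Output when run on April 1:
Out[1]: datetime.date(2010, 4, 1)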
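
For anyone following along, the result above comes from how dateutil handles fields that are absent from the input: they are copied from the `default` argument of `dateutil.parser.parse`, which falls back to today's date at midnight when it is not supplied. Below is a minimal sketch of that mechanism. The January 1 default is only there to show that the filled-in values are controllable; it is an assumption for illustration, not what idb-backend does.

```python
import datetime
import dateutil.parser

year = "2010"

# With no `default`, the missing month and day are taken from today's date,
# so the result depends on when the code runs.
today = datetime.date.today()
implicit = dateutil.parser.parse(year).date()
print(implicit.year, implicit.month == today.month)

# An explicit `default` makes the filled-in fields deterministic.
# (Illustrative choice only: year 1, January 1.)
fixed_default = datetime.datetime(1, 1, 1)
explicit = dateutil.parser.parse(year, default=fixed_default).date()
print(explicit)  # 2010-01-01
```

Either way, the month and day in the result are not present in the source data.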

danstoner commented 1 year ago

Hi @mgaynor1,

In order to do date range searches (collected between two dates) we convert "things that look like some kind of date" to an actual date.

What would you suggest the month and date be set to when none are provided?

mgaynor1 commented 1 year ago

As a botanist, I rely heavily on eventDate to identify duplicate records. These could be two herbarium specimens taken from one individual but deposited at different herbaria, which then send their records for ingestion on different days. Because the ingestion date leaks into datecollected, the date is now meaningless and cannot be used to identify true duplicates.

Any study that has used the month and day provided by the datecollected column could be inferring biological meaning where it doesn't exist - at this time, I would caution the use of iDigBio data for any phenology studies unless researchers are only using the data.dwc columns. iDigBio's search feature should not come before the quality of this data and I urge you to prioritize correcting this. Obtaining data within a certain date range is meaningless when the date was made up during the ingestion of the data and has no biological meaning.

I suggest that you all stop creating data that has no meaning. Do not put a month and day when none is provided. You are not converting a "kind of date" to an "actual date", it is converted to a fake date.

Maybe just follow GBIF and convert things to ISO 8601 (2004) (YYYY-MM-DD, YYYY-MM, or YYYY) - see GBIF's description here and a great recent blog post on dates.

Also - the minimum date of a collection shouldn't be 1700 but somewhere closer to 1550 (https://doi.org/10.2307/2421492).

danstoner commented 1 year ago

To identify duplicate records you may have more success using the as-published data fields rather than interpreted fields. The interpreted fields are subject to change over time.

A stronger way to say this is datecollected is not an appropriate field for that use case. You already found the better fields:

data.dwc:day or data.dwc:month ... data.dwc:year

I appreciate the links to the GBIF examples which are good suggestions for data providers on sharing / publishing dates. I do not see a discussion of how this affects data access, display, and discovery in the GBIF system on these fields, though.

For example, in GBIF's web UI, if a data record contains only a year such as '1960', which month in the histogram contains that record?

Screenshot from 2023-04-02 13-36-09

In the idigbio case, I'd be more concerned about whether this hypothetical '1960' record would ever show up in any search that included a month such as '1960-06'.
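
One way to make that concern concrete is to treat a partial date as the interval it covers and to test for overlap with the query interval. The sketch below uses only the standard library; the function names are hypothetical and it is not a description of how the iDigBio or GBIF search is implemented.

```python
import calendar
from datetime import date


def covered_interval(partial):
    """Return (first_day, last_day) covered by 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD'."""
    parts = [int(p) for p in partial.split("-")]
    year = parts[0]
    if len(parts) == 1:
        return date(year, 1, 1), date(year, 12, 31)
    month = parts[1]
    if len(parts) == 2:
        last = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last)
    day = date(year, month, parts[2])
    return day, day


def overlaps(record_value, query_value):
    """True if the record's interval and the query's interval share any day."""
    rec_start, rec_end = covered_interval(record_value)
    q_start, q_end = covered_interval(query_value)
    return rec_start <= q_end and q_start <= rec_end


print(overlaps("1960", "1960-06"))        # True: the year interval contains June
print(overlaps("1960-01-01", "1960-06"))  # False: a single fixed day does not
```

Under this assumption a year-only record matches a search for any month of that year, whereas a record pinned to one invented day matches only searches that happen to include that day.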

> Also - the minimum date of a collection shouldn't be 1700 but somewhere closer to 1550 (https://doi.org/10.2307/2421492).

Interesting! I confirm there are now some records in GBIF older than year 1700. We can use that "new" earlier date from now on, thanks for pointing that out.

mgaynor1 commented 1 year ago

Hey Dan - Take a step back and look at that screen capture from GBIF. Notice the large number of records in January? They are assigning YEAR-01-01 when month and day are not provided. This is a standardized approach and means researchers can remove all 01-01 records if they need reliable dates.

You are right, I found a workaround. But, do most data users know about the "data.dwc:" fields? Should iDigBio really return fake dates to users just to streamline a search feature? The answer to both of these is no.

Taking the current date of ingestion and assigning it as the month/day of specimen collection is wild and needs to be fixed.
If you have to assign months and days for functionality - assign January 1st to all records with missing months and days. Here are additional GBIF pages where they discuss searchability/sharing.

Finally - where does iDigBio document these interpretations so users can make informed decisions?

danstoner commented 1 year ago

For future readers, the specific GBIF implementation discussion: https://discourse.gbif.org/t/gbif-api-supporting-ranges-in-occurrence-eventdate/3804

None of the solutions to inventing missing data are ideal and all of them have trade-offs.

Converting "1960" to "1960-01-01" is also creating a fake date. I understand your opinion that this type of fake date is preferred to the current one.

Previous domain experts on the project determined that artificially inflating how much collecting activity was happening on the first of every month, and the first of every year, was the poorer choice of the various potential solutions. Some managers or PIs had strong opinions on various aesthetics of this issue. For some research activities the current method was preferred (the introduction of these fake dates is statistically distributed across the entire date range rather than always falling into the first date bucket).

I do not have answers to all of your questions but I will make sure the Cyberinfrastructure team is aware of this github issue.

danstoner commented 1 year ago

José escalated this issue to the project PIs.

I don't have a complete response but did want to share the details I have so far:

  1. 'datecollected' is not a Darwin Core field. No providers are publishing that field. The GBIF materials are talking about 'dwc:eventDate'. In iDigBio, 'datecollected' is a computed / aggregate field and used to build a consistent-looking date on the Label View in the portal, and allow searches over date ranges.

  2. 'datecollected' is not a suitable field to use when looking for duplicate records. The computation of that field could vary over time (for example, if the PIs want to make a change). In most cases, the source fields which are accessed under the "data" subkey such as 'data.dwc:eventDate' would make a better choice, in the same way one would be looking at the source fields for genus and species information ('data.dwc:genus') rather than 'scientificname' which is subject to taxonomic interpretations. There are a number of date-related fields in Darwin Core, but providers sometimes only publish data into some of them.

  3. In iDigBio, 'eventdate' is currently stored as a string and contains the data as-published in 'dwc:eventDate'.

Some example data values in 'eventdate':

"1931-05-18/1931-05-25" "1967-00-00" "1967" "1988-05" "1920-05-21/24" "2008-01-12"

Whether and how we could exactly implement GBIF's solution would need additional nontrivial research by the team.

  4. The question of "where does the content of this iDigBio field come from?" is a longstanding gap of knowledge. In the future this could be addressed by additional documentation (possibly adding a column to https://github.com/iDigBio/idigbio-search-api/wiki/Index-Fields) but this is also complicated because the Portal currently sometimes looks in multiple available fields to display as much data as possible and doesn't necessarily map 1:1 to the search API. Working on making these aspects consistent would be a mini-project unto itself.

wilsotc commented 10 months ago

There are instances where the Darwin Core values take the approach of assigning the first day of the period to the dwc:month and dwc:day fields. This is likely why the GBIF monthly histograms have excessive January counts. Here's an example: https://portal.idigbio.org/portal/records/4125d2a8-1bc1-4744-86be-549ac814b579

Would a data quality flag specifying the eventDate interval size in days, or just a non-specific eventDate flag, be useful in excluding these records from analyses requiring a date?

Best practice for eventDate includes a time. We currently default to midnight. This is the same problem but with hours. Perhaps we could have a projected time data quality flag as well.

mgaynor1 commented 10 months ago

Hi there. In the example you provided, iDigBio is only sampling the starting date in the interval and is not assigning 01-01 artificially.

Intervals can be found in the DarwinCore eventDate field and should be found there.

This discussion focuses on datecollected, which is a field modified by iDigBio. eventDate is not modified by iDigBio and thus can be used in any analysis requiring a date. I do not think we should modify eventDate.

Adding a flag when datecollected includes an artificially generated date may be helpful for determining if someone should ever use date to search for records (if all datecollected are artificial, no they should not use datecollected to search for dates). As you mentioned, time is included, so all datecollected will likely have some artificial aspect, thus we would need multiple flags to account for this (ex. "year, month, day, and time are artificial" vs "time is artificial").
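
As a rough sketch of what such text flags could look like, the snippet below inspects an as-published value and reports which components would have to be invented to build a full timestamp. The helper names, the flag wording, and the decision to treat "00" as missing are all assumptions for illustration; intervals and free-text dates are deliberately left uninterpreted.

```python
import re

# Hypothetical pattern: YYYY, YYYY-MM or YYYY-MM-DD, optionally followed by a time.
_ISO_PARTIAL = re.compile(
    r"^(?P<year>\d{4})(?:-(?P<month>\d{2}))?(?:-(?P<day>\d{2}))?"
    r"(?P<time>T\d{2}:\d{2}(?::\d{2})?Z?)?$"
)


def artificial_parts(event_date):
    """Return the components that are absent from the published value."""
    m = _ISO_PARTIAL.match(event_date.strip())
    if m is None:
        return None  # intervals and free text are out of scope for this sketch
    missing = []
    # "00" is treated as missing, as in values like "1967-00-00".
    if m.group("month") in (None, "00"):
        missing.append("month")
    if m.group("day") in (None, "00"):
        missing.append("day")
    if m.group("time") is None:
        missing.append("time")
    return missing


def flag_text(event_date):
    """Build an explanatory text flag from the missing components."""
    missing = artificial_parts(event_date)
    if missing is None:
        return "not interpreted"
    if not missing:
        return "no artificial components"
    verb = "is" if len(missing) == 1 else "are"
    return ", ".join(missing) + " " + verb + " artificial"


print(flag_text("1967"))        # month, day, time are artificial
print(flag_text("2008-01-12"))  # time is artificial
```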

Here is a great paper that discusses dates in research: https://doi.org/10.1111/1365-2435.14173

wilsotc commented 10 months ago

How about a date specificity value that is based on days? An eventDate specifying the year 1912 would generate a date specificity value of 366. A year and month such as 1902-06 would generate a date specificity value of 30. An ISO 8601 2019 standard interval of 2007-03-01T13:00:00Z/2008-05-11T15:30:00Z would generate a date specificity value of 437.104166667. Then you could use the value to exclude records based on your needs.
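
A minimal sketch of how such a value could be computed, using only the standard library. The function name is hypothetical, and the handling of non-leap years (365) is my assumption, since the example above only covers a leap year. Values with a time component outside an interval are rejected rather than guessed at.

```python
import calendar
from datetime import datetime, timedelta


def date_specificity(value):
    """Length in days of the period covered by an ISO 8601-style value."""
    if "/" in value:
        # Interval with full timestamps on both sides.
        start, end = (
            datetime.strptime(part, "%Y-%m-%dT%H:%M:%SZ")
            for part in value.split("/")
        )
        return (end - start) / timedelta(days=1)
    if "T" in value:
        raise ValueError("single timestamps are not handled in this sketch")
    parts = value.split("-")
    year = int(parts[0])
    if len(parts) == 1:
        # Year only: the whole year.
        return 366.0 if calendar.isleap(year) else 365.0
    if len(parts) == 2:
        # Year and month: the number of days in that month.
        return float(calendar.monthrange(year, int(parts[1]))[1])
    # Full date: one day.
    return 1.0


print(date_specificity("1912"))     # 366.0
print(date_specificity("1902-06"))  # 30.0
print(round(date_specificity("2007-03-01T13:00:00Z/2008-05-11T15:30:00Z"), 9))
```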

mgaynor1 commented 10 months ago

We are only discussing datecollected here.

The purpose of datecollected is to provide a searchable/ingestable value from eventDate. Users are interested in ranges of dates; they likely want collections between years or in certain months, not based on day of the year. So no, please do not make datecollected based on day of the year. Just to outline your logic for anyone else reading:

I strongly recommend you do not do this.

wilsotc commented 10 months ago

I was suggesting adding a data quality field in addition to the current fields, not as a replacement for datecollected.

mgaynor1 commented 10 months ago

In all responses, please be specific about which field you are discussing.

An additional field based on your logic above would not be interpretable to users due to the intervals varying in length. If you want to flag with numbers, make a key and flag with numbers. However, I do not recommend number flags unless they are heavily documented and are stable values (ex. eventDate with only year = 366, eventDate with year and month, but no day = 30, eventDate as interval = 500, eventDate with year & month & day = 0 ). I wouldn't do this as it would likely not be interpretable to users, instead use explanatory text fields as flags (if we add flags).

wilsotc commented 10 months ago

This would not be a boolean value but a float similar to coordinate uncertainty values we currently have. A date such as 2007-03-01 would have a specificity value of 1.0 because it's an entire day from gte 00:00 to lt 00:00 of the subsequent day. A date with time 2007-03-01T13:00:00Z would be a one hour interval and have a date specificity of 1/24 of a day as it might be interpreted as during the hour of 1PM.

mgaynor1 commented 10 months ago

Maybe something you can propose to TDWG? I do not see how this helps with this issue and would encourage us not to create more fields without documenting what exists in the current fields.

The issue discussion here should shift back to the datecollected field having randomly assigned dates.