iDigBio / ridigbio

ridigbio -- an R interface to iDigBio's API (see http://www.idigbio.org/)
http://idigbio.github.io/ridigbio/

Date returned #44

Open mgaynor1 opened 8 months ago

mgaynor1 commented 8 months ago

This function currently returns "datecollected", which is a modified field and may lack biological meaning. Dates should instead be returned in the following fields: "data.dwc:eventDate", "data.dwc:year", "data.dwc:month", and "data.dwc:day".

https://github.com/iDigBio/ridigbio/blob/608791e8d2fd5f9732db583e94adda4d172c3fcb/R/idig_search_records.R#L133

When this is modified, someone should reach out to spocc. They will need to update multiple scripts including:

https://github.com/ropensci/spocc/blob/59f6b3b192cd8a7bb990aab94748f3bc7b044dac/R/plugin_helpers.R#L34
https://github.com/ropensci/spocc/blob/59f6b3b192cd8a7bb990aab94748f3bc7b044dac/R/occ2df.R#L104
https://github.com/ropensci/spocc/blob/59f6b3b192cd8a7bb990aab94748f3bc7b044dac/R/plugins.r#L266

jbennettufl commented 8 months ago

After our last internal meeting, we decided to work on updating datecollected so that, if the month or day is missing when datecollected is created, it is set to the first month or the first day. For example, if only dwc:year has a value, such as "1984", then datecollected would become 1984-01-01. This is an ongoing effort related to https://github.com/iDigBio/idb-backend/issues/229, so I don't think it is necessary to fix this in both places: changing the R client would be unnecessary once the data itself correctly represents the Darwin Core fields. There is also the issue of breaking backwards compatibility, plus implementation details to address. For instance, if we were to attempt to update this field in the client, it would take considerable overhead and there would still be some side effects. The same thing can be achieved with the following code:

library(flipTime)
library("ridigbio")

DATECORRECTED_FIELDS <- c("uuid",
                          "occurrenceid",
                          "catalognumber",
                          "family",
                          "genus",
                          "scientificname",
                          "country",
                          "stateprovince",
                          "geopoint",
                          "data.dwc:eventDate",
                          "data.dwc:year",
                          "data.dwc:month",
                          "data.dwc:day",
                          "collector",
                          "recordset")

# rq is assumed to be an existing record query (a named list of search terms)
df <- idig_search_records(rq = rq, fields = DATECORRECTED_FIELDS, limit = 6000)

df <- within(df, datecollected <- as.Date("1970-01-01"))

for (i in seq_along(df$`data.dwc:eventDate`)) {
  if (!is.na(df$`data.dwc:eventDate`[i])) {
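    # If the value contains a slash (a date range), take the date to the left.
    # grepl() tests for a substring; %in% would only match a value equal to "/".
    if (grepl("/", df$`data.dwc:eventDate`[i], fixed = TRUE)) {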
      date_range <- unlist(strsplit(df$`data.dwc:eventDate`[i], "/"))
      start_date <- AsDate(date_range[1], on.parse.failure = "warn")

      # Use the date to the left of the forward slash
      df$datecollected[i] <- start_date
    } else {
      # If "data.dwc:eventDate" is present but without a slash, use AsDate()
      df$datecollected[i] <- AsDate(df$`data.dwc:eventDate`[i],
                                    on.parse.failure = "warn")
    }
  } else {
    # If "data.dwc:eventDate" is not present, construct the date
    year <- df$`data.dwc:year`[i]
    month <- df$`data.dwc:month`[i]
    day <- df$`data.dwc:day`[i]

    # Construct the date based on available components
    if (!is.na(year) && !is.na(month) && !is.na(day)) {
      df$datecollected[i] <- AsDate(paste(year, month, day, sep = "-"),
                                    on.parse.failure = "warn")
    } else if (!is.na(year) && !is.na(month)) {
      df$datecollected[i] <- AsDate(paste(year, month, "01", sep = "-"),
                                    on.parse.failure = "warn")
    } else if (!is.na(year)) {
      df$datecollected[i] <- AsDate(paste(year, "01", "01", sep = "-"),
                                    on.parse.failure = "warn")
    } else {
      # Handle the case where there is no information to construct a date
      df$datecollected[i] <- NA
    }
  }
}

flipTime can be installed with the following commands:

library(devtools)
install_github("Displayr/flipTime")

This code works as a workaround, but it shows the considerable overhead in the logic required to generate the "proper" datecollected. It is an n+1 problem, since each record is processed individually, whereas fixing the data at the source would require no overhead at all. A less precise way to do this using native functions would be the following:

df <- within(df, datecollected <- as.Date(df$`data.dwc:eventDate`))

This has no logic for determining dates from a separate year, month, or day, but it demonstrates that as.Date can process all values in the data frame at once, whereas something like flipTime, which is more precise, is unable to take all the values as a single parameter and transform them all at once. If there are good suggestions for accomplishing this with little to no overhead, we are open to modifying the R library to use them. From these initial attempts at the suggested change, though, getting the correct data straight from the source seems like the preferable solution at the moment.
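As one possible answer to the request above for a lower-overhead approach, here is a minimal base-R sketch that derives a start date for every record in one vectorized pass. The helper name derive_datecollected is hypothetical and not part of ridigbio. It assumes eventDate values begin with YYYY-MM-DD when they are parseable, and, unlike the loop above, it falls back to year/month/day whenever eventDate is missing or cannot be parsed.

```r
# Hypothetical helper (not part of ridigbio): vectorized start-date derivation.
derive_datecollected <- function(df) {
  to_int <- function(x) suppressWarnings(as.integer(as.character(x)))

  event <- as.character(df[["data.dwc:eventDate"]])
  year  <- to_int(df[["data.dwc:year"]])
  month <- to_int(df[["data.dwc:month"]])
  day   <- to_int(df[["data.dwc:day"]])

  # For ranges such as "1984-03-01/1984-03-15", keep the part left of the slash,
  # then keep only the leading YYYY-MM-DD (assumption: ISO-style values).
  start <- sub("/.*$", "", event)
  from_event <- as.Date(substr(start, 1, 10), format = "%Y-%m-%d")

  # Default a missing month to January, and the day to the 1st whenever the
  # month or day is missing, mirroring the rule described in this thread.
  day[is.na(day) | is.na(month)] <- 1L
  month[is.na(month)] <- 1L
  parts <- ifelse(is.na(year), NA_character_,
                  sprintf("%04d-%02d-%02d", year, month, day))
  from_parts <- as.Date(parts, format = "%Y-%m-%d")

  # Prefer eventDate; fall back to the year/month/day fields.
  out <- from_event
  use_parts <- is.na(out)
  out[use_parts] <- from_parts[use_parts]
  out
}
```

Usage would be df$datecollected <- derive_datecollected(df), leaving the four Darwin Core columns untouched. Values that cannot be interpreted as a date come back as NA rather than raising an error.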

mgaynor1 commented 8 months ago

Even with modifications to the ingestion process, the columns we return by default need to be modified.

I suggest that, by default, we return date columns that are in the Darwin Core format. We should not modify these fields at all. When selecting fields to return to users by default, we should return interpretable fields that any user could use; this change is meant to lower the learning curve for a data user. Date modification by GBIF and iDigBio is important for indexing but is not helpful for phenological studies. Additionally, datecollected is undocumented and is an internal field; we should not return a field that users cannot interpret. We should not modify any field values in any functions available within this package, and I am not suggesting any such modification.

Once a researcher or data user downloads their data with this function, they can then modify it however they wish. We actually use the four date columns above in a function here: https://github.com/nataliepatten/gatoRs/blob/main/R/remove_duplicates.R

I suggested all 4 columns because, in my experience, some collections only fill out the eventDate, while others only fill out the day, month, and year.
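To see how this plays out in a given download, a small sketch like the following tallies which date fields each record actually fills in. The helper name date_field_coverage is hypothetical, and it assumes the data frame already contains the four Darwin Core date columns named above.

```r
# Hypothetical helper: count records by which date fields are populated.
date_field_coverage <- function(df) {
  # A value counts as filled when it is neither NA nor an empty/blank string.
  filled <- function(x) !is.na(x) & nzchar(trimws(as.character(x)))

  has_event <- filled(df[["data.dwc:eventDate"]])
  has_parts <- filled(df[["data.dwc:year"]]) |
    filled(df[["data.dwc:month"]]) |
    filled(df[["data.dwc:day"]])

  status <- ifelse(has_event & has_parts, "both",
            ifelse(has_event, "eventDate only",
            ifelse(has_parts, "year/month/day only", "no date")))

  levels <- c("both", "eventDate only", "year/month/day only", "no date")
  table(factor(status, levels = levels))
}
```

Calling date_field_coverage(df) on a result set returns a four-cell table, which gives a quick check of whether dropping any one of the columns would lose date information for some records.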

jbennettufl commented 8 months ago

OK, just to be clear: you want me to remove this line: https://github.com/iDigBio/ridigbio/blob/608791e8d2fd5f9732db583e94adda4d172c3fcb/R/idig_search_records.R#L133 and replace it with "data.dwc:eventDate", "data.dwc:year", "data.dwc:month", and "data.dwc:day"? Is this correct?

mgaynor1 commented 8 months ago

Yes. Please add documentation as well.
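The agreed change can be sketched as a swap within the default field vector. This is only an illustration: current_default_fields is an assumption that mirrors DATECORRECTED_FIELDS from earlier in the thread with "datecollected" in place of the four date fields, and swap_date_fields is a hypothetical helper. The actual defaults should be checked against R/idig_search_records.R.

```r
# Assumed current defaults (verify against the package source before relying on this).
current_default_fields <- c("uuid", "occurrenceid", "catalognumber", "family",
                            "genus", "scientificname", "country",
                            "stateprovince", "geopoint", "datecollected",
                            "collector", "recordset")

# The unmodified Darwin Core date fields requested in this issue.
dwc_date_fields <- c("data.dwc:eventDate", "data.dwc:year",
                     "data.dwc:month", "data.dwc:day")

# Hypothetical helper: replace "datecollected" in place, keeping field order.
swap_date_fields <- function(fields, replacement = dwc_date_fields) {
  idx <- match("datecollected", fields)
  if (is.na(idx)) {
    return(fields)  # nothing to swap
  }
  append(fields[-idx], replacement, after = idx - 1)
}

proposed_default_fields <- swap_date_fields(current_default_fields)
```

Until the defaults change, the same vector can be passed explicitly through the fields argument of idig_search_records(), as the earlier comment does with DATECORRECTED_FIELDS. Downstream packages that read datecollected, such as spocc, would need to switch to the four fields.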