riparias / rato-occurrences

DwC mapping of RATO vwz occurrences
MIT License

[AUTO] Update data #114

Closed: damianooldoni-bot closed this 4 months ago

damianooldoni-bot commented 10 months ago

DO NOT MERGE, MERGE #154 OR MORE RECENT INSTEAD


Brief description

This is an automatically generated PR. The following steps are all automatically performed:

Note to the reviewer: the workflow automation is still in a development phase. Please check the output thoroughly before merging to main. If needed, improve the data fetching in fetch_data.Rmd or the mapping in dwc_mapping.Rmd (both in ./src), or the GitHub workflows fetch-data.yaml and mapping_and_testing.yaml in ./.github/workflows.

PietrH commented 10 months ago

Loads of removed lines, to be investigated

PietrH commented 10 months ago

Should there be a test that triggers if a certain threshold of lines/records get removed?
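A minimal sketch of what such a check could look like, in base R. The helper name `check_removed_records()` and the 5% default threshold are assumptions for illustration only; nothing like this is confirmed to exist in the repository, and the right threshold would need to be agreed on.

```r
# Hypothetical helper, not part of the repository: stop with an error when
# too large a share of the previously published occurrenceIDs is missing
# from the newly generated data.
check_removed_records <- function(reference_ids, current_ids,
                                  max_fraction = 0.05) {
  # Assumption: the reference dataset is never empty
  stopifnot(length(reference_ids) > 0)
  n_reference <- length(unique(reference_ids))
  # setdiff() returns the unique IDs present in reference but not in current
  removed <- setdiff(reference_ids, current_ids)
  fraction_removed <- length(removed) / n_reference
  if (fraction_removed > max_fraction) {
    stop(sprintf(
      "%d of %d reference records (%.1f%%) are missing; threshold is %.1f%%",
      length(removed), n_reference,
      100 * fraction_removed, 100 * max_fraction
    ))
  }
  invisible(fraction_removed)
}

# Toy example: 2 of 10 reference IDs are gone, 1 new ID was added
reference_ids <- paste0("id-", 1:10)
current_ids <- c(paste0("id-", 3:10), "id-11")

# Passes with a lenient threshold of 25%
check_removed_records(reference_ids, current_ids, max_fraction = 0.25)

# Fails with a stricter threshold of 10%; capture the message instead of stopping
result <- tryCatch(
  check_removed_records(reference_ids, current_ids, max_fraction = 0.1),
  error = function(e) conditionMessage(e)
)
```

In practice the two ID vectors would come from the `occurrenceID` column of the published and the freshly generated occurrence.csv. New records do not count towards the threshold, only removed ones.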

PietrH commented 10 months ago

428 changed lines

PietrH commented 10 months ago

~6004 records are now missing from the occurrence dataset. Has the filter changed?~

Correction: the dataset is 6004 records shorter, but 6431 records were actually deleted. The difference (427) consists of new records.

PietrH commented 10 months ago

I don't see any immediate reason why these records would have been deleted.

427 new records, 6431 deleted records. No new species.

6 species:

| scientificName | n |
| --- | --- |
| Vespa velutina | 271 |
| Ondatra zibethicus | 150 |
| Fallopia japonica | 2 |
| Martes foina | 2 |
| Castor fiber | 1 |
| Gallus gallus domesticus | 1 |

On 8 days:

| date | n |
| --- | --- |
| 2023-11-13 | 146 |
| 2023-11-07 | 70 |
| 2023-11-08 | 50 |
| 2023-11-09 | 48 |
| 2023-11-06 | 47 |
| 2023-11-10 | 37 |
| 2023-11-14 | 18 |
| 2023-11-03 | 11 |

PietrH commented 10 months ago

Postponed till we've had a talk with RATO

PietrH commented 10 months ago

Blocked by question + #119

PietrH commented 10 months ago

23 records from 2023 got deleted:

| eventyear | n |
| --- | --- |
| 2021 | 631 |
| 2022 | 5772 |
| 2023 | 29 |

To reproduce:

library(dplyr)

# Count the deleted records per event year
filter(reference, occurrenceID %in% deleted_records) %>%
  mutate(eventyear = lubridate::year(eventDate)) %>%
  count(eventyear)

PietrH commented 9 months ago

RATO have restored the deleted records, but with different object IDs, which changes the occurrenceIDs.

To find collisions between the data we fetch now and the missing records:

# Are the new records just the deleted records with a different occurrenceID? 

library(dplyr)

reference <- 
  readr::read_csv("https://raw.githubusercontent.com/riparias/rato-occurrences/main/data/processed/occurrence.csv",
                  show_col_types = FALSE)

current <-
  readr::read_csv("data/processed/occurrence.csv", show_col_types = FALSE)

deleted_records <- 
  dplyr::setdiff(reference$occurrenceID, current$occurrenceID)

new_records <- 
  dplyr::setdiff(current$occurrenceID, reference$occurrenceID)

# deleted records
deleted_df <- dplyr::filter(reference, occurrenceID %in% deleted_records)
# new records
new_df <- dplyr::filter(current, occurrenceID %in% new_records)

# can we use a combination of the event id and the location to identify
# observations in the case the occurrenceID was changed? No, you'd need
# something date related because the pin isn't always moved

new_df <-
  new_df %>%
  rowwise() %>%
  mutate(content_id =
           digest::digest(c(
             eventID, eventDate, verbatimLatitude, verbatimLongitude
           )))

deleted_df <-
  deleted_df %>%
  rowwise() %>%
  mutate(content_id =
           digest::digest(c(
             eventID, eventDate, verbatimLatitude, verbatimLongitude
           )))

# try to find collisions by joining

semi_join(new_df, deleted_df, by = "content_id")

PietrH commented 8 months ago

RATO cannot recover the lost IDs, so a large number of occurrenceIDs will change, triggering a lot of email traffic. I've discussed this with @damianooldoni, who is considering sending a warning email to the early alert users to prepare them for a full mailbox.

PietrH commented 4 months ago

A more recent update to occurrence.csv was merged.