damianooldoni-bot commented 10 months ago

DO NOT MERGE, MERGE #154 OR MORE RECENT INSTEAD

Brief description

This is an automatically generated PR. The following steps are all automatically performed:

Fetch raw data
Map raw data to DwC standard and save the output in ./data/processed
Get an overview of the changes
Run some tests, e.g. check the uniqueness of occurrenceID, check that all occurrences have an eventID and scientificName, ...
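The checks named in the last step could look roughly like the sketch below. It is not the workflow's actual test code: the helper name `check_occurrences`, its return shape, and the toy data are assumptions made for illustration, and only base R is used.

```r
# Hypothetical helper (name and return shape are assumptions, not the workflow's code).
check_occurrences <- function(occ) {
  required <- c("occurrenceID", "eventID", "scientificName")
  missing_cols <- setdiff(required, names(occ))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # A value counts as blank when it is NA or an empty/whitespace-only string
  is_blank <- function(x) is.na(x) | trimws(as.character(x)) == ""
  list(
    duplicated_ids = unique(occ$occurrenceID[duplicated(occ$occurrenceID)]),
    n_missing_event_id = sum(is_blank(occ$eventID)),
    n_missing_scientific_name = sum(is_blank(occ$scientificName))
  )
}

# Toy data for illustration only
toy <- data.frame(
  occurrenceID = c("a", "b", "b"),
  eventID = c("e1", NA, "e3"),
  scientificName = c("Vespa velutina", "Ondatra zibethicus", ""),
  stringsAsFactors = FALSE
)
result <- check_occurrences(toy)
```

On the toy data, `result` reports one duplicated occurrenceID, one missing eventID and one missing scientificName.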

Note to the reviewer: the workflow automation is still in a development phase. Please check the output thoroughly before merging to main. If needed, improve the data fetching (fetch_data.Rmd) or the mapping (dwc_mapping.Rmd), both in ./src, or the GitHub workflows fetch-data.yaml and mapping_and_testing.yaml in ./.github/workflows.

PietrH commented 10 months ago

Loads of removed lines, to be investigated

PietrH commented 10 months ago

Should there be a test that triggers if a certain threshold of lines/records get removed?
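One possible shape for such a test, as a minimal sketch: compare the occurrenceIDs of the reference and the freshly mapped file and flag the run when the removed fraction exceeds a limit. The function name, the argument names and the 5% default are assumptions chosen for illustration; nothing like this exists in the repository as far as this thread shows.

```r
# Hypothetical helper: the name, arguments and 5% default are assumptions.
check_removed_fraction <- function(reference_ids, current_ids, max_fraction = 0.05) {
  # IDs present in the reference but absent from the current data
  removed <- setdiff(reference_ids, current_ids)
  n_ref <- length(unique(reference_ids))
  fraction <- if (n_ref == 0) 0 else length(removed) / n_ref
  list(
    n_removed = length(removed),
    fraction = fraction,
    ok = fraction <= max_fraction
  )
}

# Toy IDs for illustration: 10 of 100 reference records disappear, 2 are new
reference_ids <- paste0("id", 1:100)
current_ids <- c(paste0("id", 11:100), "new1", "new2")
res <- check_removed_fraction(reference_ids, current_ids)
```

In a workflow, a failing `res$ok` could be turned into an error with `stop()` so that the PR is flagged before review.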

PietrH commented 10 months ago

428 changed lines

PietrH commented 10 months ago

~6004 records are now missing the occurrence dataset. Has the filter changed?~

Correction: the dataset is 6004 records shorter, but 6431 records were actually deleted. The difference (427) is made up of new records.

PietrH commented 10 months ago

I don't see any immediate reason why these records would have been deleted.

427 new records, 6431 deleted records. No new species.

The new records cover 6 species:

scientificName	n
Vespa velutina	271
Ondatra zibethicus	150
Fallopia japonica	2
Martes foina	2
Castor fiber	1
Gallus gallus domesticus	1

The new records fall on 8 days:

date	n
2023-11-13	146
2023-11-07	70
2023-11-08	50
2023-11-09	48
2023-11-06	47
2023-11-10	37
2023-11-14	18
2023-11-03	11

PietrH commented 10 months ago

Postponed till we've had a talk with RATO

PietrH commented 10 months ago

Blocked by question + #119

PietrH commented 10 months ago

23 records from 2023 got deleted:

eventyear	n
2021	631
2022	5772
2023	29

To reproduce:

# `reference` and `deleted_records` are built as in the script in a later comment
library(dplyr)

filter(reference, occurrenceID %in% deleted_records) %>%
    mutate(eventyear = lubridate::year(eventDate)) %>%
    count(eventyear)

PietrH commented 9 months ago

RATO have restored the deleted records but changed their object IDs, thereby changing the occurrenceIDs.

To find the collisions between the data we fetch now and the missing records:

# Are the new records just the deleted records with a different occurrenceID? 

library(dplyr)

reference <- 
  readr::read_csv("https://raw.githubusercontent.com/riparias/rato-occurrences/main/data/processed/occurrence.csv",
                  show_col_types = FALSE)

current <-
  readr::read_csv("data/processed/occurrence.csv", show_col_types = FALSE)

deleted_records <- 
  dplyr::setdiff(reference$occurrenceID, current$occurrenceID)

new_records <- 
  dplyr::setdiff(current$occurrenceID, reference$occurrenceID)

# deleted records
deleted_df <- dplyr::filter(reference, occurrenceID %in% deleted_records)
# new records
new_df <- dplyr::filter(current, occurrenceID %in% new_records)

# can we use a combination of the event id and the location to identify
# observations in the case the occurrenceID was changed? No, you'd need
# something date related because the pin isn't always moved

new_df <-
  new_df %>%
  rowwise() %>%
  mutate(content_id =
           digest::digest(c(
             eventID, eventDate, verbatimLatitude, verbatimLongitude
           )))

deleted_df <-
  deleted_df %>%
  rowwise() %>%
  mutate(content_id =
           digest::digest(c(
             eventID, eventDate, verbatimLatitude, verbatimLongitude
           )))

# try to find collisions by joining

semi_join(new_df, deleted_df, by = "content_id")
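As an alternative to hashing each row, the same matching idea can be sketched by pasting the four columns into a plain text key and checking membership. This is only an illustration in base R: the column names follow the script above, while `make_key`, the `|` separator and the toy rows are assumptions.

```r
# Column names follow the script above; the toy rows are invented for illustration.
key_cols <- c("eventID", "eventDate", "verbatimLatitude", "verbatimLongitude")

# Hypothetical helper: paste the key columns row by row into one string
make_key <- function(df, cols) {
  do.call(paste, c(unname(lapply(df[cols], as.character)), sep = "|"))
}

deleted_toy <- data.frame(
  occurrenceID = c("old-1", "old-2"),
  eventID = c("ev1", "ev2"),
  eventDate = c("2022-05-01", "2022-06-01"),
  verbatimLatitude = c(51.0, 50.9),
  verbatimLongitude = c(4.1, 4.5),
  stringsAsFactors = FALSE
)
new_toy <- data.frame(
  occurrenceID = c("new-1", "new-3"),
  eventID = c("ev1", "ev3"),
  eventDate = c("2022-05-01", "2023-11-13"),
  verbatimLatitude = c(51.0, 50.8),
  verbatimLongitude = c(4.1, 4.2),
  stringsAsFactors = FALSE
)

new_toy$content_key <- make_key(new_toy, key_cols)
deleted_toy$content_key <- make_key(deleted_toy, key_cols)

# Rows of new_toy whose key also occurs among the deleted rows
restored <- new_toy[new_toy$content_key %in% deleted_toy$content_key, ]
```

A readable key makes it easier to inspect why two rows do or do not match, at the cost of assuming the separator never appears in the data.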

PietrH commented 8 months ago

RATO cannot recover the lost IDs, meaning that a large number of occurrenceIDs will change, triggering a lot of email traffic. I've discussed this with @damianooldoni, who is considering sending a warning email to the early alert users to prepare them for a full mailbox.

PietrH commented 4 months ago

A more recent update to occurrence.csv was merged

riparias / rato-occurrences

[AUTO] Update data #114
