Closed damianooldoni-bot closed 4 months ago
Loads of removed lines, to be investigated
Should there be a test that triggers if a certain threshold of lines/records get removed?
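A minimal sketch of what such a test could look like, in base R. The function name `check_removed_fraction` and the 5% default threshold are assumptions made for illustration, not something already present in the workflow:

```r
# Hypothetical helper: fail when too large a share of the reference
# occurrenceIDs is missing from the newly fetched data.
# `max_fraction` is an assumed, arbitrary threshold.
check_removed_fraction <- function(reference_ids, current_ids,
                                   max_fraction = 0.05) {
  if (length(reference_ids) == 0) {
    return(invisible(0))  # nothing to compare against
  }
  removed <- setdiff(reference_ids, current_ids)
  fraction <- length(removed) / length(reference_ids)
  if (fraction > max_fraction) {
    stop(sprintf(
      "%d of %d reference records (%.1f%%) are missing from the new data",
      length(removed), length(reference_ids), 100 * fraction
    ))
  }
  invisible(fraction)
}

# Toy IDs, for illustration only: 2 of 100 removed, 10 added
reference_ids <- paste0("id-", 1:100)
current_ids <- paste0("id-", 3:110)
removed_fraction <- check_removed_fraction(reference_ids, current_ids)
```

Comparing IDs rather than row counts matters here: a row count alone would hide deletions that are partly offset by new records.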
428 changed lines
~6004 records are now missing from the occurrence dataset. Has the filter changed?~
Correction: the dataset is 6004 records shorter, but 6431 records were actually deleted. The difference is made up of new records.
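As a quick check of that arithmetic, using the counts reported in this thread (the variable names are only for illustration):

```r
n_deleted <- 6431  # records in the reference data but not in the new data
n_new <- 427       # records in the new data but not in the reference data

# Net change in the number of rows of occurrence.csv
net_change <- n_new - n_deleted
net_change
# -6004
```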
I don't see any immediate reason why these records would have been deleted.
427 new records, 6431 deleted records. No new species.
The new records cover 6 species:
| scientificName | n |
| --- | --- |
| Vespa velutina | 271 |
| Ondatra zibethicus | 150 |
| Fallopia japonica | 2 |
| Martes foina | 2 |
| Castor fiber | 1 |
| Gallus gallus domesticus | 1 |
On 8 days:
| date | n |
| --- | --- |
| 2023-11-13 | 146 |
| 2023-11-07 | 70 |
| 2023-11-08 | 50 |
| 2023-11-09 | 48 |
| 2023-11-06 | 47 |
| 2023-11-10 | 37 |
| 2023-11-14 | 18 |
| 2023-11-03 | 11 |
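Tallies like the two above could be produced along these lines. This is only a sketch: the toy `new_df` stands in for the real set of new records, and the column names `scientificName` and `eventDate` are assumed to match the processed occurrence file.

```r
library(dplyr)

# Toy stand-in for the new records; the real data would come from
# data/processed/occurrence.csv
new_df <- tibble(
  scientificName = c("Vespa velutina", "Vespa velutina", "Ondatra zibethicus"),
  eventDate = as.Date(c("2023-11-13", "2023-11-07", "2023-11-13"))
)

# Number of new records per species, most frequent first
by_species <- count(new_df, scientificName, sort = TRUE)

# Number of new records per day, most frequent first
by_date <- new_df %>%
  mutate(date = as.Date(eventDate)) %>%
  count(date, sort = TRUE)
```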
Postponed till we've had a talk with RATO
Blocked by question + #119
23 records from 2023 got deleted:
| eventyear | n |
| --- | --- |
| 2021 | 631 |
| 2022 | 5772 |
| 2023 | 29 |
To reproduce:
    filter(reference, occurrenceID %in% deleted_records) %>%
      mutate(eventyear = lubridate::year(eventDate)) %>%
      count(eventyear)
RATO have restored the deleted records, but changed their object IDs, thus changing the occurrenceIDs.
To find the collision between the data we fetch now and the missing records:
    # Are the new records just the deleted records with a different occurrenceID?
    library(dplyr)

    reference <- readr::read_csv(
      "https://raw.githubusercontent.com/riparias/rato-occurrences/main/data/processed/occurrence.csv",
      show_col_types = FALSE
    )
    current <- readr::read_csv("data/processed/occurrence.csv", show_col_types = FALSE)

    # occurrenceIDs present in the reference but absent from the current data
    deleted_records <- dplyr::setdiff(reference$occurrenceID, current$occurrenceID)
    # occurrenceIDs present in the current data but absent from the reference
    new_records <- dplyr::setdiff(current$occurrenceID, reference$occurrenceID)

    deleted_df <- dplyr::filter(reference, occurrenceID %in% deleted_records)
    new_df <- dplyr::filter(current, occurrenceID %in% new_records)

    # Can we use a combination of the event ID and the location to identify
    # observations in case the occurrenceID was changed? No, you'd need
    # something date related because the pin isn't always moved.
    # Hash the identifying fields of each row into a content_id.
    add_content_id <- function(df) {
      df %>%
        rowwise() %>%
        mutate(content_id = digest::digest(c(
          eventID, eventDate, verbatimLatitude, verbatimLongitude
        ))) %>%
        ungroup()
    }
    new_df <- add_content_id(new_df)
    deleted_df <- add_content_id(deleted_df)

    # Try to find collisions by joining: new records whose content matches a deleted record
    semi_join(new_df, deleted_df, by = "content_id")
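If the collisions turn out to be real matches, a lookup from old to new occurrenceIDs might help when warning users about the change. The sketch below joins directly on the four columns rather than hashing them, which is a substitution for the `digest` approach above and not the same method. It uses toy data, and the names `key_cols`, `id_map`, `old_occurrenceID` and `new_occurrenceID` are invented for illustration.

```r
library(dplyr)

# Toy data for illustration; column names follow the script above
deleted_df <- tibble(
  occurrenceID = c("old-1", "old-2", "old-3"),
  eventID = c("e1", "e2", "e3"),
  eventDate = c("2022-05-01", "2022-05-02", "2022-05-03"),
  verbatimLatitude = c(51.0, 51.1, 51.2),
  verbatimLongitude = c(4.0, 4.1, 4.2)
)
new_df <- tibble(
  occurrenceID = c("new-1", "new-2", "new-9"),
  eventID = c("e1", "e2", "e9"),
  eventDate = c("2022-05-01", "2022-05-02", "2023-11-13"),
  verbatimLatitude = c(51.0, 51.1, 50.9),
  verbatimLongitude = c(4.0, 4.1, 3.9)
)

# Columns assumed to identify the same observation across both datasets
key_cols <- c("eventID", "eventDate", "verbatimLatitude", "verbatimLongitude")

# One row per deleted record that reappears under a new occurrenceID
id_map <- inner_join(
  select(deleted_df, old_occurrenceID = occurrenceID, all_of(key_cols)),
  select(new_df, new_occurrenceID = occurrenceID, all_of(key_cols)),
  by = key_cols
) %>%
  select(old_occurrenceID, new_occurrenceID)
```

In this toy case `id_map` pairs `old-1` with `new-1` and `old-2` with `new-2`; `old-3` and `new-9` stay unmatched. Joining on raw coordinates relies on exact equality, so on real data it may miss matches that differ only in rounding.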
RATO cannot recover the lost IDs, meaning that a large number of occurrenceIDs will change, triggering a lot of email traffic. I've discussed this with @damianooldoni, who is considering sending a warning email to the early alert users to prepare them for a full mailbox.
A more recent update to occurrence.csv was merged
DO NOT MERGE, MERGE #154 OR MORE RECENT INSTEAD
Brief description
This is an automatically generated PR. The following steps are all automatically performed:
./data/processed
occurrenceID, check that all occurrences have an eventID and scientificName, ...

Note to the reviewer: the workflow automation is still in a development phase. Please check the output thoroughly before merging to main. If needed, improve the data fetching (fetch_data.Rmd) and the mapping (dwc_mapping.Rmd), both in ./src, or the GitHub workflows fetch-data.yaml and mapping_and_testing.yaml in ./.github/workflows.
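For reviewers who want to re-run part of the checks by hand, a minimal sketch in base R of a completeness check on eventID and scientificName. The function name `check_required_fields` is hypothetical and this is not the code used by the workflow; it only covers the two fields named in the description above.

```r
# Hypothetical check: count missing or empty values in required columns
check_required_fields <- function(occurrence,
                                  required = c("eventID", "scientificName")) {
  missing_cols <- setdiff(required, names(occurrence))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # Named integer vector: number of NA or empty values per required column
  vapply(
    occurrence[required],
    function(x) sum(is.na(x) | x == ""),
    integer(1)
  )
}

# Toy data for illustration only
occurrence <- data.frame(
  occurrenceID = c("a", "b", "c"),
  eventID = c("e1", NA, "e3"),
  scientificName = c("Vespa velutina", "Castor fiber", "")
)
n_missing <- check_required_fields(occurrence)
```

On the toy data each of the two columns has one missing value, so a reviewer would expect both counts to be zero on a healthy occurrence.csv.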