Closed by tmesaglio 1 year ago
I will check all 38 missing records tonight to see why each is missing
(will keep updating this comment)
OK, this has to do with the data quality profiles we are using! Here's the plan: I'm working on a new branch, 'data-profiles', and I'm going to download the data for plants with no data profile applied. This is a little risky, because the download is considerably bigger without the data quality profile: about 4 million more records.
Then I'll get you, @tmesaglio, to check whether the records you wanted are in the new download.
Maybe we can use the AVH profile? Thoughts, @tmesaglio?
Or we can make our own "profile" in R code as a precleaning step.
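A minimal sketch of what such a precleaning step might look like, assuming the unfiltered download is a data frame with Darwin Core style columns (`species`, `decimalLatitude`, `year`, `basisOfRecord`, `coordinateUncertaintyInMeters`). The function name `apply_custom_profile()` and the toy records are hypothetical; the rules mirror the filters in our galah query plus our 1000 m uncertainty cut-off, and are not the ALA's own profile definitions.

```r
# Hypothetical precleaning "profile": keep a record only if it passes every rule.
apply_custom_profile <- function(occ, min_year = 1923, max_uncertainty = 1000) {
  has_species <- !is.na(occ$species) & occ$species != ""
  has_coords  <- !is.na(occ$decimalLatitude)
  recent      <- !is.na(occ$year) & occ$year >= min_year
  basis_ok    <- occ$basisOfRecord %in% c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN")
  # Blank (NA) uncertainty is accepted; only known large values are dropped.
  uncert_ok   <- is.na(occ$coordinateUncertaintyInMeters) |
    occ$coordinateUncertaintyInMeters <= max_uncertainty
  occ[has_species & has_coords & recent & basis_ok & uncert_ok, , drop = FALSE]
}

# Toy records (made up for illustration only)
occ <- data.frame(
  recordID = c("a", "b", "c", "d"),
  species = c("Acacia longifolia", "", "Banksia ericifolia", "Banksia serrata"),
  decimalLatitude = c(-33.82, -33.81, -33.82, NA),
  year = c(1950, 2001, 1910, 2015),
  basisOfRecord = c("PRESERVED_SPECIMEN", "HUMAN_OBSERVATION",
                    "PRESERVED_SPECIMEN", "HUMAN_OBSERVATION"),
  coordinateUncertaintyInMeters = c(NA, 50, 100, 10),
  stringsAsFactors = FALSE
)

cleaned <- apply_custom_profile(occ)
print(cleaned$recordID)
```

Keeping each rule as its own logical vector should make it easy to switch a single rule off (for example, a duplicate check) without touching the others.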
@fontikar so the two unexplained examples above are also being excluded due to one of these filters? (If yes, which one are they falling under?)
(will keep updating this comment)
- https://biocache.ala.org.au/occurrences/0413de31-d4c5-4a2b-b44d-ec57bcf880f3 - no explanation
- https://biocache.ala.org.au/occurrences/f8f31d02-779a-457a-9b9b-739d2e29db12 - no explanation
- https://biocache.ala.org.au/occurrences/e888493f-9b46-4736-859a-02a83f2f8f6b - excluded due to "Exclude duplicate records" data profile filter
I wonder if 1 and 2 are excluded because of "Coordinate uncertainty meters invalid" under Data quality tests
It's not that, because two records that the app does pull in have the same invalid message: https://biocache.ala.org.au/occurrences/d83dbced-3203-4ac5-a247-4c0c3ac340f5 https://biocache.ala.org.au/occurrences/4733f594-c29d-4093-8ea9-8a5a4a4a1df8
Under data quality tests, my two missing examples and the two pulled-in examples from the comment directly above have identical results.
Difference in counts for all Plantae in Australia:
# With profile
galah_call() |>
+ galah_identify("Plantae") |>
+ galah_apply_profile("CSDM") |>
+ galah_filter(
+ species != "",
+ decimalLatitude != "",
+ year >= 1923,
+ basisOfRecord == c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN")
+ ) |>
+ atlas_counts()
# A tibble: 1 × 1
count
<int>
1 15250795
> # Spatially valid ones
> galah_call() |>
+ galah_identify("Plantae") |>
+ galah_filter(
+ spatiallyValid == TRUE,
+ species != "",
+ decimalLatitude != "",
+ year >= 1923,
+ basisOfRecord == c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN")
+ ) |>
+ atlas_counts()
# A tibble: 1 × 1
count
<int>
1 20421229
> # Outlier
> galah_call() |>
+ galah_identify("Plantae") |>
+ galah_filter(
+ outlierLayerCount < 1,
+ species != "",
+ decimalLatitude != "",
+ year >= 1923,
+ basisOfRecord == c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN")
+ ) |>
+ atlas_counts()
# A tibble: 1 × 1
count
<int>
1 20274237
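To make the size of the gap explicit, here is a small sketch that subtracts the CSDM-profile count from each of the two single-filter counts. The numbers are copied from the three `atlas_counts()` results above; the vector names (`csdm_profile`, `spatially_valid`, `no_outlier`) are just labels made up for this example. Either single filter on its own returns roughly 5 million more records than the full profile.

```r
# Counts copied from the three atlas_counts() results above
counts <- c(
  csdm_profile    = 15250795,
  spatially_valid = 20421229,
  no_outlier      = 20274237
)

# Extra records each single-filter query returns relative to the CSDM profile
extra <- counts[c("spatially_valid", "no_outlier")] - counts[["csdm_profile"]]
print(extra)

# The same difference as a percentage of the CSDM-profile count
print(round(100 * extra / counts[["csdm_profile"]], 1))
```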
Fonti broke galah (growing queue), will try tomorrow; if it's still broken, then Monday!
Need to download isDuplicateOf to see what sort of variable it is too
Fixed by Fonti: we had to remove some of the CSDM filters applied by the ALA
When I search for plants in the preloaded North Head polygon, the app retrieves 13 records associated with collections. However, if I use the ALA spatial tool, apply the exact same polygon, and then download that species data, it retrieves 119 collection-based records.
Some of these were rightfully excluded by our app, as we set the maximum threshold for coordinate uncertainty to be <= 1000 m; there are a number of collections with coordinates in North Head but with very large uncertainty values (e.g. 25,000 m), so we don't want these.
BUT, from my ALA direct download, if I filter to collection records with uncertainty <= 1000 m (or with that field blank; we're accepting those too), there are actually 51 records. So 38 are somehow not coming through into our app.
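The comparison described above can be sketched roughly as follows. This assumes both downloads are data frames sharing a `recordID` column and that blank uncertainty is read in as `NA`; the data frames `ala_download` and `app_records` and all values in them are made up for illustration.

```r
# Toy stand-ins for the two downloads (IDs and values are made up)
ala_download <- data.frame(
  recordID = c("r1", "r2", "r3", "r4", "r5"),
  basisOfRecord = c("PRESERVED_SPECIMEN", "PRESERVED_SPECIMEN",
                    "PRESERVED_SPECIMEN", "HUMAN_OBSERVATION",
                    "PRESERVED_SPECIMEN"),
  coordinateUncertaintyInMeters = c(NA, 25000, 500, 10, 1000),
  stringsAsFactors = FALSE
)
app_records <- data.frame(recordID = c("r3"), stringsAsFactors = FALSE)

# Collections whose uncertainty is blank (NA) or at most 1000 m
u <- ala_download$coordinateUncertaintyInMeters
expected <- ala_download[
  ala_download$basisOfRecord == "PRESERVED_SPECIMEN" & (is.na(u) | u <= 1000), ]

# Records we expected but the app did not return
missing_ids <- setdiff(expected$recordID, app_records$recordID)
print(missing_ids)
```

Running the same `setdiff()` on the real downloads should give the list of record IDs to check one by one on biocache.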
For at least one of these, I figured out why.
For this record: https://biocache.ala.org.au/occurrences/e888493f-9b46-4736-859a-02a83f2f8f6b. It's getting excluded from the default maps because it's been annotated as a duplicate record associated with a different representative record (as an aside, that's completely bogus, as the 'representative record' doesn't even have the same coordinates, and indeed is actually a bit outside the North Head area).
So as fix no. 1 to address what seems to be a multi-pronged problem, we should ideally override the "Exclude duplicate records" data profile filter in the ALA and actually include those records. I'm yet to check how many of the 38 missing records are excluded for this reason, but it's at least 1.
However, there are some records where I have no explanation for the app failing to retrieve them. E.g. https://biocache.ala.org.au/occurrences/f8f31d02-779a-457a-9b9b-739d2e29db12
I cannot find any reason at face value why it's not getting pulled in. The coordinates fall within the area, it's IDed to species, it's a collection, and coordinate uncertainty is blank.