traitecoevo / infinitylists

infinitylists allows you to generate place-based species lists for anywhere in Australia, pulling collection data and iNaturalist records from the Atlas of Living Australia.
https://unsw.shinyapps.io/infinitylists/
Creative Commons Attribution 4.0 International

Records in ALA not getting retrieved by the app #50

Closed tmesaglio closed 1 year ago

tmesaglio commented 1 year ago

When executing a search for plants in the preloaded North Head polygon, the app retrieves 13 records associated with collections. However, if I use the ALA spatial tool, apply the exact same polygon, and then download that species data, it retrieves 119 collection-based records.

Some of these were rightfully excluded by our app, as we set the maximum coordinate uncertainty to 1000 m; there are a number of collections with coordinates in North Head but with very large uncertainty values (e.g. 25,000 m), so we don't want these.

BUT, from my ALA direct download, if I filter to collection records with uncertainty <= 1000 m (or with that field blank; we're accepting those too), there are actually 51 records. So 38 (51 minus the 13 the app retrieves) are somehow not coming through into our app.
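
A rough sketch of that filter in base R. The data frame is invented toy data, not the real North Head download, and the column names `recordID`, `basisOfRecord` and `coordinateUncertaintyInMeters` are assumptions about how the ALA download is laid out:

```r
# Toy stand-in for an ALA download; all values are invented for illustration.
records <- data.frame(
  recordID = c("a", "b", "c", "d", "e"),
  basisOfRecord = c("PRESERVED_SPECIMEN", "PRESERVED_SPECIMEN",
                    "HUMAN_OBSERVATION", "PRESERVED_SPECIMEN",
                    "PRESERVED_SPECIMEN"),
  coordinateUncertaintyInMeters = c(100, 25000, 10, NA, 1000),
  stringsAsFactors = FALSE
)

# Collection records only (assumed here to mean preserved specimens).
is_collection <- records$basisOfRecord == "PRESERVED_SPECIMEN"

# Accept uncertainty <= 1000 m, and also accept a blank (NA) uncertainty.
uncertainty_ok <- is.na(records$coordinateUncertaintyInMeters) |
  records$coordinateUncertaintyInMeters <= 1000

kept <- records[is_collection & uncertainty_ok, ]
nrow(kept)
```

Putting `is.na()` first matters: a plain `<= 1000` comparison returns `NA` for blank values, and those rows would otherwise not be kept cleanly.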

For at least one of these, I figured out why.

This record is getting excluded from the default maps because it's been annotated as a duplicate associated with a different representative record: https://biocache.ala.org.au/occurrences/e888493f-9b46-4736-859a-02a83f2f8f6b (as an aside, that's completely bogus, as the 'representative record' doesn't even have the same coordinates, and indeed is actually a bit outside the North Head area).

So as fix no. 1 for what seems to be a multi-pronged problem, we should ideally override the "Exclude duplicate records" data profile filter in the ALA and actually include those records. I'm yet to check how many of the 38 missing records are missing for this reason, but it's at least 1.

However, there are some records where I have no explanation for the app failing to retrieve them, e.g. https://biocache.ala.org.au/occurrences/f8f31d02-779a-457a-9b9b-739d2e29db12

I cannot find any reason at face value why it's not getting pulled in. The coordinates fall within the area, it's IDed to species, it's a collection, and the coordinate uncertainty is blank.

tmesaglio commented 1 year ago

I will check all 38 missing records tonight to see why each is missing
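
One way to list which records are missing is a set difference on record IDs. This is only a sketch: the IDs below are made-up placeholders, and it assumes both the ALA download and the app output can be reduced to a vector of record IDs:

```r
# Hypothetical record IDs from the direct ALA download and from the app.
ala_ids <- c("rec-1", "rec-2", "rec-3", "rec-4")
app_ids <- c("rec-4")

# IDs present in the ALA download but absent from the app's results.
missing_ids <- setdiff(ala_ids, app_ids)
missing_ids
```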

tmesaglio commented 1 year ago

(will keep updating this comment)

  1. https://biocache.ala.org.au/occurrences/0413de31-d4c5-4a2b-b44d-ec57bcf880f3 - no explanation
  2. https://biocache.ala.org.au/occurrences/f8f31d02-779a-457a-9b9b-739d2e29db12 - no explanation
  3. https://biocache.ala.org.au/occurrences/e888493f-9b46-4736-859a-02a83f2f8f6b - excluded due to "Exclude duplicate records" data profile filter

fontikar commented 1 year ago

OK, this has to do with the data quality profiles we are using! Here's the plan: I'm working on a new branch, 'data-profiles', where I'm going to download the data for plants with no data profile applied. This is a little risky, because the download is considerably bigger without the data quality profile... 4 million more records...

[Image: Screenshot 2023-09-08 at 8 20 34 pm]

Then I'll get you, @tmesaglio, to check if the records you wanted are in the new download.

fontikar commented 1 year ago

[Image: Screenshot 2023-09-08 at 8 22 46 pm]

Maybe we can use the AVH profile? Thoughts, @tmesaglio?

fontikar commented 1 year ago

Or we can make our own "profile" in R code as a pre-cleaning step.
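
A minimal sketch of what such a home-made profile could look like, in base R. The function name `preclean` is hypothetical, the column names are assumptions, and in particular it assumes `isDuplicateOf` is a character ID that is blank or NA for non-duplicates, which has not been confirmed against a real download:

```r
# Hypothetical pre-cleaning step: each rule can be switched on or off,
# unlike a fixed ALA data quality profile.
preclean <- function(df, max_uncertainty = 1000, drop_duplicates = FALSE) {
  # Keep records with blank uncertainty or uncertainty within the threshold.
  keep <- is.na(df$coordinateUncertaintyInMeters) |
    df$coordinateUncertaintyInMeters <= max_uncertainty
  if (drop_duplicates) {
    # Assumption: a filled-in isDuplicateOf marks the record as a duplicate.
    keep <- keep & (is.na(df$isDuplicateOf) | df$isDuplicateOf == "")
  }
  df[keep, , drop = FALSE]
}

# Toy data; values are invented for illustration.
toy <- data.frame(
  recordID = c("r1", "r2", "r3"),
  coordinateUncertaintyInMeters = c(NA, 500, 25000),
  isDuplicateOf = c(NA, "r9", NA),
  stringsAsFactors = FALSE
)

with_duplicates <- preclean(toy)
without_duplicates <- preclean(toy, drop_duplicates = TRUE)
```

With `drop_duplicates = FALSE` as the default, records flagged as duplicates stay in, which matches the wish earlier in the thread to include them.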

tmesaglio commented 1 year ago

@fontikar so the two unexplained examples above are also being excluded due to one of these filters? (If yes, which one are they falling under?)

fontikar commented 1 year ago

(will keep updating this comment)

  1. https://biocache.ala.org.au/occurrences/0413de31-d4c5-4a2b-b44d-ec57bcf880f3 - no explanation
  2. https://biocache.ala.org.au/occurrences/f8f31d02-779a-457a-9b9b-739d2e29db12 - no explanation
  3. https://biocache.ala.org.au/occurrences/e888493f-9b46-4736-859a-02a83f2f8f6b - excluded due to "Exclude duplicate records" data profile filter

I wonder if 1 and 2 are because of "Coordinate uncertainty meters invalid" under Data quality tests.

tmesaglio commented 1 year ago

It's not that reason, because there are two records that are getting pulled in by the app that also have that same invalid message: https://biocache.ala.org.au/occurrences/d83dbced-3203-4ac5-a247-4c0c3ac340f5 and https://biocache.ala.org.au/occurrences/4733f594-c29d-4093-8ea9-8a5a4a4a1df8

tmesaglio commented 1 year ago

My two missing examples and the two examples in the comment directly above that are getting pulled in have identical results under data quality tests.

fontikar commented 1 year ago

Difference in counts for all Plantae in Australia:

> # With profile
> galah_call() |> 
+   galah_identify("Plantae") |>
+   galah_apply_profile("CSDM") |> 
+   galah_filter(
+     species != "",
+     decimalLatitude != "",
+     year >= 1923,
+     basisOfRecord == c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN")
+   )  |> 
+   atlas_counts()
# A tibble: 1 × 1
     count
     <int>
1 15250795

> # Spatially valid ones 
> galah_call() |> 
+   galah_identify("Plantae") |>
+   galah_filter(
+     spatiallyValid == TRUE, 
+     species != "",
+     decimalLatitude != "",
+     year >= 1923,
+     basisOfRecord == c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN")
+   )  |> 
+   atlas_counts()
# A tibble: 1 × 1
     count
     <int>
1 20421229

> # Outlier
> galah_call() |> 
+   galah_identify("Plantae") |>
+   galah_filter(
+     outlierLayerCount < 1, 
+     species != "",
+     decimalLatitude != "",
+     year >= 1923,
+     basisOfRecord == c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN")
+   )  |> 
+   atlas_counts()
# A tibble: 1 × 1
     count
     <int>
1 20274237
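
Put as differences (a quick arithmetic sketch; the variable names are just illustrative labels, and the counts are copied from the output above), each single filter keeps roughly 5 million more records than the full CSDM profile:

```r
# Counts copied from the atlas_counts() output above.
with_csdm_profile <- 15250795
spatially_valid   <- 20421229
no_outlier_layers <- 20274237

# Extra records kept when only one filter is applied instead of the profile.
diff_spatial <- spatially_valid - with_csdm_profile    # 5170434
diff_outlier <- no_outlier_layers - with_csdm_profile  # 5023442
```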

fontikar commented 1 year ago

Fonti broke galah (growing queue). Will try tomorrow; if it's still broken, then Monday!

fontikar commented 1 year ago

We also need to download isDuplicateOf to see what sort of variable it is.

tmesaglio commented 1 year ago

Fixed by Fonti; we had to remove some of the CSDM filters applied by the ALA.