Request only observations from GBIF?

niconoe commented 2 years ago

Following #80: should we only request observations (human and machine-based) at GBIF?

Or should we actually load all occurrences (but still call them observations in the user interface)? I guess it would be important to have a look at existing GBIF data to see what we would lose by filtering more aggressively.

If we decide to go that route, we should also amend the recommendations to publishers to make sure they explicitly flag their observations as observations.

niconoe commented 2 years ago

With the current list of species, we wouldn't lose many records:

[screenshot]

Summary for @damianooldoni and @timadriaens because I'd like your opinion here: we recently updated the Early Alert user interface so it uses the "observation" terminology everywhere (it was previously "occurrences"). I was therefore thinking that the system would be a bit clearer/tighter if we applied a GBIF filter when doing the data download (so only human/machine observations are returned). Side note: on the publishing side, we should also make sure the providers correctly set basisOfRecord to some kind of observation.

timadriaens commented 2 years ago

I would indeed stick to observation in the user interface. I guess users can always go to the actual record to explore whether it is a human observation or a machine observation. The thing with preserved specimens, of course, is that they do contain actual distribution records (and sometimes mistakenly the coordinates of the museum instead), so for most scientific applications we need them in the download. However, for early warning, they are less relevant. By the time something ends up in the herbarium or the collection, it is too late, and in essence collecting is also "responding" to the thing, no? Totally agree that it is important to educate publishers on the BasisOfRecord.

damianooldoni commented 1 year ago

The only things we do not want in the alert tool are living specimens and fossils. Example from the TrIAS project, where GBIF data were extensively used as input data: https://github.com/trias-project/occ-cube/blob/37b1e8c5fd52c7146a5ff5ef564f40c01edf7895/src/1_download.Rmd#L47-L57

IMO we should specify it in the query as you did for year > 2000 (see #215)? No big difference in results for the data we already have (very few records will be removed), but we don't know about the future, do we?

timadriaens commented 1 year ago

As outlined above: we won't ever rapid-respond to fossils, and collected specimens have essentially been responded to already... Additional element: we want this to be an early alert tool for new observations in the field, not a tool for new data publication of interesting occurrences. So I am all in for a smart selection of occurrences at query level and I would dare to exclude any categories that are not "rapid":

Material sample
Material citation
Preserved specimen
Fossil specimen
Living specimen

Wildlife camera images are "Machine observations", so we definitely need these. There appears to be a category of BasisOfRecord called "occurrence", which I find a bit strange, but we need those as well.

Additionally, eDNA is also relevant to many invasive species in RIPARIAS and beyond, and could be "rapid enough" for a response. I looked around a bit and it seems that "for the time being there is a lack of an appropriate value in the BasisOfRecord vocabulary for these data types" and that the recommendation is that these should go under MaterialSample for now (table)?

timadriaens commented 1 year ago

@niconoe maybe we already discussed this, but I assume we also have filtering on a minimum CoordinateUncertainty (this is tricky as the field is not always filled in)?

damianooldoni commented 1 year ago

I can agree with Tim's idea to exclude occurrences that do not lend themselves to a rapid response. As eDNA could go under materialSample, I would keep that category in. We should then exclude:

Material citation
Preserved specimen
Fossil specimen
Living specimen

About CoordinateUncertaintyInMeters, I prefer not to have any preliminary filter on it.
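
The exclusion list above could be expressed as a predicate for a GBIF occurrence download. This is only a minimal sketch, assuming the JSON predicate format of the GBIF download API (`not` wrapping an `in` on `BASIS_OF_RECORD`); the constant and function names are hypothetical and not taken from the gbif-alert code base.

```python
# Hypothetical helper: the names below are illustrative only.
# basisOfRecord values to exclude, as listed in this thread
# (material samples stay in so that eDNA records are kept).
EXCLUDED_BASIS_OF_RECORD = [
    "MATERIAL_CITATION",
    "PRESERVED_SPECIMEN",
    "FOSSIL_SPECIMEN",
    "LIVING_SPECIMEN",
]


def basis_of_record_predicate(excluded=EXCLUDED_BASIS_OF_RECORD):
    """Build a predicate that rejects the given basisOfRecord values.

    Everything not listed (human/machine observations, occurrences,
    material samples) passes through untouched.
    """
    return {
        "type": "not",
        "predicate": {
            "type": "in",
            "key": "BASIS_OF_RECORD",
            "values": list(excluded),
        },
    }
```

Writing the filter as an exclusion rather than an allow-list means that any value not named here, including "occurrence" and "Material sample", is still downloaded.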

timadriaens commented 1 year ago

OK, fine, I just thought that if data publishers do not see their data for that reason this could stimulate them to publish at higher spatial resolutions. But maybe this is more something for the data mobilization.

timadriaens commented 1 year ago

and do we exclude at least the records without any coordinates (e.g. records from "Belgium" or from "Flemish region" without an actual coordinate)?

damianooldoni commented 1 year ago

They are filtered at a later stage, not while requesting the data from GBIF. @niconoe: we could indeed filter while querying GBIF, of course. However, @timadriaens: it doesn't really reduce the number of observations downloaded. Based on the data on the dev version (https://dev-alert.riparias.be/about-data), we would remove 51 obs at most, as indicated by

Skipped observations in GBIF download: 51

The main reason is that on the dev version we are already filtering by year (year >= 2000), which removes a lot of old "obs" without coords (901 at most), as you can see by comparing with https://alert.riparias.be/about-data

However, if we extend the list of species, we are going to have more observations without coords. @niconoe: I think it's better to include hasCoordinate = TRUE in the download query.
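
For illustration, the year and coordinate filters discussed here could be combined in one download predicate. This is a rough sketch, assuming the GBIF download API predicate format (`and`, `in`, `greaterThanOrEquals`, `equals`); the function name is hypothetical and the taxon keys passed in are placeholders, not the project's species list.

```python
def build_predicate(taxon_keys, min_year=2000):
    """Combine taxon, year and coordinate filters into one predicate.

    taxon_keys: iterable of GBIF taxon keys (placeholders in this sketch).
    min_year:   records from this year onwards are kept (year >= min_year).
    """
    return {
        "type": "and",
        "predicates": [
            # Restrict the download to the species of interest.
            {"type": "in", "key": "TAXON_KEY",
             "values": [str(k) for k in taxon_keys]},
            # Same cut-off as the year filter mentioned in this thread.
            {"type": "greaterThanOrEquals", "key": "YEAR",
             "value": str(min_year)},
            # Drop records without coordinates before they are downloaded.
            {"type": "equals", "key": "HAS_COORDINATE", "value": "true"},
        ],
    }
```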

niconoe commented 1 year ago

@damianooldoni: since the discussion has moved in many interesting but different directions here 🥲

If I understand correctly, the only action point for me right now is to use hasCoordinate = TRUE in the query? As discussed above, it's indeed a bit cleaner but shouldn't actually change the number of visible observations (fewer will be downloaded, but fewer will also be skipped when copied to the database).
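
The point about downloaded versus skipped records can be checked with a toy example. This is only a sketch with made-up records and hypothetical function names; it mimics the two steps (download, then import) and does not reflect the actual import code.

```python
def has_coords(record):
    """True when both latitude and longitude are present."""
    return record["lat"] is not None and record["lon"] is not None


def download(records, has_coordinate_only):
    """Mimic the GBIF download, optionally with hasCoordinate = TRUE."""
    if has_coordinate_only:
        return [r for r in records if has_coords(r)]
    return list(records)


def visible_after_import(downloaded):
    """Mimic the import step, which skips records without coordinates."""
    return [r for r in downloaded if has_coords(r)]


# Made-up records, for illustration only.
records = [
    {"id": 1, "lat": 50.85, "lon": 4.35},
    {"id": 2, "lat": None, "lon": None},
    {"id": 3, "lat": 51.05, "lon": 3.72},
]

for flag in (False, True):
    downloaded = download(records, flag)
    visible = visible_after_import(downloaded)
    skipped = len(downloaded) - len(visible)
    print(flag, len(downloaded), skipped, len(visible))
```

With these toy records, both runs end with two visible observations; only the number of downloaded and skipped records differs.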

damianooldoni commented 1 year ago

Yes, hasCoordinate = TRUE should be added, indeed. I think you should also add the filter on basisOfRecord as pointed out in my comment: https://github.com/riparias/early-alert-webapp/issues/83#issuecomment-1525720411 But it's not a dramatic improvement. We will only remove a very small number of living specimens.

damianooldoni commented 1 year ago

@niconoe: maybe filtering these obs out immediately with a stricter GBIF API query (see my very last comment above) can help to optimize the import a little? See #243
