Open niconoe opened 2 years ago
With the current list of species, we wouldn't lose many records:
Summary for @damianooldoni and @timadriaens because I'd like your opinion here: we recently updated the Early Alert user interface so it uses the "observation" terminology everywhere (it was previously "occurrences"). I was therefore thinking that the system would be a bit clearer/tighter if we applied a GBIF filter when doing the data download (so only human/machine observations are returned). Side note: on the publishing side, we should also make sure the providers correctly set the basisOfRecord to some kind of observation.
I would indeed stick to "observation" in the user interface. I guess users can always go to the actual record to explore whether it is a human observation or a machine observation. The thing with preserved specimens, of course, is that they do contain actual distribution records (and sometimes, mistakenly, the coordinates of the museum instead), so for most scientific applications we need them in the download. However, for early warning, they are less relevant. By the time something ends up in the herbarium or the collection, it is too late, and in essence collecting is also "responding" to the thing, no? Totally agree that it is important to educate publishers on the BasisOfRecord.
The only things we do not want in the alert tool are living specimens and fossils. Example from the TrIAS project, where GBIF data were extensively used as input data: https://github.com/trias-project/occ-cube/blob/37b1e8c5fd52c7146a5ff5ef564f40c01edf7895/src/1_download.Rmd#L47-L57
IMO we should specify it in the query, as you did for year > 2000 (see #215). No big difference in results for the data we already have (very few records will be removed), but we don't know about the future, do we?
As outlined above: we won't ever mount a rapid response to fossils, and collected specimens have essentially been responded to already... Additional element: we want this to be an early alert tool for new observations in the field, not a tool for new data publication of interesting occurrences. So I am all in for a smart selection of occurrences at query level, and I would dare to exclude any categories that are not "rapid":
Material sample, Material citation, Preserved specimen, Fossil specimen, Living specimen
Wildlife camera images are "Machine observations", so we definitely need these. There appears to be a category of BasisOfRecord called "occurrence", which I find a bit strange, but we need those as well.
Additionally, eDNA is also relevant to many invasive species in RIPARIAS and beyond, and could be "rapid enough" for a response. I looked around a bit and it seems that "for the time being there is a lack of an appropriate value in the BasisOfRecord vocabulary for these data types" and that the recommendation is that these should go under MaterialSample for now (table)?
@niconoe maybe we already discussed this but I assume we also have filtering on a minimum CoordinateUncertainty (this is tricky as it is not always a field that is filled)?
I can agree with Tim's idea to exclude occurrences which do not lend themselves to a rapid response. As eDNA could go under materialSample, I would include it. We should then exclude:
About CoordinateUncertaintyInMeters, I prefer to not have any preliminary filter on it.
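The basisOfRecord selection discussed above could be sketched as a small check like the one below. This is only an illustration, not the webapp's code: the function name `is_rapid_response_candidate` is hypothetical, and the excluded set is one reading of the thread (Tim's "not rapid" categories minus Material sample, which is kept because eDNA records currently go there). Values are assumed to be in the upper-case form used by the GBIF API.

```python
# Assumption: excluded categories = Tim's list minus MATERIAL_SAMPLE
# (kept so that eDNA records are not dropped).
EXCLUDED_BASIS_OF_RECORD = frozenset({
    "MATERIAL_CITATION",
    "PRESERVED_SPECIMEN",
    "FOSSIL_SPECIMEN",
    "LIVING_SPECIMEN",
})


def is_rapid_response_candidate(basis_of_record: str) -> bool:
    """Return True unless the value is one of the excluded categories.

    Expects GBIF API style values such as "HUMAN_OBSERVATION";
    surrounding whitespace and letter case are ignored.
    """
    return basis_of_record.strip().upper() not in EXCLUDED_BASIS_OF_RECORD


if __name__ == "__main__":
    for value in ("HUMAN_OBSERVATION", "MACHINE_OBSERVATION",
                  "MATERIAL_SAMPLE", "PRESERVED_SPECIMEN"):
        print(value, is_rapid_response_candidate(value))
```

Writing it as an exclusion list means that categories nobody objected to (human and machine observations, "occurrence") pass through without being enumerated.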
OK, fine, I just thought that if data publishers do not see their data for that reason this could stimulate them to publish at higher spatial resolutions. But maybe this is more something for the data mobilization.
And do we at least exclude the records without any coordinates (e.g. records from "Belgium" or from "Flemish region" without an actual coordinate)?
They are filtered at a later stage, not while requesting the data from GBIF. @niconoe: we could indeed filter while querying the data from GBIF, of course. However, @timadriaens: it doesn't really reduce the number of observations downloaded. Based on the data on the dev version (https://dev-alert.riparias.be/about-data), we would remove 51 obs at most, as indicated by
Skipped observations in GBIF download: 51
The main reason is that on the dev version we are already filtering by year (year >= 2000) and so removing a lot of old "obs" without coords (901 at most), as you can see by comparing with https://alert.riparias.be/about-data
However, if we are extending the list of species, we are going to have more observations without coords.
@niconoe: I think it's better to include hasCoordinate= TRUE in the download query.
@damianooldoni: since the discussion has moved to many interesting but different directions here 🥲
If I understand correctly, the only action point for me right now is to use hasCoordinate= TRUE in the query? As discussed above, it's indeed a bit cleaner but shouldn't actually change the number of visible observations (fewer will be downloaded, but fewer will also be skipped when copied to the database).
Yes, hasCoordinate= TRUE should be added, indeed.
I think you should also add the filter on basisOfRecord as pointed out in my comment: https://github.com/riparias/early-alert-webapp/issues/83#issuecomment-1525720411
But it's not a dramatic improvement. We will then remove only a very few living specimens.
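For illustration, the filters agreed on in this thread (hasCoordinate, year >= 2000, and the basisOfRecord exclusion) could be combined into a single GBIF download predicate roughly as follows. This is a sketch under assumptions, not the webapp's actual query: the function name `build_download_predicate` and the taxon keys in the usage example are hypothetical, and the excluded basisOfRecord values are one reading of the discussion above. The predicate structure (`and`, `in`, `equals`, `greaterThanOrEquals`, `not`) follows the JSON format of the GBIF occurrence download API.

```python
import json

# Assumption: categories considered "not rapid" in this thread,
# with MATERIAL_SAMPLE kept because of eDNA.
EXCLUDED_BASIS_OF_RECORD = [
    "MATERIAL_CITATION",
    "PRESERVED_SPECIMEN",
    "FOSSIL_SPECIMEN",
    "LIVING_SPECIMEN",
]


def build_download_predicate(taxon_keys, min_year=2000):
    """Build the "predicate" part of a GBIF download request body."""
    return {
        "type": "and",
        "predicates": [
            # Species of interest (GBIF taxon keys).
            {"type": "in", "key": "TAXON_KEY",
             "values": [str(key) for key in taxon_keys]},
            # Only records with coordinates.
            {"type": "equals", "key": "HAS_COORDINATE", "value": "true"},
            # Only records from min_year onwards.
            {"type": "greaterThanOrEquals", "key": "YEAR",
             "value": str(min_year)},
            # Drop the categories that are not relevant for early alert.
            {"type": "not", "predicate": {
                "type": "in", "key": "BASIS_OF_RECORD",
                "values": EXCLUDED_BASIS_OF_RECORD}},
        ],
    }


if __name__ == "__main__":
    # Placeholder taxon keys, for illustration only.
    print(json.dumps(build_download_predicate([1111111, 2222222]), indent=2))
```

Filtering at query level mainly shrinks the download; records without coordinates would otherwise be skipped at import anyway, so the number of visible observations should stay nearly the same.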
@niconoe: maybe filtering these obs out immediately by a stricter GBIF API query (see my very last comment above) can help in some way to minimally optimize the import? See #243
Following #80: should we only request observations (human and machine-based) at GBIF?
Or should we actually load all occurrences (but still call them observations in the user interface)? I guess it would be important to have a look at existing GBIF data to see what we would lose by filtering more aggressively on this. If we decide to go that route, we should also amend the recommendations to publishers to make sure they explicitly flag their observations as observations.