jhpoelen / effechecka

create taxonomic checklists and monitor biodiversity data access
MIT License

unexpected results in the Arctic? #47

Closed jhammock closed 7 years ago

jhammock commented 8 years ago

I'm not actually sure of this comparison; I just realized I can't read effechecka query URLs as well as I thought, so I'm going mostly by the map projection images. I'm seeing a striking discrepancy in the number of taxa returned by effechecka and iDigBio.

I compared this query: http://www.effechecka.org/#height=168&lat=81&limit=20&lng=-180&taxonSelector=Mammalia%2CMammalia&traitSelector=bodyMass%20%3E%2010%20kg&width=510&wktString=ENVELOPE%280.703125%2C-0.703125%2C90%2C65.07213008560697%29&zoom=1

with this iDigBio query, which looks similar:

The download you requested from iDigBio is ready and can be retrieved from: http://s.idigbio.org/idigbio-downloads/99050c65-9501-4f63-a59d-f55d449ce4ba.zip

The query that produced this dataset was: {"core_type": "records", "rq": {"geopoint": {"type": "geo_bounding_box", "bottom_right": {"lat": 66, "lon": 179.99999}, "top_left": {"lat": 89.99999, "lon": -179.99999}}, "class": "mammalia"}, "form": "dwca-csv", "core_source": "indexterms", "mediarecord_fields": null, "record_fields": null, "mq": null}

The iDigBio results include 371 taxa (~300 of them species or subspecies), and the effechecka results include 66 species. I expect a lot of the difference may come from names filtering (there were other odd things in the iDigBio list, like question marks). I can't decipher the minimum latitude from the effechecka URL, so it may not be as comparable to the iDigBio query (>66 degrees N) as I thought. If that's the case, then the projection on one of the two map interfaces may be deceptive near the poles. Wikipedia places the 66th parallel in a position that visually resembles both these queries.

Minutiae?

- I selected two "mammalia" from the effechecka menu, but it's always possible this didn't capture everything in the intended group. Oh, names...
- GBIF is excluded from the comparison because their map query interface makes me sad. Anyway, I daresay it overlaps with iDigBio quite a bit.

jhpoelen commented 8 years ago

Just checked the envelope you used in the effechecka URL, "ENVELOPE(0.703125,-0.703125,90,65.07213008560697)", and it does look similar to the iDigBio query you entered. Also, I wrote a unit test (see code commit) to check that it includes at least one point in the Arctic (i.e. (lat, lng) = (89.5, 20.0)). The minimum latitude of the envelope you used is about 65.1, about 0.9 degrees below 66.0.
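For anyone who wants to read the envelope out of an effechecka URL themselves, here is a minimal Python sketch. It is not the unit test from the commit. It assumes the argument order ENVELOPE(minX, maxX, maxY, minY), which matches the 65.1 minimum latitude mentioned above, and it assumes that a minimum longitude greater than the maximum means the box wraps across the antimeridian. The function names are made up for illustration.

```python
import re
from urllib.parse import urlparse, parse_qs

# The effechecka URL quoted earlier in this thread.
URL = ("http://www.effechecka.org/#height=168&lat=81&limit=20&lng=-180"
       "&taxonSelector=Mammalia%2CMammalia"
       "&traitSelector=bodyMass%20%3E%2010%20kg&width=510"
       "&wktString=ENVELOPE%280.703125%2C-0.703125%2C90%2C65.07213008560697%29"
       "&zoom=1")


def parse_envelope(wkt):
    """Return (min_lng, max_lng, max_lat, min_lat) from an ENVELOPE string.

    Assumption: arguments are ordered minX, maxX, maxY, minY.
    """
    match = re.fullmatch(r"\s*ENVELOPE\s*\((.*)\)\s*", wkt)
    if match is None:
        raise ValueError(f"not an ENVELOPE string: {wkt!r}")
    parts = [float(p) for p in match.group(1).split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 numbers, got {len(parts)}")
    return tuple(parts)


def envelope_from_url(url):
    """Pull the wktString parameter out of the URL fragment and parse it."""
    params = parse_qs(urlparse(url).fragment)
    return parse_envelope(params["wktString"][0])


def contains(envelope, lat, lng):
    """Check whether a (lat, lng) point falls inside the envelope."""
    min_lng, max_lng, max_lat, min_lat = envelope
    if not (min_lat <= lat <= max_lat):
        return False
    if min_lng <= max_lng:
        return min_lng <= lng <= max_lng
    # Assumption: min_lng > max_lng means the box wraps across the antimeridian.
    return lng >= min_lng or lng <= max_lng
```

With these helpers, `contains(envelope_from_url(URL), 89.5, 20.0)` checks the same Arctic point as the unit test, and the fourth element of the parsed tuple is the minimum latitude to compare against iDigBio's 66.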

The iDigBio version used by effechecka is from June 2015, and many records have been added since. Last week @godfoder mentioned he was working on making a more recent version of iDigBio (Nov 2015) available to effechecka.

A quick glance at the iDigBio zip file you shared tells me that a bunch of specimens have been added in late 2015 (going by the field idigbio:dateModified). This suggests that the discrepancies you found are due to effechecka using an older version of iDigBio.
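A rough way to make that glance repeatable is to count the records whose idigbio:dateModified falls after the snapshot effechecka uses. The Python sketch below assumes the value starts with an ISO date (YYYY-MM-DD) and uses 2015-07-01 as a stand-in cutoff for "after June 2015". Both are assumptions, and the sample rows are made up rather than taken from the download.

```python
import csv
import io
from datetime import date


def count_modified_since(rows, cutoff, field="idigbio:dateModified"):
    """Count rows whose date-modified value falls on or after `cutoff`.

    Rows with a missing or unparseable value are skipped.
    Assumption: the value begins with an ISO date, YYYY-MM-DD.
    """
    newer = 0
    for row in rows:
        value = (row.get(field) or "").strip()
        try:
            modified = date.fromisoformat(value[:10])
        except ValueError:
            continue
        if modified >= cutoff:
            newer += 1
    return newer


# Made-up sample data for illustration only.
SAMPLE_CSV = (
    "dwc:scientificName,idigbio:dateModified\n"
    "taxon a,2015-03-02T10:00:00+00:00\n"
    "taxon b,2015-11-20T08:30:00+00:00\n"
    "taxon c,\n"
)

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
newer_count = count_modified_since(sample_rows, date(2015, 7, 1))
```

To run it against a real download, pass a `csv.DictReader` opened on the occurrence file from the zip in place of `sample_rows`. The file and column names in an actual archive may differ from the ones assumed here.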

Two questions:

  1. What do you think would be an appropriate mechanism to indicate the version of the archives being used in effechecka (or soon, freshdata)?
  2. How often would you expect the archives to be updated?

jhammock commented 8 years ago

How did I not think of that possibility? Okay, this raises a good question. For checklist purposes, I don't think speed is of the essence. For Fresh Data purposes, introducing speed for iDigBio data might be a false promise anyway, since collections institutions often take a significant amount of time to process and submit their data. There may be no need to strive to update every week, for instance, when data take several months just to get to iDigBio. The iDigBio team will have more accurate ideas about that, I expect.

Looking forward to the update anyway, @godfoder :) It looks like you all have been very busy since last summer.

jhammock commented 8 years ago

I think a brief datestamp annotation will suffice to indicate the version of each data source contributing to query results (e.g. iDigBio v. 06/15/2015, or something). We might play with where we put this. Offhand, I'm thinking in the text on the search page, and slipped into the downloads. In the header row of the CSV, after all the column labels? I'm not sure what JSON etiquette calls for, but perhaps something similar will do there? And appended at or near the end of the descriptive text of any services, like the auto collection description text for the EOL collection generator?
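To make the download idea concrete, here is a small Python sketch of what the annotation could look like. It is only an illustration, not effechecka's actual output format: the column names, the `sources` and `items` keys, and the helper functions are all hypothetical.

```python
import csv
import io
import json

# Hypothetical registry of source versions, using the datestamp format suggested above.
SOURCE_VERSIONS = {"iDigBio": "06/15/2015"}


def version_label(versions):
    """Build a short label such as 'iDigBio v. 06/15/2015'."""
    return "; ".join(f"{name} v. {stamp}" for name, stamp in sorted(versions.items()))


def to_csv(columns, rows, versions):
    """Write rows as CSV, appending the version label after the column labels."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(list(columns) + [version_label(versions)])
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(columns, rows, versions):
    """Wrap the rows in an object that carries the source versions alongside them."""
    payload = {
        "sources": versions,
        "items": [dict(zip(columns, row)) for row in rows],
    }
    return json.dumps(payload, indent=2)
```

One trade-off with the header-row approach is that the header ends up with one more field than the data rows, which some strict CSV readers may reject. A separate metadata file in the download would avoid that, at the cost of being easier to overlook.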

jhpoelen commented 8 years ago

@jhammock please confirm that this issue is out of scope of the prototyping effort.

jhammock commented 8 years ago

Correct, this is not a pending deliverable for currently active projects.

jhpoelen commented 7 years ago

closing - out of scope.