ropensci / pangaear

R client for the Pangaea database
https://docs.ropensci.org/pangaear

pg_search: searching for datasets that cross 180/-180 #71

Closed sckott closed 4 years ago

sckott commented 5 years ago

Someone raised an issue that some datasets are hard to get in results, e.g., https://doi.pangaea.de/10.1594/PANGAEA.898389

I think it's because they cross 180/-180, but not sure. e.g.,

pg_search(query = "pollen", bbox = c(51.8, 42.3, -171.7, 74.6))

with that bbox, it should find the dataset above, but does not.

If you just remove the bbox search it does find the dataset

pg_search(query = "standardized fossil pollen data from Siberia")

karawoo commented 5 years ago

I can reproduce this but am not sure beyond that. Crossing 180/-180 seems likely to be the issue. I'll note that searching for "pollen" on https://www.pangaea.de/ and setting the bounding box there also prevents this record from being returned. So I think this might be an API issue, not a pangaear issue.

kbh022 commented 5 years ago

Reversing the order of the longitudes and setting count = 60 finds it:

pg_search(query = "pollen", bbox = c(-171.700000, 42.320000, 51.840000, 74.550000), count = 60)

But then most of the datasets returned along with it are from the west. Otherwise, even a series of queries ("sediment pollen fossil", "sediment pollen", "fossil pollen", and "pollen"), each with count = 500 and offsets 0, 500, ... 2500, cannot find the dataset. So it seems to be a 180/-180 issue.

sckott commented 5 years ago

Thanks @karawoo. I've also tried searching with longitudes in a 0-360 range to see if that works, but it doesn't. It's not clear whether the bbox in the "Coverage" section of a dataset has to be completely encompassed by the search bbox or not. I imagine it does have to be, since this search doesn't find that dataset?

sckott commented 5 years ago

@kbh022 that bbox c(-171.700000, 42.320000, 51.840000, 74.550000) does seem more correct, since it should be minlon, minlat, maxlon, maxlat.
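To make the (minlon, minlat, maxlon, maxlat) ordering concrete, here is a minimal sketch in Python (not R, and not part of pangaear). It assumes the convention used by, for example, GeoJSON (RFC 7946), where a box whose minlon is greater than its maxlon is read as crossing the 180/-180 line. Whether PANGAEA reads such a box this way is not confirmed in this thread, and the function names `crosses_dateline` and `split_at_dateline` are hypothetical.

```python
# Assumption: bbox = (minlon, minlat, maxlon, maxlat), longitudes in [-180, 180].
# Assumption: minlon > maxlon means the box wraps across the 180/-180 line.

def crosses_dateline(bbox):
    """Return True if the box wraps across the 180/-180 line."""
    minlon, _, maxlon, _ = bbox
    return minlon > maxlon


def split_at_dateline(bbox):
    """Split a wrapping box into two ordinary boxes; otherwise return it as-is."""
    minlon, minlat, maxlon, maxlat = bbox
    if not crosses_dateline(bbox):
        return [bbox]
    east_part = (minlon, minlat, 180.0, maxlat)   # from minlon up to +180
    west_part = (-180.0, minlat, maxlon, maxlat)  # from -180 up to maxlon
    return [east_part, west_part]


# The two boxes tried in this thread.
first_box = (51.8, 42.3, -171.7, 74.6)
reversed_box = (-171.7, 42.32, 51.84, 74.55)

for box in (first_box, reversed_box):
    print(box, "crosses:", crosses_dateline(box), "parts:", split_at_dateline(box))
```

Under that assumption the first box wraps across the date line, while the reversed box is an ordinary box running eastward from -171.7 to 51.84.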

uschindler commented 5 years ago

Hi, this looks like an issue on PANGAEA's side. It happens for datasets whose bounding box crosses the date line. Our code handles queries that include the date line correctly, but the combination is broken.

I will work on a fix.

The SOAP API of PANGAEA allows 3 types of searches: intersection, fully included, and mean only. The web site only offers the first variant, so the bounding boxes of dataset and query need to overlap for a match. Results are ranked by the distance between the center of the search box and the mean point of the dataset, so datasets that overlap more with the search box get a higher score. This is why you see a different order if you invert the box. It then matches (because of this bug), but the score gets very low (as the dataset is far outside the inverted box).
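The intersection matching and center-distance ranking described here can be sketched as follows. This is an illustration in Python under stated assumptions, not PANGAEA's implementation: boxes are (minlon, minlat, maxlon, maxlat) with minlon > maxlon meaning the box crosses the date line, the distance is a haversine great-circle distance, and `dataset_box` and `dataset_mean` are made-up values rather than the coverage of any real dataset. All function names are hypothetical.

```python
import math


def lon_intervals(minlon, maxlon):
    """Longitude range as one interval, or two if it wraps across 180/-180."""
    if minlon <= maxlon:
        return [(minlon, maxlon)]
    return [(minlon, 180.0), (-180.0, maxlon)]


def boxes_intersect(a, b):
    """True if two (minlon, minlat, maxlon, maxlat) boxes overlap."""
    a_minlon, a_minlat, a_maxlon, a_maxlat = a
    b_minlon, b_minlat, b_maxlon, b_maxlat = b
    if a_maxlat < b_minlat or b_maxlat < a_minlat:
        return False  # no latitude overlap
    return any(
        lo1 <= hi2 and lo2 <= hi1
        for lo1, hi1 in lon_intervals(a_minlon, a_maxlon)
        for lo2, hi2 in lon_intervals(b_minlon, b_maxlon)
    )


def box_center(bbox):
    """Center (lon, lat) of a box, measuring width eastward from minlon."""
    minlon, minlat, maxlon, maxlat = bbox
    width = maxlon - minlon if minlon <= maxlon else maxlon - minlon + 360.0
    lon = minlon + width / 2.0
    if lon > 180.0:
        lon -= 360.0
    return lon, (minlat + maxlat) / 2.0


def great_circle_km(p, q):
    """Haversine distance in km between two (lon, lat) points (assumed metric)."""
    lon1, lat1 = (math.radians(v) for v in p)
    lon2, lat2 = (math.radians(v) for v in q)
    h = (math.sin((lat2 - lat1) / 2.0) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * 6371.0 * math.asin(math.sqrt(h))


# Hypothetical dataset coverage that crosses the date line, and its mean point.
dataset_box = (100.0, 50.0, -175.0, 72.0)
dataset_mean = (130.0, 62.0)

first_query = (51.8, 42.3, -171.7, 74.6)         # wraps across 180/-180
reversed_query = (-171.7, 42.32, 51.84, 74.55)   # ordinary box

for query in (first_query, reversed_query):
    hit = boxes_intersect(query, dataset_box)
    dist = great_circle_km(box_center(query), dataset_mean)
    print(query, "intersects:", hit, "center distance (km):", round(dist))
```

With correct wrap handling, only the first query overlaps the hypothetical dataset box, and its center is much closer to the dataset's mean point than the center of the reversed box is, which is consistent with the low score reported for the inverted box.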

sckott commented 5 years ago

Thanks for this information @uschindler and for working on a fix. I'll add to the package documentation how the bounding box search is done (w/ intersection)

While you're here, I'm curious whether the Data Warehouse downloads are available programmatically, or only in the web interface?

uschindler commented 5 years ago

The fix does not seem easy. It affects all datasets which cross the date line.

On your second question: the data warehouse is only available to logged-in users, so you need a login token. But this will be available soon: users will be able to create an API token (like on GitHub) that can be used to download datasets on behalf of a user. This makes it possible to create and share scripts, such as a pangaear or pangaeapy script, without including a username and password.

The API for the data warehouse is included in our SOAP API; it's not available via REST yet.

sckott commented 5 years ago

Okay, will look out for the warehouse token update

uschindler commented 5 years ago

Just an update: we can't easily fix the dateline issue at the moment, as it is caused by a bug in the underlying Elasticsearch engine that is not yet fixed: https://github.com/elastic/elasticsearch/issues/22564

We may switch to polygons, but that would slow things down.

sckott commented 5 years ago

Thanks for the update.

uschindler commented 5 years ago

Hi, the issue has been fixed in PANGAEA's API. Searching for datasets with bounding boxes crossing the date line is now fully supported. Precision for search is 5km or 2.5% of the size of the shape (if large).

sckott commented 5 years ago

great, thanks very much!

sckott commented 4 years ago

working now, closing

kbh022 commented 4 years ago

Great! Thanks a lot!

Kuber
