paleobiodb / data_service

The PBDB Data Service, API and table/system maintenance scripts
Artistic License 2.0

Gplates paleocoordinates are sometimes blank #11

Closed markuhen closed 7 years ago

markuhen commented 7 years ago

The Gplates data service sometimes fails to return paleocoordinates. This is likely because the service doesn't have paleocoordinates for those modern coordinates.

I think our data service should do the following instead of returning nothing:

  1. return the Scotese coordinates (appropriately labeled) instead of the Gplates coordinates
  2. if neither Gplates nor Scotese have any paleocoordinates, we should return something like NaN

aazaff commented 7 years ago

Hi @markuhen

@jczaplew and I did some test cases, and we confirmed that the data service is working correctly. The problem is that these points are not rotatable using GPlates.

As per your suggested fixes:

  1. I strongly advise against returning two different rotation models within the same field.

  2. The data service already returns an explicit NULL when a rotation fails. So if we want to flag those as NaN, NULL, or NA within the PBDB, then @mmcclenn should do that as part of his download script rather than us changing the data service.

jpjenk commented 7 years ago

@aazaff, the problem here is that for large spatial analyses, the GPlates issues will render otherwise valid data unusable. @markuhen showed me a large data table that he was assembling yesterday and, by his rough approximation, 5% of the paleo-coordinates could not be calculated with GPlates. He was in the process of laboriously filling these gaps with values approximated by Scotese. The only other alternative would be to eliminate these points, which is unacceptable because the actual data is good.

The question then becomes: is Scotese so wildly inaccurate that this would propagate errors, or is it good enough? To answer this properly, a comparison of points rotatable by both models should be made. I can do it at some point if it has not already been done. However, the sense is that this is not the case. GPlates is a higher resolution model, and if a particular plate fragment in paleo-time can't be identified to within a comfortable degree of accuracy for its authors, the model will return NULL. We believe, however, that Scotese will return an approximate location, if not the specific plate sliver, in these cases, which is good enough. Aspects of datasets are filled and/or interpolated all the time for consistency. In this case the plate model used would be identified in the field accompanying the estimated paleo-coordinates, so if these points happen to stand out suspiciously in whatever general analysis one is doing, geolocation error as a possible culprit is easy to explore.

Since the methodology is documented and the source identified within returned datasets, I see no problem with a tiered, two plate model approach to make sure there is always a paleo-coordinate associated with each occurrence.

@jpjenk

aazaff commented 7 years ago

@jpJenkins

  1. There is no reason for @markuhen to have to laboriously convert to Scotese. We have always offered the option to download Scotese rotations instead of gplates coordinates through the API, and I believe also through the download form. If that option is not working, then that is a separate bug.

  2. Mixing models within a single field is bad scientific practice, regardless of how similar the model outputs might be, especially when that fact is going to be buried in obscure documentation that nobody ever reads. Most people still think we return Scotese by default, for example. Honestly, it would be better for us to roll back to Scotese entirely than mix models within the same field. I know of no field of science where mixing like that is permissible. But, if users want to use both, they can always download the Scotese coordinates from PBDB and do a simple join.

  3. The issue between Scotese and GPlates is not accuracy/precision so much, though there are differences. The difference is that GPlates is a true model in the sense that it is algorithmically defined based on hypotheses about the fundamentals of plate movement. Scotese maps are manually curated products that are not attached to a reproducible model. This is why Shanan insisted on the initial shift over.

markuhen commented 7 years ago

Yes, you can download Scotese instead of Gplates. But, to have one data set, you have to mash them together to get the best of both worlds. Also, in a given download, you have to pick one or the other (using the downloader), so maybe we should just allow you to pick both, to make the mashing together easier.

Let's plan to chat about this next week when I am in town, and figure out the best solution to this issue.

dwbapst commented 7 years ago

Just a thought from the random peanut gallery following this repo, but I concur with Andrew: it's bad database policy to mix data sources in output, especially when filling in the gaps in one with the other could be done with about 2 lines of R code for the 'mashing together'. When in doubt with scientific software and databases, it is much better to force users to do trivial tasks themselves than not, or else you run a serious risk of people not understanding what they are doing. Allowing both to be obtained via the download form is a good middle ground solution.

Cheers, Dave B

-- David W. Bapst, PhD Adjunct Asst. Professor, Geology and Geol. Eng. South Dakota School of Mines and Technology 501 E. St. Joseph Rapid City, SD 57701

http://webpages.sdsmt.edu/~dbapst/ http://cran.r-project.org/web/packages/paleotree/index.html
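The gap filling discussed in this comment can be done on the user's side in a few lines. The following is a minimal Python sketch, not anything the PBDB ships. It assumes both sets of coordinates have already been downloaded into records with the hypothetical keys `gplates_paleolat`, `gplates_paleolng`, `scotese_paleolat`, and `scotese_paleolng`, with `None` standing in for a blank value. The `paleo_model` key it adds records which model each value came from, so the two sources are never mixed silently.

```python
from typing import Dict, List

# Models in order of preference; the key names built from them are hypothetical.
PREFERENCE = ("gplates", "scotese")


def fill_paleocoords(rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Pick the first model with both coordinates present and label the source."""
    filled = []
    for row in rows:
        out = dict(row)  # leave the caller's records untouched
        for model in PREFERENCE:
            lat = row.get(f"{model}_paleolat")
            lng = row.get(f"{model}_paleolng")
            if lat is not None and lng is not None:
                out.update(paleolat=lat, paleolng=lng, paleo_model=model)
                break
        else:
            # No model could rotate this point: keep the gap explicit.
            out.update(paleolat=None, paleolng=None, paleo_model=None)
        filled.append(out)
    return filled
```

Because the original per-model columns are kept alongside the merged ones, anyone who wants a single-model analysis can still filter on `paleo_model`.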

jczaplew commented 7 years ago

In my mind this is a bit of a philosophical argument over whether APIs are for data delivery or data delivery and analysis.

I know the PBDB API edges into the realm of analysis with routes that allow users to do things like generate data for diversity curves, but typically I believe APIs are primarily for fetching data. Analysis and processing are typically left to the user, or to an intermediate package that handles common functions (like velociraptr or the PBDB R package).

aazaff commented 7 years ago

@markuhen's idea of having a parameter option for returning multiple paleocoordinate models is a great one. We could actually split GPlates into the Seton and Wright models, and add more models as they come out. So there would be something like a wright_paleolat, seton_paleolat, and scotese_paleolat field for people interested in getting them all.

?show=paleoloc&allrotations=TRUE (or whatever)

mmcclenn commented 7 years ago

Actually, the data service does allow you to download both GPlates and Scotese paleocoordinates at the same time. I just didn't put that capability into the download form because I was trying not to complicate it more than necessary. If you just add the parameter "&pgm=gplates,scotese" to the download URL, you will get both sets of coordinates. Or, if the pgm parameter already appears in the URL, change its value to "gplates,scotese".

If it seems like people will want to do this, I can pretty easily add that option to the download form.

-- Michael
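For anyone scripting this, here is a minimal Python sketch of the URL change described above: add `pgm=gplates,scotese`, or replace the value if `pgm` is already present. Only the `pgm` parameter and its value come from this thread. The function name is made up for illustration, and the URL in the usage note is a placeholder rather than a real download link.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_paleo_models(url: str, models=("gplates", "scotese")) -> str:
    """Return `url` with its pgm parameter set to the given models."""
    parts = urlsplit(url)
    # Keep every existing parameter except pgm, preserving order.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "pgm"
    ]
    query.append(("pgm", ",".join(models)))
    # safe="," keeps the comma literal instead of encoding it as %2C.
    return urlunsplit(parts._replace(query=urlencode(query, safe=",")))
```

With a placeholder link such as `https://example.org/download.csv?show=paleoloc`, calling `with_paleo_models(...)` appends `&pgm=gplates,scotese`. If the link already carries `pgm=scotese`, that value is replaced rather than duplicated.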

markuhen commented 7 years ago

I just noticed that the GPlates results for Holocene collections are blank. We should decide what to do about this. Should we fill in the modern coordinates, even if they don't come from GPlates?

aazaff commented 7 years ago

That actually sounds like a bug with our data service. The Holocene should return modern coordinates. @jczaplew and I will look into it.

aazaff commented 7 years ago

The Holocene rotation issue has been filed. UW-Macrostrat/gplates-reconstruct#5.

aazaff commented 7 years ago

Hey @mmcclenn, as we discussed during @markuhen's visit, can you please close this issue once you've implemented the new API parameter for returning multiple paleocoordinate systems? I want to update velociraptr to use the new path once it's up and running.

mmcclenn commented 7 years ago

@aazaff that parameter is already part of the API. You can specify "pgm=model1,model2,..." where the available models are "gp_early", "gp_mid", "gp_late", "scotese". Also, "gplates" is a synonym for "gp_mid". This is already available on paleobiodb.org.
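A client such as a wrapper package might want to check model names before sending a request. The sketch below is one hedged way to do that in Python, using only the model names and the synonym listed in this comment. The function and constant names are made up for illustration, and the data service itself accepts "gplates" directly, so the synonym handling here is purely client-side tidying.

```python
# Model names and synonym as listed in this comment.
AVAILABLE_MODELS = ("gp_early", "gp_mid", "gp_late", "scotese")
SYNONYMS = {"gplates": "gp_mid"}


def pgm_value(models) -> str:
    """Build the comma-separated value for the pgm parameter."""
    resolved = []
    for name in models:
        canonical = SYNONYMS.get(name, name)
        if canonical not in AVAILABLE_MODELS:
            raise ValueError(f"unknown paleocoordinate model: {name!r}")
        if canonical not in resolved:  # drop duplicates such as gplates + gp_mid
            resolved.append(canonical)
    return ",".join(resolved)
```

Rejecting unknown names locally gives a clearer error than a failed or silently incomplete download.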

mmcclenn commented 7 years ago

In the new release due out next week, collections whose paleocoords are blank will have the following label in the "geoplate" field: "coordinates not computable using this model". The coordinate fields will still be blank because, as @aazaff noted, putting text into a field where numbers are expected is a bad idea. The "geoplate" field should be interpreted as unstructured text.

aazaff commented 7 years ago

I am no longer sure that changing the geoplate field is the optimal solution. It is technically a breaking change. I was also under the impression that we had agreed on an alternative solution when Mark last visited. I would like us to discuss this again at our next meeting.

mmcclenn commented 7 years ago

The behavior of the data service is now to report "cannot be computed under this model" in the "geoplate" field when the coordinates are blank. The message is reported there instead of in the coordinate fields because those ought to be either empty or a valid coordinate, whereas "geoplate" is an unstructured text string.
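On the consuming side, this means the coordinate fields can be parsed as numbers-or-blank while "geoplate" is carried through as plain text. Here is a minimal Python sketch of that reading, assuming each record arrives as three strings. The argument names `paleolng` and `paleolat` are assumptions for illustration; only the "geoplate" field name and the message text come from this thread.

```python
from typing import Optional, Tuple


def parse_paleocoord(
    paleolng: str, paleolat: str, geoplate: str
) -> Tuple[Optional[Tuple[float, float]], str]:
    """Return ((lng, lat) or None, geoplate text) for one record."""
    note = geoplate.strip()  # unstructured text: never parsed as a number
    if paleolng.strip() == "" or paleolat.strip() == "":
        # Blank coordinates: the explanation, if any, is in the geoplate text.
        return None, note
    return (float(paleolng), float(paleolat)), note
```

A caller can then treat `None` as "not computable under this model" and show the accompanying text to the user, without ever trying to interpret it.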