paleobiodb / data_service

The PBDB Data Service, API and table/system maintenance scripts
Artistic License 2.0

discrepancy between returned early and late intervals if calling up occurrences vs taxon names #31

Open meljh opened 6 years ago

meljh commented 6 years ago

I have been using the PaleoDB to compile stratigraphic ranges for taxa and have discovered that the early and late intervals returned for a taxon (when downloading taxonomic names) can span a different, longer amount of time than is implied by the ages of the collections that taxon is listed in (when downloading occurrences). Here is an example:

http://www.paleobiodb.org/data1.2/occs/list.txt?base_name=Acidaspidina%20plana&show=class,time&idqual=certain

will return 4 records, all occurrences assigned to the Maduan, currently with max_ma of 501 and min_ma of 498.5 in the database.

In comparison: http://www.paleobiodb.org/data1.2/taxa/list.txt?base_name=Acidaspidina%20plana&show=class,parent,app&rel=current

will return a record for the taxon with the expected max_ma (501) and min_ma (498.5), but with the early and late intervals given as Drumian and Guzhangian, respectively. This is presumably because the Drumian spans 504.5 to 500.5 Ma and the Guzhangian 500.5 to 497.0 Ma, so together they bracket the max and min ages from the occurrences.

But if I wanted to apply an updated/different age model to the returned early and late intervals, this would result in a longer (essentially less precise) stratigraphic range for this taxon than is known from the occurrences. In this case the range would also be inaccurate: the Maduan is currently placed within the Paibian, so this taxon is actually younger than the Guzhangian. (The age assignments in the PBDB for this regional stage are out of date, which is no surprise for the Cambrian, but that only compounds the problem, and it would be impossible for someone downloading ranges via taxonomic names to correct.)
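To make the size of the discrepancy concrete, here is a minimal Python sketch that derives a range from interval names alone. The interval boundaries are the ones quoted in this thread, not an authoritative timescale, and `AGE_MODEL`, `range_from_intervals`, and `duration` are hypothetical names introduced only for illustration.

```python
# Interval boundaries in Ma, copied from the values quoted in this thread.
# They are illustrative and may not match the current PBDB or ICS timescale.
AGE_MODEL = {
    "Maduan": (501.0, 498.5),
    "Drumian": (504.5, 500.5),
    "Guzhangian": (500.5, 497.0),
}


def range_from_intervals(early, late, age_model):
    """Return (max_ma, min_ma) implied by an early and a late interval name."""
    early_max, _ = age_model[early]   # older bound of the early interval
    _, late_min = age_model[late]     # younger bound of the late interval
    return early_max, late_min


def duration(age_range):
    """Length of a (max_ma, min_ma) range in millions of years."""
    max_ma, min_ma = age_range
    return max_ma - min_ma


# Range implied by the intervals on the occurrences (all Maduan).
occ_range = range_from_intervals("Maduan", "Maduan", AGE_MODEL)
# Range implied by the intervals reported on the taxon record.
taxon_range = range_from_intervals("Drumian", "Guzhangian", AGE_MODEL)

print(occ_range, duration(occ_range))      # (501.0, 498.5) 2.5
print(taxon_range, duration(taxon_range))  # (504.5, 497.0) 7.5
```

Swapping in a different `age_model` dictionary shows the effect described above: re-dating Drumian and Guzhangian always yields the full span of both stages, however narrow the original occurrence interval was.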

mmcclenn commented 6 years ago

Yes, you have highlighted an important point, which perhaps I should better explain in the documentation. The first and last occurrences reported by the taxa/list operation are expressed according to the international chronostratigraphic timescale, rather than the time intervals that were originally entered.
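One way such a conversion could work is sketched below in Python. This is a guess at the behaviour, under the assumption that the early interval is the international stage containing max_ma and the late interval is the one containing min_ma; it is not the data service's actual code. `INTERNATIONAL` and `covering_intervals` are hypothetical names, and the stage boundaries are those quoted earlier in this thread.

```python
# International stages as (name, older_bound_ma, younger_bound_ma),
# using only the boundaries quoted in this thread.
INTERNATIONAL = [
    ("Drumian", 504.5, 500.5),
    ("Guzhangian", 500.5, 497.0),
]


def covering_intervals(max_ma, min_ma, scale):
    """Return (early, late) interval names that bracket an age range.

    Assumed convention: a max_ma lying exactly on a boundary belongs to the
    younger interval, and a min_ma lying exactly on a boundary belongs to the
    older one, so a range never spills into a stage it only touches.
    """
    early = late = None
    for name, older, younger in scale:
        if older >= max_ma > younger:
            early = name
        if older > min_ma >= younger:
            late = name
    return early, late


# The Maduan ages from the example above (501 to 498.5 Ma).
print(covering_intervals(501.0, 498.5, INTERNATIONAL))  # ('Drumian', 'Guzhangian')
```

Under this assumption the Maduan range straddles the 500.5 Ma boundary, which is why two international stages are reported for a taxon known from a single regional stage.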

This was a deliberate choice, so that this information would be presented for all taxa using a single consistent timescale. In general, if you need exact information about occurrences in the PaleoDB, it is always better to query for them directly as you have done and then analyze the resulting dataset yourself.
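As a rough illustration of analysing the occurrences yourself, the Python sketch below computes a per-taxon range from an occurrence download. It assumes the CSV output has `accepted_name`, `max_ma`, and `min_ma` columns; check the header of your own download. The query parameters are the ones used in the example URL above, the function names are hypothetical, and `SAMPLE` is a hand-written stand-in that mirrors the four Maduan records described in this thread, not real output.

```python
import csv
import io
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://www.paleobiodb.org/data1.2/occs/list.txt"


def fetch_occurrences_csv(base_name):
    """Download occurrences as CSV text (needs network access)."""
    query = urlencode(
        {"base_name": base_name, "show": "class,time", "idqual": "certain"},
        quote_via=quote,  # encode spaces as %20, as in the URL above
        safe=",",         # keep the commas in the show= list readable
    )
    with urlopen(f"{BASE_URL}?{query}") as response:
        return response.read().decode("utf-8")


def range_from_occurrences(csv_text):
    """Return {accepted_name: (oldest max_ma, youngest min_ma)}."""
    ranges = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["accepted_name"]  # assumed column name
        max_ma = float(row["max_ma"])
        min_ma = float(row["min_ma"])
        oldest, youngest = ranges.get(name, (max_ma, min_ma))
        ranges[name] = (max(oldest, max_ma), min(youngest, min_ma))
    return ranges


# Illustrative stand-in for a download: four occurrences, all Maduan.
SAMPLE = """accepted_name,early_interval,late_interval,max_ma,min_ma
Acidaspidina plana,Maduan,,501,498.5
Acidaspidina plana,Maduan,,501,498.5
Acidaspidina plana,Maduan,,501,498.5
Acidaspidina plana,Maduan,,501,498.5
"""

print(range_from_occurrences(SAMPLE))  # {'Acidaspidina plana': (501.0, 498.5)}
```

For live data, pass the result of `fetch_occurrences_csv("Acidaspidina plana")` to `range_from_occurrences` in place of `SAMPLE`. Keeping the original `early_interval` and `late_interval` columns alongside the ages lets you re-date each occurrence with your own age model before taking the range.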

meljh commented 6 years ago

I wonder if it would be worth starting a FAQ for things like this that might demand more documentation than the explanatory text currently online for different input/output parameters? This way examples could be included as well. The one above would be something like "When downloading lists of taxon names, why is there sometimes a discrepancy between the early and late intervals returned for a taxon and the absolute ages returned for the same taxon?"

dwbapst commented 5 years ago

Why are first/last appearance times for taxa treated differently from occurrences?

mmcclenn commented 5 years ago

The reason for this is that the taxon record records whatever is entered as the first/last appearance by the person who entered it, presumably according to the current literature. This is independent of the recorded occurrences of the taxon in the database.

Admittedly, that is somewhat problematic, as the dates recorded when the taxon is entered will never change unless somebody explicitly goes back and edits the taxon. I think this is a relic from the early years of the database, and could probably be dispensed with.

-- Michael


dwbapst commented 5 years ago

@mmcclenn:

> The reason for this is that the taxon record records whatever is entered as the first/last appearance by the person who entered it, presumably according to the current literature. This is independent of the recorded occurrences of the taxon in the database.

🤔

but wait, you said

> The first and last occurrences reported by the taxa/list operation are expressed according to the international chronostratigraphic timescale, rather than the time intervals that were originally entered

So, let me see if I understand:

  1. Occurrences are dated according to the interval listed on the occurrence, as entered by an enterer, possibly later revised, etc. These original intervals are reported by the API. The dates for those intervals are those interval dates according to the current time-scale used by the PBDB.

  2. When we call a taxon, it uses the first and last intervals as listed on that taxon, usually those entered by the person who entered that taxon. (Without reference to updated collections/occurrence data??) Or are the dates themselves taken from what the original enterers have added in?

  3. Furthermore, those ages are then assigned to... other intervals, as in @meljh's example? And that's so all the intervals returned for first/last interval by taxa/list are on the international scale (presumably the Maduan isn't part of the international scale - I don't know that, I don't work in the Cambrian...). So the PBDB tries to return intervals that best comprise the dates listed for the intervals originally listed for that taxon.

Am I missing something? Or did you mean to say the age/interval information for collections/occurrences is as most recently entered, and so the data reported for occurrences/collections is closer to the data-as-is?
