Addition of a COLLECTION node

aazaff commented 2 years ago

We have decided that we need to add a new type of node called a COLLECTION.

The purposes of this are twofold:

Will help with organising data and filtering searches for users.
Will maintain consistency with the Paleobiology Database.

Current Design

All COLLECTIONS must have at least one REFERENCE. This is to maintain consistency with Paleobiology Database design.
An empty COLLECTION cannot be made public.
A COLLECTION cannot be deleted if it has child SPECIMEN nodes. The SPECIMEN nodes would have to be deleted first.
A COLLECTION is tied to SPECIMEN nodes, NOT to DESCRIPTION nodes of either OTU or SPECIMEN types.
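
The first three rules above could be enforced along the following lines. This is only a minimal sketch in Python, not the code in the api repository; the names `Collection`, `can_publish`, and `can_delete` are hypothetical, and "empty" is assumed to mean "has no child SPECIMEN nodes".

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    """Hypothetical in-memory stand-in for a COLLECTION node."""
    name: str
    reference_ids: list[str]
    specimen_ids: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Rule 1: every COLLECTION needs at least one REFERENCE.
        if not self.reference_ids:
            raise ValueError("a COLLECTION must have at least one REFERENCE")


def can_publish(collection: Collection) -> bool:
    # Rule 2: an empty COLLECTION (assumed: no SPECIMEN children) stays private.
    return len(collection.specimen_ids) > 0


def can_delete(collection: Collection) -> bool:
    # Rule 3: child SPECIMEN nodes have to be deleted first.
    return len(collection.specimen_ids) == 0
```

In this sketch a freshly created COLLECTION can be deleted but not published, and once a SPECIMEN is attached the reverse holds.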

Outstanding issues

The term COLLECTIONS is used in many different ways in different databases. We have stuck with COLLECTION for consistency with the Paleobiology Database and because other alternatives (e.g., SAMPLE) are also potentially confusing. Are we content with this choice?
A collection in the Paleobiology Database is primarily a group of what we would call OTUs with secondary support for specimens, but a collection in PBot is primarily a group of specimens with secondary support for OTUs. Is this mismatch a problem? (I realised I was being stupid here and this is not a problem because we DO require Linnaean terms of some kind with specimens, which is equivalent to PBDB occurrence.)
What properties will be required/encouraged/optional as part of COLLECTION node?

aazaff commented 2 years ago

As a tentative design, I have made the following preliminary list of Passive, Required, Encouraged, Inappropriate, and Optional fields for COLLECTIONS.

Passive

pbotID

Required

Stratigraphic Unit -- The most highly resolved stratigraphic unit - e.g., if your section has a known Group, Formation, Member, Submember, and Bed, you would put the Bed name here.
Latitude -- in WGS84 Decimal Degrees
Longitude -- in WGS84 Decimal Degrees
Name -- a free text name (Required for PBDB)
Early Interval -- The earliest geologic interval
Late Interval -- The latest geologic interval
Protected -- Boolean, is this considered a legally protected site
Paleoenvironment -- The paleoenvironment interpretation (indeterminate allowed).

Encouraged

Paleobiology Database Link -- Link to an equivalent PBDB collection
Numeric Max Age -- Age in Myrs if associated with specific geochron measurement
Numeric Min Age -- Age in Myrs if associated with specific geochron measurement
Group -- Geologic Group
Formation -- Geologic Formation
Member -- Geologic Member
Bed -- Geologic Bed
Geography Comments -- Notes on how to find the location
Stratigraphy Comments -- Notes on the stratigraphic nomenclature used
Geologic Comments -- Notes on the lithological context of the section
Preservation Comments -- Notes on the taphonomic situation
Collectors -- Names of actual field collectors

Inappropriate

collection_no -- would be assigned by pbdb, not our issue
record_type -- is automatically a collection, this is a stupid field
collection_subset -- although we could support this, it is so rarely used in pbdb I don't see it as worth it
collection_aka -- no need for this
n_occs -- implicit
ref_author -- implicit in references node
ref_pubyr -- implicit in references node
reference_no -- implicit in references node
cc -- country code, implicit in lat long
state -- implicit in lat long
county -- implicit in lat long
paleomodel -- handled by pbdb
paleolng -- handled by pbdb
paleolat -- handled by pbdb
geoplate -- handled by gplates
taxonomy comments -- redundant with rest of pbot

Optional

Everything else tracked by pbdb but not listed above.
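
To make the categories above concrete, here is a rough Python sketch of how a submitted COLLECTION could be checked against them. Nothing here is implemented: the snake_case property names and the function `check_collection` are made up for illustration, and the inappropriate list is copied from this tentative design. The only outside fact used is that WGS84 decimal degrees run from -90 to 90 for latitude and -180 to 180 for longitude.

```python
# Hypothetical property names for the Required fields listed above.
REQUIRED = {
    "stratigraphic_unit", "latitude", "longitude", "name",
    "early_interval", "late_interval", "protected", "paleoenvironment",
}

# PBDB fields listed above as Inappropriate (taxonomy_comments spelling assumed).
INAPPROPRIATE = {
    "collection_no", "record_type", "collection_subset", "collection_aka",
    "n_occs", "ref_author", "ref_pubyr", "reference_no", "cc", "state",
    "county", "paleomodel", "paleolng", "paleolat", "geoplate",
    "taxonomy_comments",
}


def _out_of_range(value, limit: float) -> bool:
    # Non-numeric values are treated as out of range.
    return not isinstance(value, (int, float)) or abs(value) > limit


def check_collection(props: dict) -> list[str]:
    """Return a list of human-readable problems; empty means acceptable."""
    supplied = set(props)
    problems = [f"missing required field: {k}" for k in sorted(REQUIRED - supplied)]
    problems += [f"field not accepted: {k}" for k in sorted(INAPPROPRIATE & supplied)]
    if "latitude" in props and _out_of_range(props["latitude"], 90):
        problems.append("latitude must be between -90 and 90 decimal degrees")
    if "longitude" in props and _out_of_range(props["longitude"], 180):
        problems.append("longitude must be between -180 and 180 decimal degrees")
    return problems
```

Encouraged and Optional fields are simply passed through in this sketch; an entry form could list missing Encouraged fields as warnings instead of errors.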

NoisyFlowers commented 2 years ago

The Current Design described above is in api commits ef0a0936d206d41000df98aabdada7c3190ae38a through 33e5f2d2bf68fdbede9e3af298127828b4337831, and client commits 8982749adf9c86e6775e57821ad3037bca1be56b through 612d9db87a7e024df335681fe3eb7b8faaac7ec0.

I'm leaving the properties at pbotID and name for now.

doricon commented 2 years ago

I think collection_subset could be useful. Taking multiple subsamples (of even a single bed) at a single field site is not an uncommon collection strategy for paleobotany.

For example, I collected an in-situ flora from a laterally continuous tuff layer, and took 26 subsamples of the flora across the exposure. I report in a publication the entire flora all together because it was a living community - but knowing the subsampling is helpful for spatial analyses and various diversity/heterogeneity metrics. So in PBot/PBDB, I would prefer to enter the flora as a collection, and make record of the subset.

Everything else in your list looks good to me!

doricon commented 2 years ago

"...we DO require Linnaean terms of some kind with specimens, which is equivalent to PBDB occurrence"

About requiring Linnaean terms for specimens - do you mean just at some level (e.g., Plantae, or some higher order clade name)? We definitely don't want to require family, genus, or species, right?

aazaff commented 2 years ago

Yes... Linnaean at SOME level, not necessarily genus species.

ecurrano commented 2 years ago

Did we talk previously about having a "Reason for Collection" field? Might be important to know if something was a quantitative census vs. taxonomic collection vs. biostratigraphic investigation (and other categories, too).

doricon commented 2 years ago

@ecurrano I didn't actually know how to interpret "reason for collection"! I would like having the investigation method/purpose info you listed here. We would need to provide clear instructions on what to put, and what not to put, in that field! I could see someone typing their whole study rationale in that field.

clairecleveland commented 2 years ago

Less specifically than the last couple of comments: it sounds like a collection needs the flexibility to mean a myriad of things if it is to stay in line with PBDB, so that our term collection would port to PBDB. However, within PBot, the highest level of collection could be broken down into sub collections, and how that is done would depend on the enterer (we would provide some training set of guidelines).

It does raise the concern of what happens when a new collection is added at the highest level but later turns out to be part of an even higher level collection. But something tells me PBDB has already had experience with this and there is a solution or workaround?

ecurrano commented 2 years ago

Ellen, Claire, and Rebecca discussed this more on Wednesday. Here is a summary of our discussion, and we look forward to feedback!

Collection node = the fossils collected from a particular hole at a particular point in time by a particular team or individual (inherent in this is that different teams might have different reasons for collection) => we will encourage everyone to enter their data at this resolution.
Super Collection node = amalgamation of all the collections taken from a specific locality (e.g., Colwell Creek Pond, Big Cedar Ridge)
The use of collection & super collection vs sub-collection & collection should match PBDB. We are unsure whether that means PBDB gets the “Collection node” or the “Super Collection node.”
Anything above the level of “Super Collection” will be achieved through querying, e.g., formation-level or specific geographic region.
Allow duplication of nodes so one can copy a node and adjust it minimally. Command-D (duplicate) is our friend!
Req./Rec./etc. for collection properties: pause until we meet with the full team and confirm collection terminology

ecurrano commented 2 years ago

Also, replying to Dori's comment above. Yes! Maybe we can have a drop-down menu for investigations method/purpose and clear explanation of each category we put in that drop-down menu.

aazaff commented 2 years ago

Great, here are some preliminary responses.

Collection node = the fossils collected from a particular ~~hole~~ sample at a particular point in time by a particular team or individual (inherent in this is that different teams might have different reasons for collection) => we will ~~encourage~~ everyone to enter their data at this resolution.

My main concern is with the use of the word encourage above. This implies that it will still be possible for a user to enter data not at this resolution. How do you expect that to work from a technical standpoint?
Conversely, if we do not support alternative resolutions, what do users with data at those other scales do?

Super Collection node = amalgamation of all the collections taken from a specific locality (e.g., Colwell Creek Pond, Big Cedar Ridge)

Is there any benefit to calling it a super collection as opposed to a locality?
How does this address the issue of different teams/projects that collect from the same locality? Would each project have the same or different "super collection" nodes?
How will you define the size of a locality? For example, for my M.S. thesis I had three roadcuts that I broke out as three localities, but they were easily walking distance from each other and could have reasonably been bundled up as one.
It should be understood there can be no validation of any of this on our side because there's no way for us to know what the sizes are or would be. So this will be entirely honor system.

doricon commented 2 years ago

The definition of a collection node, as currently written, seems very specific to a certain collecting style and material. Namely, it makes perfect sense for the discrete quarry-style sampling of compression/impression fossils (currently used by all of us and others in more quantitative paleobotany circles!). I am not sure if that definition is inclusive enough of the range of plant materials/modes of preservation/collection styles for the entire community.

For example, nodules containing plants are not collected from holes or quarries, but often surface collected - what constitutes a sample is, by necessity, going to be determined by the researcher based on the realities of the site and their study aims. There are even many situations where plant compressions/impressions are not collected from distinct holes that are separately recorded - for example, when I did some field work with Nacho and Ruben in Argentina, they just collected all the plants found at the site/locality by many people working on different spots on the hillsides; this is common for groups that are sampling for taxonomic diversity and systematics and not heavily into quantitative paleoecology/paleobotany. There are also many collections of specimens obtained from float. I am also not sure how palynology would work with this description - for example, should a core be divided into time/measured increments as collections? And then there are the historical collections, of course, which are generally not (or only rarely) recorded as holes or quarries and were likely just collected from large areas. There are more examples that I am not touching on, like collections from coal balls, from lignite mines, and so on. All of this to basically say that the definition of a collection has to be accommodating to the many various types of preservation/specimens, researcher aims, and messy realities of collecting. I think that means ultimately having to rely on researchers to determine the smallest, meaningful, and realistic partition of their specimens/data, beyond the basic guidelines of being collected by a particular individual/group at a particular time (this speaks a bit to Andrew's question #3).

A possible (but incomplete and needs work) modification: Collection node = the fossils collected as a sample at a particular place and point in time by a team or individual. The sample should represent the smallest reasonable division of your data (e.g., a single quarry sample, a surface collection of nodules from a single site, ..... [need other examples, but you get the gist!]).

I like the concept of a "super collection" node for internal PBot use! I am guessing that the smaller-partitioned "collection" concept is what would pipe to/track better with PBDB.

aazaff commented 2 years ago

Dori articulated my thoughts so much more clearly. Yes, there are so many different methods of collection out there. We really want to err on the side of inclusivity.

doricon commented 2 years ago

Relevant notes on this topic from meeting with Mark:

"Collections" are the biggest sticky issue in PBDB
-Mark's view is a collection is "temporally-ecologically bound" [note: I like this wording of the concept]
-People should conceptualize them as small as they can

-PBDB does not differentiate collection date - so why make a new collection for different dates? Mark's perspective is that there is no point. But all this is dependent on things like, is it the same spot?? Different collectors can make a difference. [note: my opinion would be that different collection methods should warrant making new collections]

-PBDB is very flexible, which is a blessing and a curse

-"you can never go wrong making more collections, only by making too few"
-"when in doubt, part it out!!"

Andrew: think about the goal of the database. If it is a constrained scientific goal/aim, then what a collection is should be constrained; if the use of the database is broad/unconstrained, then the definition should be unconstrained.

Regarding specimens and/or names not clearly in a collection, or names without attached specimens: Mark's rule of thumb when entering data (for global scale analysis) is that if he can get down to a Cenozoic epoch and a county, he will enter it.
-the data can be in the database at the level that it exists (sometimes crappy), but depending on your analysis and query, it gets filtered out when not relevant!

-for PBot, for filtering based on collection methods, see the collections entry form, last tab, called "collecting info". Our system could say that you must answer these questions! PBDB can add checkboxes or whatnot for us

-From Andrew about nesting collections: we provide in our graph system the ability to have an unlimited hierarchy of collections, and the most finely parsed goes to PBDB. Mark full-on agrees here.
-Use Darwin Core for terms! There is a conversion table for PBDB to Darwin Core, Mark will try to find it

aazaff commented 2 years ago

We will allow unbounded nesting of collections of collections (as we do with states and characters). The only question is whether we separate out the concept of a Darwin Core collection_location as its own node type or just keep it as a collection type but specify it with a property - like how we distinguish between OTU and Specimen descriptions. I am leaning towards the latter, but we have not decided on this yet.
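
As a rough illustration of the nesting idea, and of the earlier note that the most finely divided collections are what would go to PBDB, here is a small Python sketch. It is not the graph schema: `CollectionNode`, its `kind` property, and `finest_collections` are hypothetical names, and `kind` stands in for the property-based option I am leaning towards.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class CollectionNode:
    """Hypothetical collection that may contain other collections."""
    name: str
    # Assumed property distinguishing a plain collection from a location-style one.
    kind: str = "collection"
    children: list["CollectionNode"] = field(default_factory=list)


def finest_collections(node: CollectionNode) -> Iterator[CollectionNode]:
    """Yield the leaf collections under `node`, depth first, at any nesting depth."""
    if not node.children:
        yield node
        return
    for child in node.children:
        yield from finest_collections(child)


# Example: a field site holding a subsampled tuff flora and a float collection.
tuff = CollectionNode("Tuff flora", children=[
    CollectionNode("Subsample 1"),
    CollectionNode("Subsample 2"),
])
site = CollectionNode("Field site", kind="location", children=[
    tuff,
    CollectionNode("Float"),
])
leaves = [c.name for c in finest_collections(site)]
```

Here `leaves` holds the three finest-grained collections ("Subsample 1", "Subsample 2", "Float"), which are the ones that would be candidates for export, while the parents exist only inside PBot.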

paleobot / pbot-dev