request for replicate term (BODCNVS-1790)

kmexter commented 1 year ago

Request for a term to cover one or both of the following

fieldReplicate: being a replicate sampling device that is used for sampling replicate (i.e. close-by) areas of space and time. I am requesting this for use with Autonomous Reef Monitoring Structures (ARMS) units, which are settlement plates that are used on the hard bottom of marine environments, but it could be applicable to any type of sampling device, where more than one is placed in the environment in order to be field replicate collecting devices.
sampleReplicate: mainly to distinguish from the field replicate, this would be replicates taken from one physical sample (in our case, from one ARMS unit) created from one collecting device.

danibodc commented 1 year ago

Hi @kmexter

Thanks for your request. What kind of parameters are you looking to define with these replicate terms, i.e. what sort of measurements do they relate to?

kmexter commented 1 year ago

Hi. In principle I would prefer it to be as broad a definition as possible, so that others can use it. The data we have that require these terms are counts (and DNA) from physical samples (organisms) that were collected via sampling devices in the field. In our case it is for marine "autonomous reef monitoring systems" but this can apply to any collection device that is used to collect organisms.

kmexter commented 1 year ago

Field replicates would then be when collecting devices that are otherwise identical are placed near enough to each other that they sample the organisms in a replicate way. Sample replicates would be for the situation where the collected organisms are split into units and separately handled - in our case, we scrape the plates that the organisms are sitting on, blend them into mush, filter etc into several different units, and these units then serve as replicates of the collected sample material. So field replicates are replicates of the collecting device, you could say, and sample replicates are of the collected material. Pardon my awkward phrasing - I am a data specialist, not a biologist.

kmexter commented 1 year ago

Following a meeting we had, I copy here the current preferred version of these definitions. We will discuss it internally in Oct when we go to a GEO BON meeting (where we can solicit other opinions).

fieldReplicate: Where multiple samples are taken in order to replicate the sampling of an area ("field"). These are used to provide a statistical sub-representation of the full area from which the sampling is done.

sampleReplicate: Where a single collected sample is subsequently separated into subsamples, to be processed and analysed independently.

kmexter commented 1 year ago

Hi So after some discussions and some advice from Gwen, I submit the following modified terms

fieldReplicate: The identifier of the area ("field") from where multiple material samples have been independently collected, specifically so as to statistically estimate the sampling accuracy and representation of the larger area in which the sampling is done.

sampleReplicate: The identifier of the collected material sample that is separated/divided into subsamples (at any point) to be processed and analysed thenceforth independently, so allowing for separate upstream handling for each replicate.

FYI:

1. Whether this goes on P01 or somewhere else, I leave up to the BODC experts to decide, because for my use it does not matter.
2. "My use" is to be able to add these terms to the EMOF of our data submitted to EurOBIS, to be able to say "this row (sample) is a sample|field replicate of this (other) materialSample".

If you want more discussion, feel free to email me.

gwemon commented 11 months ago

Hi @kmexter thank you for the update. So if we were to create P01 codes for these, they would be of the form: Identifier of field replicate where Identifier comes from S06 and "field replicate" is a new term in S18/S29 with proposed definition: "area ("field") from where multiple material samples have been independently collected, specifically so as to statistically estimate the sampling accuracy and representation of the larger area in which the sampling is done."

The second P01 would have the following combined preferred label: Identifier of sample replicate with sample replicate defined in S18/S29 as: "Collected material sample that is separated/divided into subsamples (at any point) to be processed and analysed thenceforth independently, so allowing for separate upstream handling for each replicate."

I have asked the OBIS Vocab Team to comment on this proposal. Anybody else welcome to comment. Many thanks.

rubenpp7 commented 11 months ago

Hi,

If I understood well, one of the differences between the two terms that are being proposed is that:

"fieldReplicate"s are expected to be processed in the same way after being subsampled (This is the actual definition of "replicate" subsamples)
"sampleReplicate"s seem to be assumed to be treated separately, as in, one checked for eDNA, and another for nutrients analysis, and another for visual census of organisms. (These are NOT replicates, they are just subsamples of the sample parent event)

Another difference between the terms refers to the parent event type being documented, either a field, or a "sample". However, samples can be subsampled and those sub-samples can also be subsampled and so on. Meaning that the event hierarchy structure doesn't always stop at only two levels.

If we want to create a broad term definition to document the fact that 2 samples are part of the same parent event AND that they can be statistically treated as such I think that a simple "replicateID" would work. In that way we would not need to worry about the specific event hierarchy structure of a given methodology. However, this replicateID field should only be used when two subsamples are real replicates and not just independently treated subsamples.

Regarding the documentation of the event hierarchy, this is more complicated since events can be nested infinitely depending on the methodology, so to keep it flexible enough, other initiatives such as Darwin Core have opted for creating the parentEventID field for each event so the event hierarchy is easily documented regardless of how complex it is.
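To make the parentEventID idea concrete, here is a minimal sketch of how a hierarchy of any depth can be recovered from one parent pointer per event. The event names and the `ancestry` helper are hypothetical illustrations, not part of Darwin Core or of any existing tool.

```python
from typing import Dict, List, Optional

# Hypothetical event table: eventID -> parentEventID (None marks a top-level event).
PARENT_OF: Dict[str, Optional[str]] = {
    "site_1": None,
    "ARMS_unit_1": "site_1",
    "plate_1": "ARMS_unit_1",
    "subsample_1": "plate_1",
}


def ancestry(event_id: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return the chain from event_id up to its top-level ancestor."""
    chain = [event_id]
    seen = {event_id}
    # Follow parent pointers until an event has no recorded parent.
    while parent_of.get(chain[-1]) is not None:
        parent = parent_of[chain[-1]]
        if parent in seen:
            raise ValueError(f"cycle in event hierarchy at {parent!r}")
        chain.append(parent)
        seen.add(parent)
    return chain


print(ancestry("subsample_1", PARENT_OF))
```

Because each event only records its own parent, adding another level of subsampling needs one more row rather than a new term.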

On another note, the field replicates are now pointing at an area, but one can take replicate samples from spaces that are not areas such as volumes or transects also, therefore if a broad term wants to be created we should not specify the space being subsampled.

kmexter commented 11 months ago

Hmm, so your point about the sampleREPLICATE and fieldREPLICATE being actually slightly different definitions of replication - yes, I see, I had not considered that. For field we are assuming the sample is handled in the same way; for sample we allow that not to be the case (it may or may not be). As such we need to change the definition of sampleReplicate to follow that of fieldReplicate better. Otherwise one can use the term sub-sample (is there such a term?). Given that, I would change the definition to

The second P01 would have the following combined preferred label: Identifier of sample replicate with sample replicate defined in S18/S29 as: "Collected material sample that is separated/divided into subsamples (at any point) to be processed and analysed in the same way, specifically so as to statistically estimate the sampling and processing accuracy."

rubenpp7 commented 11 months ago

mm that definition of replicate seems to point to the sample instead of to the subsamples which are actually the replicates.

found this definition for replicate samples that could be adapted for our case: Replicate sample means 2 equal aliquots taken from the same sample container and analyzed independently for the same constituent.

Also replicate samples are not always material. That kind of depends on what the target data is. For example an image with a bunch of organisms (I'm thinking of plankton imaging data) can be subsampled into an image for each organism. I guess that those could also be considered replicates in an analysis.

kmexter commented 11 months ago

Thumbs up to the first - so we change our sampleReplicate definition accordingly, OK BODC?

The second P01 would have the following combined preferred label: Identifier of sample replicate with sample replicate defined in S18/S29 as: "Collected material which are aliquots taken from the same sample container and analysed independently for the same constituent. "

For the second - had not thought of that, for us we are only dealing with physical samples. I am not sure, tho, that your example is replicate samples, as the idea behind this concept is that the aliquots are supposed to be intrinsically the same.

sformel-usgs commented 10 months ago

Hi @kmexter, I'm the node manager for OBIS-USA, Chris Meyer shared this with me. Thanks for taking the reins on a complicated topic!

Unfortunately, I'm not sure if these terms will do much good for your eDNA data yet. Like @rubenpp7 describes, the DwC event hierarchy is generally how we (OBIS-USA/GBIF-US) advise trying to capture nested sampling structures with hierarchical events. But as far as I know, the Occurrence Core, which is what the DNA derived data extension requires, doesn't allow for description of nested events (or at least it won't pass validation in the IPT without an occurrenceID in each row). So, if I'm understanding everything above, I believe you would only be able to link the EMOF to sampleReplicate (the event during which your ASVs were observed), not fieldReplicate. Folks are aware of this limitation and are working on implementing a [new data model](https://www.gbif.org/composition/HjlTr705BctcnaZkcjRJq/gbif-new-data-model).

That being said, it makes sense that you're working on this, these terms can apply for other ARMS data types, and I think these terms will be helpful to the community and eventually be useful for the eDNA too.

How to describe

My takeaway from the above discussion is that the two pieces of information people should be able to derive from these identifiers are (1) the experimental structure, and (2) the statistical dependencies of the data that are anchored to these identifiers.

When I tried to find good definitions of replication to lean on, I struggled (see examples at the bottom). I think how people use 'replication' depends quite a bit on their scientific background and the type of work they do. The four most common modifying themes were 'biological', 'technical', 'spatial', and 'temporal'.

I'm leaning with @rubenpp7 that a general term of replicationID might be more useful by allowing flexibility, where the value of the ID would include the type modifier (e.g. field_rep1, sample_rep1, ARMS_rep1, plate_rep1, water_bottle_rep1, PCR_rep1, etc.).

My apologies if I've misunderstood any of the above conversation! Here are some suggested revisions, based on how I understand it:

P01 codes in the form: Identifier of replicate, where [Identifier](http://vocab.nerc.ac.uk/collection/S06/current/S0600156/) comes from S06 and "replicate" is a new term in S18/S29 with proposed definition: "independently collected samples, specifically collected to estimate information about a statistical population."

The second P01 would have the following combined preferred label: Identifier of technical replicate, where [Identifier](http://vocab.nerc.ac.uk/collection/S06/current/S0600156/) comes from S06 and "technical replicate" is a new term in S18/S29 with proposed definition: "subsample, or repeated processing, of a replicate, where replicate comes from S18/S29, to capture variation in handling of each replicate."

Would this work for ARMS?

I'm unsure because the [ARMS protocol website](https://naturalhistory.si.edu/research/global-arms-program/protocols) describes sampling the entire unit for some protocols, and the individual plates for others. But here is one possibility:

Plate Scraping for eDNA, published with Occurrence Core, DNA extension, and EMOF:

parentEventID == identifier for the ARMS unit (with the caveat that it wouldn't have a corresponding eventID that describes the event)
eventID == identifier for the bulk fraction subsample resulting from [this protocol](https://naturalhistory.si.edu/sites/default/files/media/file/arms-9armsprocessingscraping.pdf) that is the input for downstream eDNA analysis.
technical replicate == identifier in EMOF keyed to eventID

Regardless of where this ends up, I think we should be clear on how we would apply it to the two experimental scenarios y'all brought up (settlement plates and image annotation) and a common water sampling scenario of capturing eDNA and chemistry with a CTD Rosette, since these additional terms will be most helpful if we have ready examples to share with users.

Some examples of different uses of the term 'replicate', if you're interested.

1. [This paper](https://www.nature.com/articles/nmeth.3091) reminds me of my introduction to the subject in university. Essentially, there are 'biological' and 'technical' replicates. [LI-COR](https://www.licor.com/bio/blog/technical-and-biological-replicates) and [Illumina](https://www.illumina.com/Documents/products/technotes/technote_power_replicates.pdf), two instrument makers, echo this (see links).
2. [This paper](https://www.nature.com/articles/s41467-021-21038-1.pdf#:~:text=Observations%20nested%20within%20an%20experimental%20unit%20are%20referred,%E2%80%9C%20where%20replicates%20are%20not%20statistically%20independent%204.) says,
   > Observations nested within an experimental unit are referred to as subsamples, technical replicates, or pseudoreplicates.
3. [This paper](https://link.springer.com/article/10.1007/s00227-023-04205-4) refers to the subsamples of the grabbed sediment as 'technical replicates' and all of the grabs at each station as 'replicate grabs'.
4. [This paper](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0179443) used 'extraction replicates', 'PCR replicates' and 'spatial replicates' to describe both technical and statistical structure.
5. The water scientists at my agency, [USGS, describe replicates](https://pubs.usgs.gov/tm/04/c04/pdf/tm4c4.pdf) as, "_Replicates are two or more water samples that are collected, prepared, and analyzed such that they are considered to be essentially identical in composition and analysis._" But then go on to break them down temporally (concurrent or sequential replicates) and subsampled (split replicates). They also note that other agencies use 'subsample' replicates and 'co-located replicates' instead of 'split'. Finally, they coin a term I have never heard before, 'irreplicates' (which sound like technical replicates to me). Later in this same paper, they use split, concurrent, and sequential as sub-types of 'field replicates', without defining field replicate.
6. [This paper](https://academic.oup.com/icesjms/article/73/3/572/2458712) analyses the replication metadata that is missing in many ocean acidification studies, and how that hinders downstream re-use.

kmexter commented 10 months ago

Hi Sorry for the delay in answering - I was giving a course last week and took some time off to recover.

Now, first let me say that I am not an ontologist, nor a biologist, I am a data manager and I want to have our data in EurOBIS be fully FAIR. In our EMOF we want to be able to say that "this occurrence comes from a material sample that has 2 other ones taken in the same field" or "this occurrence is from a sample that was taken from the same original test tube as this other sample".

So occurrenceID

ARMS_BelgiumCoast_AZFPin_20180712_20180818_MF500_ETOH:ASV1408_03663ed7f0d0e6a85e2a6af2c90d34d03af90641 comes from the material sample with the ID ARMS_BelgiumCoast_AZFPin_20180712_20180818_MF500_ETOH

which has the field ID (the name of the settlement unit these marine organisms were taken from) that is ARMS_BelgiumCoast_AZFPin

Then ARMS_BelgiumCoast_AZFPout_20180712_20180818_MF500_ETOH:ASV1408_03663ed7f0d0e6a85e2a6af2c90d34d03af90641 comes from the material sample with the ID ARMS_BelgiumCoast_AZFPout_20180712_20180818_MF500_ETOH with the field ID ARMS_BelgiumCoast_AZFPout

Now, ARMS_BelgiumCoast_AZFPin and ARMS_BelgiumCoast_AZFPout are units placed very close to each other, that makes them field replicates. When analysing the species distributions vs unit, it will be interesting to know that these two units are effectively sampling the same field. You can figure that out from the lat, long, but it is nice for us to provide the data user with this information by stating that fact in the EMOF. So in this case, we would say that our fieldReplicateID is ARMS_BelgiumCoast_AZFP for both of those occurrences.

You say that you don't think this will work, but I don't understand the reasoning why: it seems clear to me that if in the EMOF we link the occurrenceIDs to these fieldReplicateIDs, then we are telling the data user that all those occurrenceIDs with the same fieldReplicateIDs are coming from the same "field" (as we define a "field").

That you cannot find a good definition for replicate I understand: I opened a kettle of worms when I asked the scientists for this definition, it seemed to me that a very simple, generic, non-specific definition is absolutely impossible?!?! My take is: don't define it - let the project define what they mean by replicate. I don't care what science is being done to create these data, it is irrelevant to the definition of this particular term.

Reading your email: "When I tried to find good definitions of replication to lean on, I struggled (see examples at the bottom). I think how people use 'replication' depends quite a bit on their scientific background and the type of work they do. The four most common modifying themes were 'biological', 'technical', 'spatial', and 'temporal'."

well, our field replicate would be a spatial one, and our sample replicate could be a technical one, I would say. (or it can just be a subsample - this is apparently a word with very different definitions depending on the science, but it does describe what happens when you tip a bottle of 100ml into two bottles of 50ml, which is pretty much what we mean with sample replicate)
The reason we are asking for fieldReplicateID and sampleReplicateID is that in fact we have samples that will be both at the same time, so we thought it was good to have a similar term name. I would be OK with e.g. subSampleID for the sampleReplicateID, if that helps?

"I'm leaning with @rubenpp7 that a general term of replicationID might be more useful by allowing flexibility, where the value of the ID would include the type modifier (e.g. field_rep1, sample_rep1, ARMS_rep1, plate_rep1, water_bottle_rep1, PCR_rep1, etc.)."

* hmm, yes but no: it is less machine-friendly (developer-friendly) to embed extra meaning in a value (that here is an ID). The developer then has to first look for the replicationID, then search for a modifier (figuring out what modifier string has been used) and only then do they know if they have a field replicate or any other type of one. My focus is on machine-friendliness, not human-friendliness, you see, so if a term has a different meaning, then that should be in the term not the value.
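A small sketch of the difference from the developer's side. The record layouts, the `_rep<N>` value convention, and both helper functions are assumptions made up for this illustration; nothing here is an agreed format.

```python
import re
from typing import Optional, Tuple

# Option A: the replicate type is carried by the term name itself.
record_a = {"fieldReplicateID": "ARMS_BelgiumCoast_AZFP"}

# Option B: one generic term; the type is embedded in the value
# (assumed convention: "<type>_rep<number>").
record_b = {"replicationID": "field_rep1"}


def field_replicate_from_term(record: dict) -> Optional[str]:
    """Option A: a direct lookup on a known term."""
    return record.get("fieldReplicateID")


def replicate_from_value(record: dict) -> Optional[Tuple[str, str]]:
    """Option B: find the generic term, then parse the type out of the value."""
    value = record.get("replicationID")
    if value is None:
        return None
    match = re.fullmatch(r"(?P<kind>.+)_rep(?P<number>\d+)", value)
    if match is None:
        # The provider used some other convention: the type cannot be recovered.
        return None
    return match.group("kind"), value


print(field_replicate_from_term(record_a))
print(replicate_from_value(record_b))
print(replicate_from_value({"replicationID": "FieldRep-1"}))
```

The last call shows the problem: as soon as a provider writes the modifier differently, the code that relies on parsing the value silently loses the replicate type.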

I don't quite follow your P01 bit in your email, so unsure if I agree with it. I do like the suggestion - I can't remember if it was Gwen or Ruben - to have not fieldReplicate but fieldReplicateID and sampleReplicateID as the term name, so that it is clear that the value is an ID (that ideally should occur elsewhere in your dataset), not a boolean or string.

Your suggestion

parentEventID == identifier for the ARMS unit (with the caveat that it wouldn't have a corresponding eventID that describes the event)
eventID == identifier for the bulk fraction subsample resulting from [this protocol](https://naturalhistory.si.edu/sites/default/files/media/file/arms-9armsprocessingscraping.pdf) that is the input for downstream eDNA analysis.
technical replicate == identifier in EMOF keyed to eventID

This is now one step beyond my ability to grasp DwC. We have ARMSunitIDs (ARMS_BelgiumCoast_AZFPout), eventIDs (ARMS_BelgiumCoast_AZFPout_20180712_20180818), and materialSampleIDs (ARMS_BelgiumCoast_AZFPout_20180712_20180818_MF500_ETOH) that all need to be associated with an occurrenceID. In the real world the ARMS units are not parents to the events, but more the events are the parents of the arms units. And we have to use occurrence core as these are DNA-based occurrences.

Perhaps faster to talk rather than type? cheers k


sformel-usgs commented 10 months ago

@kmexter thanks for the thoughtful response. I agree that some facetime might save some typing. Would you mind sending me an email (sformel@usgs.gov) to set up a meeting?

Thanks for breaking down the structure of the data a little more. I think the assumption I was making is that you wanted to link occurrenceID to a deeper list of nested eventIDs that captured the library preparation and sequencing workflow. Something like:

ARMS sampler > material sample > DNA extract > sequencing library > occurrenceID

I find that many scientists are interested in this level of granularity so that they can also treat individual extracts as a material sample and link multiple sequencing libraries to the extraction event (e.g. 16S, 18S, etc.)

However, if I'm understanding your description, the occurrenceID would link to the eventID that represents the sampling material, so the structure would be:

ARMS sampler > material sample > occurrenceID

I still think this is problematic for publishing because you can't publish any additional context about the parentEventID with Occurrence Core. Each row (occurrence) would include an eventID and a parentEventID, but you can't publish an additional row describing the parent event, because that row wouldn't include any occurrenceID. This might be better explained in person. Here is an example of what I'm trying to explain:

valid Event Core

| parentEventID | eventID | someOtherTerms |
| --- | --- | --- |
| | site_1 | some context |
| site_1 | ARMS_unit_1 | some context |
| ARMS_unit_1 | plate_1 | some context |
| plate_1 | subsample_1 | some context |

valid Occurrence Core

| parentEventID | eventID | occurrenceID | someOtherTerms |
| --- | --- | --- | --- |
| ARMS_unit_1 | materialSample_1 | occurrence_1.1 | some context |
| ARMS_unit_1 | materialSample_1 | occurrence_1.2 | some context |
| ARMS_unit_2 | materialSample_2 | occurrence_2.1 | some context |

But this isn't valid occurrence core:

| parentEventID | eventID | occurrenceID | someOtherTerms |
| --- | --- | --- | --- |
| | ARMS_unit_1 | | some context |

Maybe that's not a problem for your work, but I wanted to make sure it's clear since this is a confusing aspect of publishing eDNA data with Darwin Core.
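For anyone who wants to check their own table before upload, the constraint can be sketched as below. This only illustrates the rule as I understand it (every Occurrence Core row needs an occurrenceID); it is not the IPT's actual validation code, and the `rows_missing_occurrence_id` helper is hypothetical.

```python
from typing import Dict, List

# The example rows from above, plus the parent-only row that cannot be published.
occurrence_core_rows: List[Dict[str, str]] = [
    {"parentEventID": "ARMS_unit_1", "eventID": "materialSample_1", "occurrenceID": "occurrence_1.1"},
    {"parentEventID": "ARMS_unit_1", "eventID": "materialSample_1", "occurrenceID": "occurrence_1.2"},
    {"parentEventID": "ARMS_unit_2", "eventID": "materialSample_2", "occurrenceID": "occurrence_2.1"},
    {"parentEventID": "", "eventID": "ARMS_unit_1", "occurrenceID": ""},
]


def rows_missing_occurrence_id(rows: List[Dict[str, str]]) -> List[int]:
    """Return the indexes of rows whose occurrenceID is absent or blank."""
    return [
        index
        for index, row in enumerate(rows)
        if not (row.get("occurrenceID") or "").strip()
    ]


print(rows_missing_occurrence_id(occurrence_core_rows))
```

Only the last row, the one that tries to describe the ARMS unit itself, is flagged, which is exactly the context that cannot currently be published alongside the occurrences.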

naming of the terms

I think we're more or less on the same page about the names of the terms. I was trying to generalize the terms for all studies, which is why I was avoiding the word 'field'. For me it's a bit of a loaded term because it's not important that it's in the field, just that it's an independent replicate and not a technical replicate/subsample. However, I understand how it would give your group additional context and I'll defer to the experience of the vocab folks on this one.

sformel-usgs commented 9 months ago

I'm a bit tardy reporting this, but @kmexter and I spoke and agreed on two scenarios that could work. We're curious what the vocab people think.

Scenario 1

Two terms:

replicateID: identifier for a replicate, using a definition like what was discussed above.
replicateType: literal value that can be a controlled list customized to a project's needs. Describes the scope of the replicates, e.g. biological, spatial, technical, etc.

Scenario 2

Four identifier terms, predefining replicate types:

biologicalReplicateID
spatialReplicateID
temporalReplicateID
technicalReplicateID

potentially could have more flavors of these terms, although we recognize this puts more burden on the vocab maintainers.

Thoughts?

kmexter commented 8 months ago

FYI I like both approaches, whichever works best for BODC is acceptable to me.

gwemon commented 8 months ago

Thank you @kmexter @sformel-usgs @rubenpp7 for all the very valuable information. We will consider your suggestions and come back to you shortly.

gwemon commented 8 months ago

@sformel-usgs @kmexter @rubenpp7 Many many thanks for all your comments. Here are my first thoughts about the proposed approaches:

Scenario 1

only needs the creation of 2 P01 codes: one for Identifier of replicate and one for Type of replicate
will only work if we are certain that we will never need more than one replicate ID per eMOF record;
requires the creation of a new vocabulary collection for replicateType so that these can be clearly defined and harmonised

Scenario 2

requires the creation of as many P01 codes as we have replicate types with the possibility of creating one of "unknown type" if this was required for legacy purpose.
prompts users to define unambiguously the kind of replicate the data are attached to
the new terms are defined in an existing collection (as extension to S18/S29) that has multiple usage: as a component of a P01 term or as a stand alone reference vocabulary.

Personally I would prefer scenario 2 because it is easier to manage, it just requires extension to existing vocabulary collections, and you get more for your effort. It is also more robust because there is less risk of data providers omitting to declare the replicate type.
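To illustrate the "more than one replicate ID per eMOF record" concern, here is a minimal sketch with made-up rows for a single occurrence that is both a spatial and a technical replicate. The row layout, the IDs, and both helper functions are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List

OCC = "occurrence_1.1"  # hypothetical occurrenceID

# Scenario 1: generic replicateID / replicateType pairs.
scenario_1 = [
    {"occurrenceID": OCC, "measurementType": "replicateID", "measurementValue": "ARMS_BelgiumCoast_AZFP"},
    {"occurrenceID": OCC, "measurementType": "replicateType", "measurementValue": "spatial"},
    {"occurrenceID": OCC, "measurementType": "replicateID", "measurementValue": "materialSample_1"},
    {"occurrenceID": OCC, "measurementType": "replicateType", "measurementValue": "technical"},
]

# Scenario 2: one term per replicate type.
scenario_2 = [
    {"occurrenceID": OCC, "measurementType": "spatialReplicateID", "measurementValue": "ARMS_BelgiumCoast_AZFP"},
    {"occurrenceID": OCC, "measurementType": "technicalReplicateID", "measurementValue": "materialSample_1"},
]


def has_unpairable_ids(rows: List[dict]) -> bool:
    """Scenario 1: True if any occurrence carries more than one replicateID row,
    so that IDs and types can no longer be matched without an extra convention."""
    counts: Dict[str, int] = defaultdict(int)
    for row in rows:
        if row["measurementType"] == "replicateID":
            counts[row["occurrenceID"]] += 1
    return any(count > 1 for count in counts.values())


def replicate_ids_by_type(rows: List[dict]) -> Dict[str, str]:
    """Scenario 2: read the replicate type straight from the term name."""
    suffix = "ReplicateID"
    return {
        row["measurementType"][: -len(suffix)]: row["measurementValue"]
        for row in rows
        if row["measurementType"].endswith(suffix)
    }


print(has_unpairable_ids(scenario_1))
print(replicate_ids_by_type(scenario_2))
```

In the Scenario 1 rows nothing states which replicateType belongs to which replicateID unless row order or some additional linking field is relied upon, whereas the Scenario 2 rows are self-describing.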

It would be good to test both scenarios in a number of protocol setups. I'd be curious to know if this would also transfer to other non-biological sampling protocols.

Also it would be good to come up with clear definitions for the following terms: biological replicate, spatial replicate, temporal replicate, technical replicate.

kmexter commented 8 months ago

Option 2 sounds good to me. I am probably not the best person to define these replicate types, but based on what has been discussed before in the ARMS group, I would suggest

spatial replicate ID: "Parent identifier for samples collected independently over some area, specifically collected to estimate statistical information about the organisms and/or conditions over that area."

technical replicate ID: "Parent identifier for subsamples, or repeated processings, of a parent sample, to capture variation in handling of each replicate, or simply to repeat the processing." Note that this effectively excludes the case where you have split a sample in two, processed one and stored the other, then taken out the stored one to do something different to it. I think that is acceptable. Note: some people may prefer to call this sampleReplicate.

temporal replicate ID: "Parent identifier for samples collected independently over a timeframe, specifically collected to estimate changes to the organisms and/or conditions over that timeframe." Note: I am not so sure about the need for this one right now, at least WE don't need it

biologicalReplicate: hmm, dunno. will need to keep this different to technical replicate. @sformel-usgs has an opinion? @rubenpp7? Or don't do this one until someone asks for it?

cpavloud commented 8 months ago

Hi everyone,

Coming a bit late to the official discussion....

Biological replicate could be something like this: biologicalReplicate: "Parent identifier for biologically distinct samples that are collected in parallel to capture random biological variation."

Also, spatial replicate could be: spatial replicate ID: "Parent identifier for samples collected independently over some area, specifically collected to estimate statistical information about the presence and spatial distribution of organisms and/or conditions over that area."

sformel-usgs commented 8 months ago

I like the definition for biologicalReplicate suggested by @cpavloud. I agree with @gwemon that it's a good idea for these to work for non-biological data too. I'm fine with the exclusion noted by @kmexter for technical replicate. So, my suggested revisions are:

spatial replicate ID: Parent identifier for samples collected independently over some area, specifically collected to estimate statistical information about that area.

technical replicate ID: Parent identifier for subsamples, or repeated processing, of a parent sample, to capture variation in handling of each replicate, or simply to repeat the processing.

temporal replicate ID: Parent identifier for samples collected independently over a timeframe, specifically collected to estimate changes to the subject of interest over that timeframe.

biological replicate ID: Parent identifier for biologically distinct samples that are collected in parallel to capture random biological variation.

gwemon commented 8 months ago

Thank you @kmexter @cpavloud @sformel-usgs. I need to spend time thinking about these in real-case scenarios, and that makes me think it would be good to have some examples for each of these. For example, if one samples with a quadrat, takes repeat samples from within the quadrat and performs the same analysis on each, would that make them spatial replicates? Technical replicates? Or if the samples are biological observations (e.g. diseased leaves on seagrass), could that also make them biological replicates?

kmexter commented 8 months ago

Settlement units that are spatial replicates: we put 3 settlement plates down, 10m from each other, collect them at the same time, and process them in the same way into 3 samples.

Our technical replicates (please, @cpavloud and @sformel-usgs, correct me if I am wrong): each sample from one settlement plate is split into 3; all are preserved, but 2 are put in the freezer and 1 is shipped to HQ to be turned into sequences (DNA).

cpavloud commented 8 months ago

Katrina provided a good example. For the biological replicates, it is easier to think in a different (lab) setting. For example, if you have two mice that are being fed the same food and you want to assess changes in their kidneys, mouse 1 is a biological replicate of mouse 2 (and vice versa). If you take the kidney of mouse 1 and cut it into 5 pieces, then each piece is a technical replicate of the other 4.

gwemon commented 8 months ago

Thank you @kmexter @cpavloud. I won't have time to review this before I go on leave tonight. I have set up the tickets to create the 4 replicate terms we need to build the P01 codes. If you are keen to have these created asap, I will ask @danibodc to review them with the vocab team on Monday and decide the way forward.

The idea is to use these 4 S29 terms to build the P01 codes as follows:

Preferred label: Identifier of xxx, where xxx is any of the 4 replicate types.

The labels you provided (i.e. temporalReplicateID, biologicalReplicateID, etc.) could be used to populate the Altlabel field. Definitions will be as given above.

kmexter commented 8 months ago

Altlabel field is in which part of the DwC? I don't think we need this in the next few weeks, but it would be nice to have it in May.

roswri commented 8 months ago

Hi @kmexter, @gwemon is on leave so I'll try to help in the meantime, apologies if I have misunderstood the question. Altlabel/Alternative label is part of the P01 information and doesn't have a direct mapping to a Darwin Core element. Ideally, the P01 preferred label should be used for the DwC MeasurementType term, and the DwC MeasurementTypeID would be the URI for the P01 term. However, the P01 Altlabel can help to create a mapping from the terms used in your dataset to the P01 preferred label.
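
As a rough illustration of that mapping, the sketch below translates a dataset's own column names into the Darwin Core fields. It assumes the Altlabels match the names proposed earlier in this thread, and it lists only the two P01 URIs that are quoted further down in this issue. The `P01_BY_ALTLABEL` dictionary and the `to_emof` helper are hypothetical local constructs, not part of the NVS or of any library.

```python
# Hypothetical local lookup from dataset column names to P01 terms.
# Assumption: the P01 Altlabels match the names proposed in this thread.
P01_BY_ALTLABEL = {
    "spatialReplicateID": (
        "Identifier of spatial replicate",
        "http://vocab.nerc.ac.uk/collection/P01/current/IDSPRE01/",
    ),
    "technicalReplicateID": (
        "Identifier of technical replicate",
        "http://vocab.nerc.ac.uk/collection/P01/current/IDTCRE01/",
    ),
}


def to_emof(local_term, value):
    """Return the eMOF fields for one value of a locally named column."""
    preferred_label, uri = P01_BY_ALTLABEL[local_term]
    return {
        "measurementType": preferred_label,  # P01 preferred label
        "measurementTypeID": uri,            # URI of the P01 term
        "measurementValue": value,
    }


record = to_emof("spatialReplicateID", "SP_site1")
```

A column name that is not in the lookup raises a `KeyError`, which is a deliberate choice here so that unmapped terms are noticed rather than silently passed through.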

kmexter commented 8 months ago

thanks for the info

ahpanton commented 7 months ago

4 new P01 codes have been created and are available now:

Identifier of biological replicate

Identifier of technical replicate

Identifier of temporal replicate

Identifier of spatial replicate

sformel-usgs commented 7 months ago

Example of BODC P01 Replicate Terms being used in the Darwin Core 'Extended MeasurementOrFact' extension.

Thought I'd illustrate the above examples for any DwC users who end up here:

Examples from Katrina Exter:

3 settlement plates down, 10m from each other, and collect them at the same time, and process them in the same way into 3 samples

eventID	measurementValue	measurementType	measurementTypeID	measurementRemarks
site_1_settlement_plate_1	SP_site1	Identifier of spatial replicate	http://vocab.nerc.ac.uk/collection/P01/current/IDSPRE01/	This is the identifier used to group all the settlement plates for a single site.
site_1_settlement_plate_2	SP_site1	Identifier of spatial replicate	http://vocab.nerc.ac.uk/collection/P01/current/IDSPRE01/
site_1_settlement_plate_3	SP_site1	Identifier of spatial replicate	http://vocab.nerc.ac.uk/collection/P01/current/IDSPRE01/
settlement_plate_1_BF_1	bulk_fraction_SP1	Identifier of technical replicate	http://vocab.nerc.ac.uk/collection/P01/current/IDTCRE01/	This is the identifier used to group the collection material that is subset, with two replicates stored in the freezer and one sent out for DNA analysis.
settlement_plate_1_BF_2	bulk_fraction_SP1	Identifier of technical replicate	http://vocab.nerc.ac.uk/collection/P01/current/IDTCRE01/	e.g. In the -20C freezer.
settlement_plate_1_BF_3	bulk_fraction_SP1	Identifier of technical replicate	http://vocab.nerc.ac.uk/collection/P01/current/IDTCRE01/	e.g. sent out for sequencing.
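
For anyone consuming a table like this, a minimal Python sketch of recovering the replicate groups follows. It re-enters the rows above (without measurementRemarks) as tab-separated text and groups eventIDs that share the same replicate term and identifier. The `group_replicates` function name is an assumption for illustration.

```python
import csv
import io
from collections import defaultdict

SPATIAL = "http://vocab.nerc.ac.uk/collection/P01/current/IDSPRE01/"
TECHNICAL = "http://vocab.nerc.ac.uk/collection/P01/current/IDTCRE01/"

# The example rows above, with measurementRemarks omitted for brevity.
rows = [
    ("site_1_settlement_plate_1", "SP_site1", SPATIAL),
    ("site_1_settlement_plate_2", "SP_site1", SPATIAL),
    ("site_1_settlement_plate_3", "SP_site1", SPATIAL),
    ("settlement_plate_1_BF_1", "bulk_fraction_SP1", TECHNICAL),
    ("settlement_plate_1_BF_2", "bulk_fraction_SP1", TECHNICAL),
    ("settlement_plate_1_BF_3", "bulk_fraction_SP1", TECHNICAL),
]
emof_tsv = "eventID\tmeasurementValue\tmeasurementTypeID\n" + "\n".join(
    "\t".join(row) for row in rows
)


def group_replicates(tsv_text):
    """Map (replicate term URI, replicate identifier) to a list of eventIDs."""
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        key = (row["measurementTypeID"], row["measurementValue"])
        groups[key].append(row["eventID"])
    return dict(groups)


groups = group_replicates(emof_tsv)
```

Keying on the term URI as well as the identifier keeps a spatial group and a technical group apart even if a project happens to reuse the same identifier string for both.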

Using these terms with DNA data might be hindered by the need to use Occurrence Core (which prevents you from nesting events infinitely). But the community is working on improving that aspect of the data model.

@kmexter @rubenpp7 did I misinterpret anything?

kmexter commented 7 months ago

This is great, thanks all! I plan to use this in our EMOF, and it would be similar to your example above, Stephen:

occurrenceID = blablabla
measurementType = Identifier of spatial replicate
measurementUnit = none
measurementValue = ARMS_Koster_VH_200101_210101
measurementTypeID = http://vocab.nerc.ac.uk/collection/P01/current/IDSPRE01/
measurementUnitID = none
measurementValueID = none

Where the actual identifier for the ARMS unit that this occurrenceID came from is ARMS_Koster_VH1_200101_210101 (VH1 instead of VH, there being also a VH2 and VH3).
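
A small Python sketch of deriving the shared spatial replicate identifier from a unit identifier, going only by the single example given here. It assumes the unit number is the run of trailing digits on the third underscore-separated part (VH1, VH2, VH3), which may not hold for every naming scheme. The `spatial_replicate_id` function is hypothetical.

```python
import re


def spatial_replicate_id(unit_id):
    """Drop the unit number so that all units at a site share one identifier.

    Assumption: the third underscore-separated part carries the unit number
    as trailing digits, e.g. "VH1" -> "VH".
    """
    parts = unit_id.split("_")
    parts[2] = re.sub(r"\d+$", "", parts[2])
    return "_".join(parts)


unit_ids = [
    "ARMS_Koster_VH1_200101_210101",
    "ARMS_Koster_VH2_200101_210101",
    "ARMS_Koster_VH3_200101_210101",
]
shared_ids = {spatial_replicate_id(unit_id) for unit_id in unit_ids}
```

All three units collapse to the single value ARMS_Koster_VH_200101_210101, which is what would go in measurementValue.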

kmexter commented 7 months ago

ah...should I close this issue?

SLBlakeman commented 7 months ago

Hi @kmexter, we left it open to allow for any comments related to the new codes we generated, in case we got requests for amendments. If everyone is happy that we have what is needed (for now anyway), then please do close the ticket. Thanks.

sformel-usgs commented 2 months ago

@kmexter worth noting that the DNA extension now allows use of Event Core (https://github.com/gbif/rs.gbif.org/issues/136), using the same mechanism as extended Measurement Or Fact. This means you can now have nested events that correspond to DNA metadata. It isn't implemented in OBIS/GBIF yet, but they're working on it. Anyway, I emphasized that challenge above, so hopefully this improvement makes life a bit easier.

kmexter commented 2 months ago

Hi, I don't fully understand that other issue, but if this is implemented with examples, I am sure I will be able to follow that. Thanks for the heads-up
