NTR: abundance of sequence reads [BODCNVS-2107]

pieterprovoost commented 8 months ago

Describe the parameter code you need

Metabarcoding datasets usually come with relative abundances in terms of sequence reads per ASV or OTU. This is a request for terms for:

Total abundance of sequence reads in a PCR product
Abundance of sequence reads belonging to a specific ASV or OTU in a PCR product

What would be its expected units of measure or its vector dimensions

Dimensionless

What property kind could be used to define the type of measurement this relates to

Abundance

What is the primary object of interest (i.e. a chemical, biological or physical entity)

Sequence reads

Would we need to define a matrix? If so what should the matrix be?

Possibly PCR product.

gwemon commented 8 months ago

@pieterprovoost Thanks Pieter. A few quick comments as I am about to go on leave. The Vocab team will pick this up and, if urgent, will discuss with you how best to take this forward. Here are some of my thoughts.

We have "Relative abundance" defined in S06 already: https://vocab.nerc.ac.uk/collection/S06/current/S0600020/

We also have some examples of similar kinds of quantities defined in P01 (but not yet transferred to the PUV semantic model). See: https://vocab.nerc.ac.uk/search_nvs/P01/?searchstr=Relative%20abundance&options=identifier,preflabel,altlabel,status_accepted&rbaddfilter=inc&searchstr2=

They are very detailed with regard to the methodology, but I wonder if we could draw on them to create the more generic codes you need? If required, we might need to define "Absolute abundance" for this kind of data in S06 to avoid the ambiguity of using the term "Abundance".

I did a quick search and here are a few pointers: ASV stands for Amplicon Sequence Variant and OTU for Operational Taxonomic Unit (see e.g. https://journals.asm.org/doi/10.1128/msphere.00191-21). They are effectively "things" that are being quantified in terms of number of sequence reads in order to determine their absolute and/or relative abundance?

Is "abundance" really the correct term? Could we maybe model these quantities instead as, e.g.:

Count of sequence read (total) by PCR analysis
Count of sequence read (per Amplicon Sequence Variant specified elsewhere) by PCR analysis
Count of sequence read (per Operational Taxonomic Unit specified elsewhere) by PCR analysis

At first I'd imagine we could define the sequence read (total) and sequence read (ASV/OTU) in our S29 vocabulary, defining sequence read (the "root" of the combined term) in S18. Adding "by PCR analysis" enables us to place this P01 term in context without having to use the "matrix" component, which should be reserved for defining in what "environment" the measurement was made.

danibodc commented 7 months ago

Hi @pieterprovoost have you had a chance to look into this? Many thanks!

kmexter commented 5 months ago

I was just looking for a term for this myself, so I support this request. When I asked what to call it, our experts indeed suggested "number of reads" rather than "abundance". I think it is good to be clear whether this is a total or a relative number, so it is good to have the word "total" in there.
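To make the total/relative distinction concrete, here is a minimal sketch. The sequence labels and read counts are invented for illustration, and it assumes the total is taken as the sum of the per-sequence counts in one sample, which may not match how a given pipeline reports its total.

```python
# Hypothetical per-ASV/OTU read counts for one sample (values invented for illustration).
read_counts = {"asv_a": 120, "asv_b": 30}

# Assumption: the "total" is the sum of the per-sequence counts in this sample.
total_reads = sum(read_counts.values())

# A relative number is each count divided by that total; the raw counts stay untouched.
relative = {name: count / total_reads for name, count in read_counts.items()}

print(total_reads)
print(relative)
```

The raw counts and the total are both integers, whereas the relative values are fractions that only make sense alongside the total they were divided by.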

gwemon commented 5 months ago

Thanks @kmexter. Joanna has also forwarded the ticket to a couple of colleagues. We will wait a week or two to hear from them. As a quick summary, our suggestion is currently to have 3 new P01 codes modelled as (see long comment above for details):

Count of sequence read (total) by PCR analysis
Count of sequence read (per Amplicon Sequence Variant specified elsewhere) by PCR analysis
Count of sequence read (per Operational Taxonomic Unit specified elsewhere) by PCR analysis

with:

Count from S06 - exists already
[sequence read] defined in S18 - new physical entity term
[sequence read (total)] defined in S29 - new complex physical entity term
[sequence read (per Amplicon Sequence Variant specified elsewhere)] defined in S29 - as above
[sequence read (per Operational Taxonomic Unit specified elsewhere)] defined in S29 - as above
polymerase chain reaction from S04 - exists already

If we go for this we will need to decide whether we need to remodel and align our legacy "Relative abundance of amplifiable DNA sequences" codes.
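The modelling proposed in this comment can be sketched as a simple composition of components. This is only an illustration of the label pattern, not how NVS builds P01 terms; `compose_label` and its parameter names are hypothetical.

```python
def compose_label(property_label, entity_label, method_label=None):
    """Join components as '<property> of <entity>' plus an optional ' by <method>'.

    Hypothetical helper for illustration; not part of any NVS tooling.
    """
    label = f"{property_label} of {entity_label}"
    if method_label:
        label += f" by {method_label}"
    return label


# Entity labels as proposed for S29 in the comment above.
entities = [
    "sequence read (total)",
    "sequence read (per Amplicon Sequence Variant specified elsewhere)",
    "sequence read (per Operational Taxonomic Unit specified elsewhere)",
]

# "Count" would come from S06 and the method from S04, per the proposal.
labels = [compose_label("Count", entity, "PCR analysis") for entity in entities]

for label in labels:
    print(label)
```

Leaving `method_label` out gives the shorter form without the "by ..." suffix.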

pieterprovoost commented 4 months ago

I just realized my description is not very clear. I agree that count is better, and in fact my intention for this term was the number of reads from sequencing, which is of course not the same as the total number of molecules in the PCR product. But we may also need a term for the number of copies in the sample from qPCR, and I will create a new ticket for that if necessary. I'll check with @SSuominen1 to get a clearer description for this one.

LynnDelgat commented 4 months ago

Indeed having a term for the number of reads and the total number of reads will be very useful. Some comments from my side:

Agree to use "count" rather than abundance/relative abundance
I would not add the "by PCR analysis" as the ASV/OTU sequences and accompanying read counts are generated by a bioinformatic analysis performed on the raw sequence reads generated by sequencing
I think I wouldn't necessarily distinguish between an ASV and an OTU, to make the parameter simpler and more universal (unless there would be some added value in splitting them that I'm unaware of)

So for example:

Count of sequence reads (total)
Count of sequence reads (per Operational Taxonomic Unit or Amplicon Sequence Variant specified elsewhere)

gwemon commented 4 months ago

Thank you @pieterprovoost and @LynnDelgat. It looks like we all agree on "Count". The risk with oversimplifying the P01 description is that it may become ambiguous when taken out of context, but I don't think that would apply here, so yes, we can remove "by PCR". Regarding creating separate codes for ASV and OTU, the main reason would be the possibility of them coexisting as two separate variables in a dataset. If this is not an option, then it is okay to have them as "specified elsewhere". If there is a bundle of parameters likely to be needed to support common genomics datasets, I think it would be good to have them as a list so that we can start seeing common ways to model and define the various elements of the parameters. But no problem if this list is not easy to compile at this stage. We can go ahead and create these codes if there is consensus.

gwemon commented 4 months ago

One question, @LynnDelgat, from somebody not used to this type of data: how would you specify the ASV/OTU sequences elsewhere? Would that be specified in the Occurrence record of the DwCA format? How would people know that the count in the measurementValue field refers to an ASV or an OTU?

LynnDelgat commented 4 months ago

@gwemon The ASV/OTU sequences would be specified in the DNA extension (in the DNA_sequence field). They are linked 1:1 with occurrences, so occurrenceID should allow people to know which specific OTU/ASV (=> DNA_sequence) the count is for. How users would know if the sequence is an ASV/OTU is a bit complicated at the moment (there is no "sequence type" or similar field, though it could be deduced from a field that contains the OTU/ASV generation method, if it is filled in), but that's a separate issue from the counts parameters I would say.
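The linkage described above can be sketched as a join on occurrenceID. This is a minimal illustration, assuming the extension tables have already been read into lists of dictionaries; the rows, identifiers, sequences, and counts are invented, and `measurementType` is assumed here as the field that would carry the parameter label.

```python
# Hypothetical DNA extension rows: one DNA_sequence per occurrence (1:1).
dna_extension = [
    {"occurrenceID": "occ-1", "DNA_sequence": "ACGTACGT"},
    {"occurrenceID": "occ-2", "DNA_sequence": "TTGACCGA"},
]

# Hypothetical measurement rows holding the read counts (values invented).
measurements = [
    {"occurrenceID": "occ-1", "measurementType": "Count of sequence reads", "measurementValue": 120},
    {"occurrenceID": "occ-2", "measurementType": "Count of sequence reads", "measurementValue": 30},
]


def reads_per_sequence(measurements, dna_extension):
    """Map each DNA_sequence to its read count by joining on occurrenceID."""
    sequence_by_occurrence = {
        row["occurrenceID"]: row["DNA_sequence"] for row in dna_extension
    }
    return {
        sequence_by_occurrence[m["occurrenceID"]]: m["measurementValue"]
        for m in measurements
        if m["occurrenceID"] in sequence_by_occurrence
    }


counts = reads_per_sequence(measurements, dna_extension)
print(counts)
```

Nothing in this join says whether a given sequence is an ASV or an OTU, which is the gap noted above.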

gwemon commented 4 months ago

@LynnDelgat So until there is a field in the DwC format that caters for this, the information can only be extracted by knowledgeable human users, if I understand correctly. If the plan is to create such a field to store this information, then that's okay. Thank you.

gwemon commented 4 months ago

@LynnDelgat Having read a bit more about ASVs and OTUs, I agree with you about not distinguishing between them in the P01 code.

nvs-vocabs / P01

NTR: abundance of sequence reads [BODCNVS-2107] #251