poseidon-framework / poseidon3-schema

Schema of the Poseidon data format
MIT License

Metagenomics modules discussion #4

Open jfy133 opened 4 years ago

jfy133 commented 4 years ago

To talk about metagenomics integration into the schema.

| Field | Type | Options | Definition |
| --- | --- | --- | --- |
| profiler | string | | Can be any metagenomic profiler (e.g. taxonomic, functional, etc.), which is primarily defined by the database |
| database | string | | Name of the database |
| database_desc | sub_table | | |
| profile_tsv | url | | |
| reads_assigned | integer | | Number of reads assigned a category by the profiler |
| proportion_assigned | double | | |
| command | string | | Command (including parameters) used to generate the profile |
| note | string | | |

Subtables

| Header | Field | Type | Options | Definition |
| --- | --- | --- | --- | --- |
| database_desc | repository | | NCBI, JGI, GTDB | |
| database_desc | publication | | | |
| database_desc | taxonomic_schema | | | |
| database_desc | date | | | |
| profiler_desc | version | | | |
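
To make the proposed shape concrete, here is a minimal sketch of what one profile entry with its `database_desc` and `profiler_desc` subtables might look like as JSON; field names follow the tables above, and all values are invented for illustration:

```json
{
  "profiler": "Kraken2",
  "profiler_desc": { "version": "2.0.8-beta" },
  "database": "standard",
  "database_desc": {
    "repository": "NCBI",
    "publication": "Wood2019",
    "taxonomic_schema": "NCBI Taxonomy",
    "date": "2019-06-01"
  },
  "profile_tsv": "https://example.org/profiles/SAMPLE001.kraken2.tsv",
  "reads_assigned": 1284930,
  "proportion_assigned": 0.42,
  "command": "kraken2 --db standard --confidence 0.1 --paired",
  "note": "hypothetical example entry"
}
```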

Requested upstream fields (dnaSequencing or library level)

| Field | Type | Options | Definition |
| --- | --- | --- | --- |
| preindexing_qpcr_copies | numeric | NA | Copy number from qPCR results before indexing of the library. Used for the OTU contaminant method when comparing blanks; also a proxy for aDNA yield. |

jfy133 commented 4 years ago

This becomes a bit scary, but should we include parameters as well? It might be overkill in terms of complexity, though.

stschiff commented 4 years ago

I don't think we should require perfect reproducibility. Or, rather, if you want that, perhaps it's better to link to a protocols.io recipe, or some Nextflow pipeline or something?

jfy133 commented 4 years ago

Yeah, exactly. Something for us to consider: what is the most efficient method? Maybe just including a note saying 'check this article for parameters' would be sufficient.

Standardisation is something we want to push for in our work as we are scaling up, though, which is why I'm bringing this up here. But that might be for a different discussion elsewhere 👍

maxibor commented 4 years ago

One idea could be to run every sample through the same processing pipeline (nf-core/eager), since we wrote it to give both genomics and metagenomics results.

jfy133 commented 4 years ago

That is something we should definitely do. However, it sounds like pop-gen will be running the internal pipeline, so that is something to discuss with them.

However, for the schema itself, we should be careful that it's generic enough that groups who don't use nf-core/eager can also add entries. Also, over time, profilers other than Kraken (and, to some extent, MALT) will come into use.

Could we just have a field that has to contain the command used for the taxonomic profiler?

Another question is whether we should restrict this to taxonomic profiling or also allow any profiler, e.g. HUMAnN2 and functional profiling?

alexhbnr commented 4 years ago

I think we can keep it generic because, in the end, all of these tools are profilers of some kind, and only the reference database determines whether you get taxa, functional pathways, etc. Therefore, if we keep it generic, we can save ourselves some hassle in the future in case a new tool is released.

jfy133 commented 4 years ago

So minimum fields:

Any other useful metadata? Number of reads assigned, for example?

Another question: are we integrating this into 'sequencingDna' or a separate module? We will need to check the phrasing so that it doesn't overlap with and is distinct from the pop-gen fields.

I will make a table in the OP to keep track.

maxibor commented 4 years ago

Hmm, in curatedMetagenomicData they chose to go only with MetaPhlAn2.

alexhbnr commented 4 years ago

I think these minimum fields plus the number of assigned reads should be fine.

I would not put it into sequencingDna: the results of the profiling are related to 'sequencingDna', but you could have multiple profiles per sequencing run, one for each profiler. I would only keep the information about the reads aligning to the host genome in sequencingDna.

Regarding curatedMetagenomicData, we don't have to restrict ourselves as long as we agree on a standard database for the profiler.
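
As a hedged sketch of that layout (the module and field names here are assumptions for illustration), a sequencing run could then carry an array of profile entries, one per profiler, alongside rather than inside the sequencingDna information:

```json
{
  "sequencing_run": "RUN001",
  "metagenomics": [
    { "profiler": "Kraken2", "database": "standard", "profile_tsv": "https://example.org/RUN001.kraken2.tsv" },
    { "profiler": "MALT", "database": "nt", "profile_tsv": "https://example.org/RUN001.malt.tsv" }
  ]
}
```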

jfy133 commented 4 years ago

Sure. We can still have our own standards internally, and ship all our modules accordingly, but I don't think Stephan's idea of this framework is to force particular standards on people. It is just meant to be a consistent way of reporting metadata.

jfy133 commented 4 years ago

> I think these minimum fields plus the number of assigned reads should be fine.
>
> I would not put it into sequencingDna: the results of the profiling are related to 'sequencingDna', but you could have multiple profiles per sequencing run, one for each profiler. I would only keep the information about the reads aligning to the host genome in sequencingDna.
>
> Regarding curatedMetagenomicData, we don't have to restrict ourselves as long as we agree on a standard database for the profiler.

@stschiff how do you think this would then work with the current title of sequencingData?

maxibor commented 4 years ago

> Regarding curatedMetagenomicData, we don't have to restrict ourselves as long as we agree on a standard database for the profiler.

I know that's a bit restrictive, but if we allow for different profilers, we completely lose the possibility of doing any sort of meta-analysis...

alexhbnr commented 4 years ago

Not necessarily; we just need to record the profiler plus its version and which public database we are using. curatedMetagenomicData includes MetaPhlAn2 profiles, but if you install the most recent MetaPhlAn2 version, you end up using a different database by default. So we would have to record it somehow anyway.

jfy133 commented 4 years ago

> Regarding curatedMetagenomicData, we don't have to restrict ourselves as long as we agree on a standard database for the profiler.
>
> I know that's a bit restrictive, but if we allow for different profilers, we completely lose the possibility of doing any sort of meta-analysis...

We can ensure the raw sequencing read file is provided, and we can re-process everything ourselves with nf-core/eager for consistency. There is no reason why we can't have sample data from multiple labs with different profilers; we just host all our 'opinionated' versions of the JSONs ourselves. They would ultimately point to the same 'Individual' URI, but with a different 'publication source' to indicate we processed it differently.
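
As a hedged sketch of that idea (the field names here are entirely hypothetical), two packages describing the same individual would share the URI but differ in their stated source:

```json
{
  "individual": "https://example.org/individuals/IND001",
  "publication_source": "SHH reprocessed with nf-core/eager",
  "metagenomics": [
    { "profiler": "Kraken2", "profile_tsv": "https://example.org/IND001.kraken2.tsv" }
  ]
}
```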

Also, just to point out: if we publish hundreds of our samples in the same way/format, people will probably just copy it anyway to make the datasets mergeable with theirs in the fastest way ;)

jfy133 commented 4 years ago

Other possible metadata: qPCR values and batch ID to control for contamination (decontam) and batch effects (something we aren't doing at the moment).

The issue here is that this would be at the sequencing level, so we would need to check with Stephan that he's OK with adding them there. We can specify in our own JSON packages that this field is required, which can be done with a validation tool we create ourselves for all metagenomics JSONs, if we think this should be required, that is.

alexhbnr commented 4 years ago

I think that would be a good idea, but should it be incorporated in the Pandora-like table with information on the library prep, or will this not be included? qPCR results are library-related, not profiler-related, from a SQL point of view. Am I missing something here?

jfy133 commented 4 years ago

No, you're correct: it wouldn't be within the metagenomics module itself but one level up, at the same level as the double/single-stranded and UDG treatment metadata. These are entries in the JSON if you look under the 'dnaSequencing' schema.

stschiff commented 4 years ago

Phew... lots of discussion. Just some comments on the things I feel I can say something about now:

> I would not put it into sequencingDna: the results of the profiling are related to 'sequencingDna', but you could have multiple profiles per sequencing run, one for each profiler. I would only keep the information about the reads aligning to the host genome in sequencingDna.

Indeed, dnaSequencing is meant to report raw library stats + host_organism_mapping. So if you have human tissue, it may include mapping to the human reference genome; if you have dog tissue, it may include dog-genome mapping (or not, but not any other mapping). So I would suggest putting metagenomics info into its own table. I would also put any secondary mappings, such as pathogen mapping, into a separate table.

I'm also willing to discuss separating sequencing from mapping, but at this point I felt it would be overkill... happy to discuss, though.
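
To illustrate the proposed separation, a rough sketch of the top-level layout; the table names other than dnaSequencing, and all values, are placeholders:

```json
{
  "dnaSequencing": {
    "raw_reads": 20000000,
    "host_organism_mapping": { "reference": "hs37d5", "reads_mapped": 150000 }
  },
  "metagenomics": [
    { "profiler": "Kraken2", "profile_tsv": "https://example.org/SAMPLE001.kraken2.tsv" }
  ],
  "pathogen": [
    { "reference": "Yersinia pestis CO92", "reads_mapped": 3421 }
  ]
}
```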

With respect to restrictive vs. general: I'd like to have a discussion with you about what sort of meta-analyses you envision with this. What Robert Forkel emphatically recommended is to not make this format as broad and generic as possible, because that will lead to people using it for too many diverse things, and in the end the data won't be useful to have together. Perhaps we should have a (remote) meeting about this where you tell me what sort of analyses you envision. For the human side this is quite clear, but I don't know enough about metagenomics.

jfy133 commented 4 years ago

We can discuss the separation of sequencing from mapping a bit further. While it would break up your currently very nice four levels a bit, 'dnaSequencing' is indeed maybe a bit of a misnomer there, in the sense that dnaSequencing to me covers everything up to a raw FASTQ (info about the library, how many reads were physically sequenced). This would also work e.g. for proteins and stable isotopes (collagen extraction is the main experiment; then you can do bulk protein analysis, single amino acid analysis, radiocarbon dating, etc. as different analyses, in a way).

And from there, the FASTQ can then go into any of: metagenomics, pathogen screening, host mapping - i.e. an analysis level. Of course at SHH we often do multiple of these at once (and nf-core/eager helps with this), but it is not necessarily standard.

It is a fair point about making it too broad. I guess the difference with pop-gen is that there isn't yet a consensus in the field on a standard profiler, etc. (but maybe something to bring up at SPAAM2, @maxibor @alexhbnr?). The data standard is there (literally a TSV); it's what goes into making the TSV that is the issue for us.

Yes, maybe a remote meeting would be good, and you can advise us.

jfy133 commented 4 years ago

More #ShowerThoughts

I'm not sure how much sense including the quantification makes now, given that there is no scope for including control-like 'samples', and it wouldn't necessarily fit in the structure Stephan is thinking about.

Could we not just store the OTU table as an actual CSV? It would only be two columns (OTU names and sample names; although quite a few rows), but would still be relatively lightweight.
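
For illustration, a hypothetical single-sample OTU table in that shape, with the OTU names in the first column and that sample's read counts under the sample name in the second (all taxa and counts invented):

```csv
otu_name,SAMPLE001
Yersinia pestis,1523
Streptococcus mutans,87
Homo sapiens,412001
```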

Still can't work out a way to describe the database yet (maybe just a key based on the first paper it was published in?), but the parameters used for the classifier could be the literal command string without the input/output parameters?
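
For example, a paper-based database key plus a stripped-down command string in that spirit might look like this (the key format and the flags kept are assumptions for illustration):

```json
{
  "database": "Wood2019_kraken2_standard",
  "command": "kraken2 --confidence 0.1 --paired"
}
```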