Open jfy133 opened 4 years ago
This gets a bit scary, but should we include parameters as well? It might be overkill in terms of complexity, though.
I don't think we should require perfect reproducibility. Or, rather, if you want that, perhaps it's better then to link to a protocols.io recipe, or some nextflow pipe or something?
Yeah exactly, something for us to consider in terms of the most efficient method. Maybe just including a note saying 'check this article for parameters' would be sufficient.
Standardisation is something we want to push for in our work as we are scaling up, though, which is why I'm bringing this up here. But that might be for a different discussion elsewhere 👍
One idea could be to run every sample through the same processing pipeline (nf-core/eager), since we wrote it to give both genomics and metagenomics results.
That is something we should definitely do. However it sounds like Pop-gen will be running the internal pipeline, so something to discuss with them.
However, for the schema itself we should be careful to keep it generic enough that groups who don't use nf-core/eager can also add stuff. Also, over time, profilers other than Kraken (and, to some extent, MALT) may be used.
Could just have a field that needs to contain the command used for the taxonomic profiler?
Another question is whether we should restrict this to taxonomic profiling, or also allow any profiler, e.g. humann2 and functional stuff?
I think we can keep it generic because in the end all of these tools are profilers of some kind and only the reference database determines whether you have taxa, functional pathways etc. Therefore, if we keep it generic, we can save us some hassle in the future in case some new tool is released.
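To illustrate the point, a minimal sketch in Python (field names, tool versions, and database identifiers here are hypothetical, not the agreed schema): the same generic record shape can hold either a taxonomic or a functional profile, with only the database entry differing.

```python
# Hypothetical sketch: the same generic record shape can describe both a
# taxonomic and a functional profile -- only the reference database differs.
taxonomic_profile = {
    "profiler": "MetaPhlAn2",
    "database": "mpa_v20_m200",  # marker-gene DB -> taxonomic output
    "profile_tsv": "https://example.org/sample1_metaphlan2.tsv",
}

functional_profile = {
    "profiler": "humann2",
    "database": "uniref90",      # protein DB -> functional pathways
    "profile_tsv": "https://example.org/sample1_humann2.tsv",
}

# Both records share the same keys, so one generic schema covers both cases.
assert taxonomic_profile.keys() == functional_profile.keys()
```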
So minimum fields:
Any other useful metadata? Number of reads assigned, for example?
Another question is whether we are integrating this into 'sequencingDna' or a separate module. We will need to check the phrasing so that it doesn't overlap and is distinct from the pop-gen fields.
I will make a table in the OP to keep track?
Hmm, in curatedMetagenomicData they chose to go only for MetaPhlAn2.
I think these minimum fields plus the number of assigned reads should be fine.
I would not put it into sequencingDna because the results of the profiling are related to 'sequencingDNA' but you could have multiple profiles per sequencing run, one for each profiler. I would only keep the information about the reads aligning to the host genome in sequenceDna.
Regarding curatedMetagenomicData, we don't have to restrict ourselves as long as we agreed on a standard database of the profiler.
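The split described above could look something like this (a hypothetical sketch; key names and numbers are illustrative, not the agreed schema): host-mapping stats stay with the sequencing record, while profiling results live in a separate list so one run can carry several profiles.

```python
# Hypothetical layout: one sequencing run, multiple metagenomics profiles.
sequencing_run = {
    "run_id": "LIB001.A0101",
    "reads_raw": 25_000_000,
    "reads_mapping_host": 1_200_000,   # stays at the sequencingDna level
    "metagenomics_profiles": [         # one entry per profiler
        {"profiler": "MALT", "database": "refseq_bacteria_2019"},
        {"profiler": "Kraken2", "database": "standard_2019"},
    ],
}

print(len(sequencing_run["metagenomics_profiles"]))  # 2
```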
Sure. We can still internally have our standards, and ship all our modules accordingly, but I don't think the Stephan's idea of this framework is to force particular standards on people. It is just meant to be a consistent way of reporting metadata.
> I think these minimum fields plus the number of assigned reads should be fine.
> I would not put it into sequencingDna because the results of the profiling are related to 'sequencingDNA' but you could have multiple profiles per sequencing run, one for each profiler. I would only keep the information about the reads aligning to the host genome in sequenceDna.
> Regarding curatedMetagenomicData, we don't have to restrict ourselves as long as we agreed on a standard database of the profiler.
@stschiff how do you think this would then work with the current title of sequencingData?
> Regarding curatedMetagenomicData, we don't have to restrict ourselves as long as we agreed on a standard database of the profiler.
I know that's a bit restrictive, but if we allow for different profilers, we completely lose the possibility of doing any sort of meta-analysis...
Not necessarily, we just need to record the profiler plus its version and which public database we are using. curatedMetagenomicData includes MetaPhlAn2 profiles, but if you install the most recent MetaPhlAn2 version, you end up using a different database by default. So we would have to record it somehow anyway.
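As a sketch of why all three fields matter (version strings and database names here are illustrative, not checked against MetaPhlAn2 releases): the same tool name can imply different default databases across versions, so comparability has to be decided on the recorded database, not the profiler name.

```python
# Sketch: record profiler, version, and database explicitly, since the
# default database can change between tool versions (values illustrative).
profile_a = {"profiler": "MetaPhlAn2", "version": "2.6.0",
             "database": "mpa_v20_m200"}
profile_b = {"profiler": "MetaPhlAn2", "version": "2.9.21",
             "database": "mpa_v296_CHOCOPhlAn_201901"}

# Same tool name, but the profiles are only comparable if the databases match.
comparable = profile_a["database"] == profile_b["database"]
print(comparable)  # False
```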
> Regarding curatedMetagenomicData, we don't have to restrict ourselves as long as we agreed on a standard database of the profiler.
> I know that's a bit restrictive, but if we allow for different profilers, we completely lose the possibility of doing any sort of meta-analysis...
We can ensure the raw sequencing read file is provided, and we can re-process stuff ourselves with nf-core/eager for consistency. There is no reason why we can't have sample data from multiple labs with different profilers; we just host all our 'opinionated' versions of the JSONs ourselves. They would ultimately point to the same 'Individual' URI, but with a different 'publication source' to indicate we processed it differently.
Also, just to point out, if we publish hundreds of our samples in the same way/format, people will probably just copy it anyway to make their datasets mergeable with ours in the fastest way ;)
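A rough illustration of the 'opinionated JSONs' idea (all URIs and source labels below are made up): two independently processed records can point at the same individual while differing in their publication source.

```python
# Hypothetical: the original record and our re-processed record share the
# same 'Individual' URI but carry different publication sources.
original_record = {
    "individual": "https://example.org/individuals/IND001",
    "publication_source": "original-study",
    "profiler": "Kraken2",
}
reprocessed_record = {
    "individual": "https://example.org/individuals/IND001",  # same individual
    "publication_source": "internal-nf-core-eager-reprocessing",
    "profiler": "MALT",
}

# Same individual, different processing provenance.
assert original_record["individual"] == reprocessed_record["individual"]
assert (original_record["publication_source"]
        != reprocessed_record["publication_source"])
```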
Other possible metadata: qPCR values and batch ID to control for contamination (decontam) and batch effects (something we aren't doing at the moment).
The issue here is that this should be at the sequencing level, so we would need to check with Stephan that he's OK with adding them there. We can specify in our own packages of JSONs that this field is required, which can be done with our own validation tool that we can create for all metagenomics JSONs. If we think this should be required, that is.
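A minimal sketch of the kind of in-house validation tool mentioned above (field names like `qpcr_value` and `batch_id` are illustrative): check that metagenomics JSON records carry the fields we require internally, even if the upstream schema leaves them optional.

```python
# Fields *we* require in our own packages, beyond the generic schema
# (names are hypothetical, not the agreed schema).
REQUIRED_FIELDS = {"profiler", "database", "qpcr_value", "batch_id"}

def missing_fields(record: dict) -> set:
    """Return the required fields absent from a metagenomics record."""
    return REQUIRED_FIELDS - record.keys()

record = {"profiler": "MALT", "database": "refseq_2019", "batch_id": "B07"}
print(sorted(missing_fields(record)))  # ['qpcr_value']
```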
I think that would be a good idea, but shouldn't it be incorporated in the Pandora-like table with information on the library prep, or will this not be included? Because qPCR results are library related, not profiler related, from a SQL point of view. Am I missing something here?
No, you're correct, it wouldn't be within the metagenomics module itself but one level up, at the same level as the double/single stranded and UDG treatment metadata. These are entries in the JSON if you look under the 'dnaSequencing' schema.
phew... lots of discussion. Just some comments that I feel I can say something to now:
> I would not put it into sequencingDna because the results of the profiling are related to 'sequencingDNA' but you could have multiple profiles per sequencing run, one for each profiler. I would only keep the information about the reads aligning to the host genome in sequenceDna.
Indeed, dnaSequencing is meant to report raw library stats + host_organism_mapping. So if you have human tissue, it may include mapping to the human reference genome; if you have dog tissue, it may include dog genome mapping (or not, but not any other mapping). So I would suggest putting metagenomics info into its own table. I would also put any secondary mappings into a separate table, such as Pathogen.
I'm also willing to discuss separating sequencing from mapping, but at this point felt it would be overkill... happy to discuss though.
With respect to restrictive vs. general: I feel I'd like to have a discussion with you what sort of meta-analyses you envision with this. What Robert Forkel emphatically recommended is to not make this format as broad and generic as possible, because that will lead to people using it for too diverse things and in the end data won't be useful to have together. Perhaps we should have a (remote) meeting about this and you tell me what sort of analyses you envision. For the human side, this is quite clear, but I don't know enough about meta-genomics.
We can discuss the separation of sequencing from mapping a bit further. It would break up your currently very nice 4 levels a bit, but 'dnaSequencing' is indeed maybe a bit of a misnomer there, in the sense that dnaSequencing to me is everything up to a raw FASTQ (info about the library, how many reads were physically sequenced). This would also work e.g. for proteins and stable isotopes (collagen extraction is the main experiment, then you can do either bulk protein analysis, single amino acid analysis, radiocarbon dating, etc. as different analyses, in a way).
And from there, the FASTQ can then go into any of: metagenomics, pathogen screening, host mapping - i.e. an analysis level. Of course at SHH we often do multiple of these at once (and nf-core/eager helps with this), but it is not necessarily standard.
It is a fair point about making it too broad. I guess the difference from pop-gen is that there isn't yet a consensus in the field on a standard profiler etc. (but maybe something to bring up at SPAAM2, @maxibor @alexhbnr?). The data standard is there (literally a TSV); it's what goes into making the TSV that is the issue for us.
Yes maybe a remote meeting would be good and you can advise us.
More #ShowerThoughts
I'm not sure how much sense including the quantification makes now, given that there is no scope for including control-like 'samples', and it wouldn't necessarily fit in the structure Stephan is thinking about.
Could we not just store the OTU table as an actual CSV? It would only be two columns (OTU name and sample counts; although quite a few rows), but would still be relatively lightweight.
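One possible shape for such a file, sketched with the standard library (taxon names and counts are made up for illustration):

```python
import csv
import io

# Sketch: a per-sample OTU profile stored as a lightweight two-column CSV
# (taxon name, read count); names and counts are invented for illustration.
otu_counts = [
    ("Streptococcus mutans", 15234),
    ("Tannerella forsythia", 8721),
    ("Methanobrevibacter oralis", 431),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["otu_name", "read_count"])  # header row
writer.writerows(otu_counts)                 # one row per OTU
print(buf.getvalue())
```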
Still can't work out a way to describe the database yet (maybe just a key based on the first paper it was published in?), but the parameters used for the classifier could be the literal command string, minus the input/output parameters?
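Dropping the input/output parameters from a recorded command could be sketched like this (the flag names recognised here are illustrative; a real tool would need its actual I/O flags listed):

```python
import shlex

# Flags whose following value is a per-sample input/output path and should
# not be stored in the schema (assumed flag names, not a real tool's set).
IO_FLAGS = {"-i", "-o", "--input", "--output"}

def strip_io_params(command: str) -> str:
    """Return the command string with I/O flags and their values removed."""
    kept, skip = [], False
    for tok in shlex.split(command):
        if skip:            # drop the value that followed an I/O flag
            skip = False
            continue
        if tok in IO_FLAGS:
            skip = True
            continue
        kept.append(tok)
    return " ".join(kept)

print(strip_io_params("malt-run -i sample1.fastq -o sample1.rma6 -m BlastN"))
# malt-run -m BlastN
```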
To talk about metagenomics integration into the schema.

| Field | Type |
| --- | --- |
| profiler | string |
| database | string |
| database_desc | sub_table |
| profile_tsv | url |
| reads_assigned | integer |
| proportion_assigned | double |
| command | string |
| note | string |

Subtables

- database_desc
- profiler_desc
  - version

Request upstream fields (dnaSequencing or library level)
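A hypothetical record instantiating the proposed fields (all values invented for illustration; the `database_desc` subtable is left empty here):

```python
# Example record following the proposed field list (values illustrative).
profile_record = {
    "profiler": "MALT",                     # string
    "database": "refseq_bacteria_2019",     # string
    "database_desc": {},                    # sub_table (see Subtables)
    "profile_tsv": "https://example.org/sample1_profile.tsv",  # url
    "reads_assigned": 154_321,              # integer
    "proportion_assigned": 0.0062,          # double
    "command": "malt-run -m BlastN",        # string, minus I/O parameters
    "note": "screened with nf-core/eager",  # string
}

# A proportion should always be a fraction between 0 and 1.
assert 0.0 <= profile_record["proportion_assigned"] <= 1.0
```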