Closed by apeltzer 4 years ago
Here are the Sarek docs about the TSV input metadata format: https://github.com/SciLifeLab/Sarek/blob/master/docs/INPUT.md
Okay, let's add BAM and BAI as well so we can start from those too 👍
From my experience, I think we need to have 'grouping' columns for:
Does that make sense?
An alternative would be to replace point 2 with the sequencer - then we concatenate post-AR FASTQs only if the grouped sample is assigned to e.g. NextSeq (where you have a single injection onto the flowcell), and if assigned to HiSeq then merging of BAMs happens after mapping.
All three points are possible.
If you want to have 1., simply add a different ID for each line - then each line is treated independently :-)
If you want to have 2., simply add the same ID for each line and multiple internal samples. These will then be mapped individually and merged together afterwards to have all individual lanes in one sample :-)
That is also possible via a join - Sarek does that as well :-)
| ID | LabID | SeqType | Organism | Treatment | R1 | R2 | Population (ISO3) | Group | BAM | BAI |
|---|---|---|---|---|---|---|---|---|---|---|
| ABRAKADABRA | JK1945 | paired | Human | UDG+ | path/to/R1 | path/to/R2 | GER | LBK | path/to/bam | path/to/bai |
| ABRAKADABRA | JK1946 | single | Human | UDG- | path/to/R1 | | GER | LBK | path/to/bam | path/to/bai |
| HOKUSPOKUS | JK1948 | paired | Human | UDG0.5 | path/to/R1 | path/to/R2 | TRY | Anatolijan_Farmer | path/to/bam | path/to/bai |
ID = Identifier, basically the sample identifier to which multiple LabIDs might belong. Samples (see JK1945, JK1946) are, for example, a paired-end sequencing sample and a single-end sequencing sample that have been sequenced from the same bone sample "ABRAKADABRA" and can be merged together. If users want to keep these separate, they can specify different IDs; if they use the same ID, the samples are merged together after initial mapping and so on.
LabID = Internal sample id, e.g. sequencing library id. Multiple LabIDs can belong to the same "bone" or similar sample.
SeqType = Paired or Single end
Organism = might help in choosing a genome once people use iGenomes or similar resources at some point. Might not be required atm.
Treatment = UDG treatment Type
R1 = Forward read path
R2 = Reverse read path
Population (ISO3) = Population ISO3 code, used to choose samples for population genetics analysis (optional).
Group = Arbitrary free-text grouping information, can be used to group a set of samples in popgen (optional).
BAM = Path to a BAM file; R1/R2 are then ignored and the BAM is used for popgen or downstream analysis from scratch (remapping the file).
BAI = The BAM index required for doing the same.
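To make the merging rule concrete, here is a minimal sketch in Python (the pipeline itself is Nextflow; column subset and helper name are illustrative only) of grouping TSV rows by ID, so that rows sharing an ID are mapped individually and merged afterwards:

```python
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical TSV following the columns above (only a subset shown).
tsv = (
    "ID\tLabID\tSeqType\tR1\tR2\tBAM\n"
    "ABRAKADABRA\tJK1945\tpaired\tr1_a.fq.gz\tr2_a.fq.gz\t\n"
    "ABRAKADABRA\tJK1946\tsingle\tr1_b.fq.gz\t\t\n"
    "HOKUSPOKUS\tJK1948\tpaired\tr1_c.fq.gz\tr2_c.fq.gz\t\n"
)

def group_by_id(handle):
    """Group rows sharing an ID: these are mapped individually and merged
    afterwards. A non-empty BAM column would override R1/R2 entirely."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row["ID"]].append(row)
    return dict(groups)

groups = group_by_id(StringIO(tsv))
# ABRAKADABRA ends up with two libraries to merge post-mapping,
# HOKUSPOKUS with one.
```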
What about this?
Sample_Name | Library_ID | Lane | SeqType | Sequencer | Organism | Strandedness | UDG_Treatment | R1 | R2 | BAM | BAI | Custom_Col_1 | Custom_Col_2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sample_1 | Sample_1_Lib_1 | 1 | pairedend | nextseq | NA | double | full | /path/to/ | /path/to/ | /path/to/ | /path/to/ | Population_1 | Group_1 |
Sample_1 | Sample_1_Lib_1 | 2 | pairedend | nextseq | NA | double | full | /path/to/ | /path/to/ | /path/to/ | /path/to/ | Population_1 | Group_1 |
Sample_1 | Sample_1_Lib_1 | 3 | pairedend | nextseq | NA | double | full | /path/to/ | /path/to/ | /path/to/ | /path/to/ | Population_1 | Group_1 |
Sample_1 | Sample_1_Lib_1 | 4 | pairedend | nextseq | NA | double | full | /path/to/ | /path/to/ | /path/to/ | /path/to/ | Population_1 | Group_1 |
Sample_1 | Sample_1_Lib_2 | 1 | pairedend | nextseq | NA | double | full | /path/to/ | /path/to/ | /path/to/ | /path/to/ | Population_1 | Group_1 |
Sample_1 | Sample_1_Lib_2 | 2 | pairedend | nextseq | NA | double | full | /path/to/ | /path/to/ | /path/to/ | /path/to/ | Population_1 | Group_1 |
Sample_1 | Sample_1_Lib_2 | 3 | pairedend | nextseq | NA | double | full | /path/to/ | /path/to/ | /path/to/ | /path/to/ | Population_1 | Group_1 |
Sample_1 | Sample_1_Lib_2 | 4 | pairedend | nextseq | NA | double | full | /path/to/ | /path/to/ | /path/to/ | /path/to/ | Population_1 | Group_1 |
Sample_2 | Sample_2_Lib_1 | 1 | pairedend | hiseq | NA | double | half | /path/to/ | /path/to/ | /path/to/ | /path/to/ | Population_1 | Group_1 |
Sample_2 | Sample_2_Lib_2 | 2 | singleend | hiseq | NA | double | half | /path/to/ | NA | /path/to/ | /path/to/ | Population_1 | Group_1 |
Where each row of the table is adapter-removed as one unit (i.e. anything with the same Sample_Name, Library_ID and Lane).
Then:
The justification for splitting by machine is that on a NextSeq, a given library split across multiple lanes comes from a single injection (i.e. a single pipetting from the library aliquot) and so represents a single library 'sampling', whereas on a HiSeq each lane is a different injection (i.e. a different pipetting of the library aliquot) and thus represents a different 'sampling' from the library aliquot.
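The per-sequencer decision above can be sketched in Python (the pipeline itself is Nextflow; the function and return labels here are purely illustrative):

```python
def merge_point(sequencer: str) -> str:
    """Illustrative decision only: on a NextSeq all lanes come from one
    injection of the library, so FASTQs can be concatenated straight
    after adapter removal; on a HiSeq each lane is a separate injection,
    so lanes are only merged as BAMs after mapping."""
    if sequencer.strip().lower() == "nextseq":
        return "concat_fastq_after_adapter_removal"
    return "merge_bam_after_mapping"

def adapter_removal_unit(row: dict) -> tuple:
    # Each (Sample_Name, Library_ID, Lane) row is one adapter-removal job.
    return (row["Sample_Name"], row["Library_ID"], row["Lane"])
```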
However, all final (collapsed/concatenated/merged) files and MultiQC report entries are named by the 'Sample_Name'.
Custom columns can be defined as you like (e.g. the population/ISO3 stuff), and EAGER2 picks them up if they are required for specific modules.
I don't see the usefulness of the sequencer column, but other than that I'm fine with the proposal 👍
Custom Fields are fine too - but I'll have to define some that can be used by users and not allow arbitrary ones ;-)
As you already know I am a big advocate of design files as input because they are able to capture all of the information required for the pipeline. 😉 However, I think we should also maybe have a general discussion as to how to define the fields in tsv files. This is something that could become quite disparate across pipelines, and it would be good to try and keep the convention for standard fields/headings the same across as many pipelines as possible.
Here's one I am planning for the chipseq pipeline: https://github.com/drpatelh/chipseq/blob/master/docs/usage.md#--design
and the one I use for the atacseq pipeline: https://github.com/nf-core/atacseq/blob/master/docs/usage.md#--design
This works if you sequence the same library more than once because you can merge based on the replicate column.
Also, I was wondering whether it's absolutely necessary to have a .bai column, because these normally reside alongside the .bam files anyway? I'm assuming you are trying to account for this not being the case?
I agree that having this kind of thing streamlined is a good idea. The ID/group information is fine to use, but the group here would be for adding some downstream metadata, e.g. which population group a sample belonged to. So we'd need that as an extra column, not used the way it is in chipseq/atacseq. Still, the approach would be very similar and should work 👍
With respect to having lane IDs, I don't think that's necessary - if people want to keep these separately they can do that by having different IDs in the front.
@apeltzer not necessarily. I realise I wasn't clear before - the library/lane/sequencer information is important for DeDup.
Let's say one is doing deep sequencing of a sample. You have multiple extracts and libraries from the same sample; one of these is sequenced on a NextSeq, another is sequenced on one lane of a HiSeq, and another on a different lane. As the DNA is from a single sample/individual, you want the final output to be named with that (as indicated by ID in your original post).
However, to perform DeDup correctly (as intended, i.e. as PCR duplicate removal), this should only be performed on a single library, not on different libraries/extracts of the same sample - because then a duplicate is not actually a PCR duplicate but a biological one.
Therefore, you need to indicate which libraries are OK to concatenate after AdapterRemoval (with NextSeq), and then you need to indicate which libraries (in this case lanes or Lab_ID) to perform DeDup on. But how this is performed depends on what type of 'lane' your flowcell has.
Obviously one could argue that if it's still duplicated it's not 'useful', but it can be useful when a non-PCR duplicate could be derived from two related species (e.g. in metagenomics), which you may want to utilise for abundance calculations etc.
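A minimal sketch of that ordering (hypothetical helper names, not the pipeline's actual code): deduplicate each library's BAM on its own, and only then merge the deduplicated BAMs belonging to the same sample:

```python
from collections import defaultdict

def dedup_library(bam: str) -> str:
    # Placeholder for running DeDup on a single library's BAM.
    return bam.replace(".bam", ".dedup.bam")

def dedup_then_merge(rows):
    """PCR duplicates can only exist within one library, so duplicate
    removal runs per library; identical reads seen across libraries of
    the same sample are biological duplicates and must survive, which
    is why the per-sample merge happens only after deduplication."""
    per_sample = defaultdict(list)
    for row in rows:
        per_sample[row["sample"]].append(dedup_library(row["bam"]))
    # One list of deduplicated BAMs per sample, ready to merge.
    return {sample: sorted(bams) for sample, bams in per_sample.items()}

rows = [
    {"sample": "S1", "library": "L1", "bam": "S1_L1.bam"},
    {"sample": "S1", "library": "L2", "bam": "S1_L2.bam"},
    {"sample": "S2", "library": "L1", "bam": "S2_L1.bam"},
]
merged = dedup_then_merge(rows)
```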
That's a valid point @jfy133. However, if you want to make the most of this information you would need to populate the relevant fields in the Read Group for that sample. The simplest solution is to map each "run" of the sample individually adding the correct Read Group, merge libraries from the same sample, and then re-perform duplicate marking and removal: https://software.broadinstitute.org/gatk/documentation/article?id=6472
This will require quite a bit more to be implemented, because the current set-up takes --reads as input, which I assume means that multiple FASTQ files from the same sample are merged.
I have a similar implementation for the atacseq pipeline, but I took a more crude approach where multiple runs of the same sample are given a unique read group ID and subsequently merged.
I see. I guess maybe Alex can take over the discussion of the technical details here - if I understand it correctly, aDNA work does a slightly unusual series of steps compared to modern DNA because we often have such low coverage (but I'm an archaeologist who accidentally fell into bioinformatics, so what do I know ;) - thus I also may be getting confused in terminology at points).
I don't think the read group information is necessarily that important for the standard aDNA tools in EAGER (like in DeDup - correct me if I'm wrong @apeltzer). But if we are looking for a standardised 'design file' system then maybe that should be developed separately, then we 'hack' around it for aDNA contexts?
Hi both,
there are quite a few different steps for aDNA, also with respect to duplicate removal, which we typically don't do with MarkDuplicates but with DeDup (my own tool), which takes both start and end positions into account (see Peltzer et al. 2016 for details). The problem is indeed that DeDup cannot take read groups into account at the moment - so merging everything based on read groups into a single BAM and then deduplicating per read group there isn't possible right now. The only prospect I see is to dedup first and then merge the resulting BAM files in all cases - which will be correct for both HiSeq and NextSeq output in my opinion.
I would like to be as similar as possible to other design files but if there are some specific parts missing for aDNA, we'd have to add these - chipseq/atacseq might have to at some point too 👍
Ehh, but not if the same library (where both will contain the same PCR duplicates) was sequenced on a HiSeq and NextSeq separately, if I understand correctly?
However, I think we are also going into too much detail for a widely applicable template for all nf-core pipelines.
I think the discussion should be pushed somewhere else and more generic (or dynamic even) terms defined there.
Additional note: if we keep the --reads functionality with a dummy internal map creation, we should also revert to the correct behaviour of using --paired_end as a Boolean flag, i.e. if supplied, run merging; if not supplied, assume single-end and don't run merging.
Pointed out by @drpatelh, as the current behaviour doesn't follow common nf-core practice.
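A sketch of that fallback behaviour in Python (field names here are illustrative, not the pipeline's real channel layout): build a dummy metadata map from the --reads file list, with --paired_end as a plain Boolean deciding whether read merging is attempted:

```python
import re

def reads_to_dummy_map(fastqs, paired_end=False):
    """When no TSV is supplied, derive minimal metadata from the
    --reads file list; fields that cannot be known stay unset so
    downstream popgen steps would be skipped."""
    rows = []
    for fq in fastqs:
        # Strip a trailing _R1/_R2 and FASTQ extension to get a sample name.
        sample = re.sub(r"(_R[12])?\.(fastq|fq)(\.gz)?$", "", fq)
        rows.append({
            "sample": sample,
            "fastq": fq,
            "seqtype": "paired" if paired_end else "single",
            "run_merging": paired_end,  # only PE data gets read merging
        })
    return rows

rows = reads_to_dummy_map(
    ["sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz"], paired_end=True
)
```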
Note to self - tests to add to the tsv-input branch:
Need to throw a warning when trying to merge lanes or libraries that are a combination of PE and SE libraries, as PE DeDuping will not be performed correctly!
Update: 2020-04-15
Lane and library merging now working correctly (grouping by all metadata categories but lane; and all but library ID respectively)
Configurations validated:
--run_convertbam specified

Additional thoughts:
Thanks for all of your work on this! I'm a new eager user and have been testing this branch, as I have some multi-lane samples I need to run. Wouldn't it make sense in most cases to leave the lane merging until after mapping, for performance reasons? I.e. AdapterRemoval -> Mapping -> LaneMerge -> DeDup/MarkDuplicates
I suppose an alternative may be to just bump up the resource label on the mapping steps, but I think it would offer less flexibility.
Hey Kevin,
Thanks for testing it out and looking a bit deeper. We appreciate the feedback and input.
There are no major reasons why, but I guess they would be
1) Legacy reasons - EAGER 1 actually merged lanes prior to read clipping. So we didn't want to change too much from that, while at the same time maximising the amount of sequencing QC checks that can be performed.
2) Based on our experience, even when mapping with all lanes merged vs. separated, the efficiency improvement is very minor in the majority of cases - at least for us, where most libraries are sequenced to 5-10 million reads. Merging early also reduces the flooding of schedulers with thousands of jobs.
3) A lot of people in our field run downloaded public data from repositories that is already lane-merged. So for a run as a whole the efficiency improvement is also not much, as this lane information is already lost and the data merged anyway. Merging early is more consistent with that.
In terms of resources during mapping - if you are regularly going over the limits of the default parameters, we can investigate whether others hit the same issue and consider bumping them. On the other hand, you can always make your own profile (see nf-core/configs) in which you customise the resource settings for each process yourself (and thus adapt to your own routine data).
Furthermore, if you have strong reasons why lane merging should happen after mapping, or we get more reports that it is an issue, we can consider changing it. But for this we need more users trying it out, and that will come once this functionality is released and a paper is published.
Cheers
Thanks James, that makes sense. I ended up just bumping up the resources on bwamem and with the built-in multithreading it performed well. I've been running everything on Google Life Sciences and thought I might end up IO-bound, but even with 32 vCPUs I was able to utilize them all.
My normal use cases are admittedly probably quite different than what the pipeline was intended for. I am often working with samples from unidentified human remains and other forensic samples. They are modern, but often quite degraded, heavily contaminated, and often low input with a fairly high duplicate rate. I am sometimes working with 1 billion or more reads across four lanes on a NovaSeq for a single sample. So for example, I just had DeDup run out of memory even after bumping it to 16G. :)
A lot of the tweaks I've had to make probably aren't generally applicable, but I have run into some issues with file staging using GLS (mainly with the optional poststat and r2 files). I've kind of hacked around it for now, but if I come up with an elegant solution I'll submit a PR in case it helps anyone else.
That would be highly appreciated - thanks for getting back to us with info on running this on the Google Life Sciences API! We couldn't test this so far, so we are quite glad it apparently worked quite well :-)
Note that we/you could also have specific configuration files for adjusting parameters of the pipelines with defaults for your specific infrastructure via nf-core/configs:
https://github.com/nf-core/configs/blob/master/pipeline/eager.config
(e.g. the institute that James works with and I have been working with are having their own set of defaults for the pipeline - you could also have your own ;-)).
Like Alex said - that's really cool to hear it works on GLS. We've not had anyone testing on cloud systems yet.
That sounds great! Glad it (sort of) worked. Out of curiosity, why did you have to bump DeDup's memory - did the auto-resubmission not work? Or are you trying to avoid that for runtime/cost reasons?
Also, what staging issues were you having with the poststat files? If there is anything we can quickly fix, we can integrate it before the release of 2.2 (with the TSV input).
But of course PRs always gratefully welcomed.
Maybe you could add a new issue called 'GLS compatibility' or something and we can continue discussion there to keep on topic here.
Thanks for pointing out the institution-specific config files @apeltzer. I'm fairly new to Nextflow and the nf-core pipelines so I've been modifying base.conf but I'll have to go back and clean things up by moving everything into my own config.
I'm still trying to use DeDup as I'd prefer it for the better handling of the merged reads, but I've been having some issues with speed on larger datasets. I'm running a BAM now with ~787M reads and after 45 minutes it's only treated about 1.6M. I'll open an issue in the DeDup repo to see if maybe there's anything that can be worked out there.
@jfy133 I'll open a separate issue now for the GLS compatibility!
Posted in the https://github.com/apeltzer/DeDup/issues/9 issue that it might make sense to have DeDup parallelized on a per-chromosome level before the next bigger release. That would make it feasible to split BAMs per Chromosome, then run on each chunk and remerge the BAMs again :-)
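That scatter/gather idea could look roughly like this (a purely illustrative plan in Python; the real splitting would presumably use per-chromosome samtools views and a final samtools merge):

```python
def per_chromosome_dedup_plan(bam_prefix: str, chromosomes):
    """Plan the proposed parallelisation: split the BAM per chromosome,
    run DeDup on every chunk independently, then re-merge. Returns the
    deduplicated chunk names that the final merge would consume."""
    deduped = []
    for chrom in chromosomes:
        chunk = f"{bam_prefix}.{chrom}.bam"            # per-chromosome split
        deduped.append(chunk.replace(".bam", ".dedup.bam"))  # DeDup per chunk
    return deduped  # a merge step would combine these again

plan = per_chromosome_dedup_plan("input", ["chr1", "chr2"])
```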
Came across another bug with the multi-lane processing.
```
May-08 20:25:43.067 [Actor Thread 130] ERROR nextflow.processor.TaskProcessor - Error executing process > 'multiqc (1)'
Caused by:
  Process `multiqc` input file name collision -- There are multiple input files for each of the following file names: fastp/EX1_fastp.json
```
Looks to be due to library_id being used for the naming of the json output without referencing the lane.
https://github.com/nf-core/eager/blob/bc7c58efc72a2466cffd80c6802e6d6c3cee4d3f/main.nf#L892
There also seems to be something possibly non-deterministic that causes the lanemerge and fastqc_after_clipping to often re-run on resume. Haven't dug in too far yet though.
> Came across another bug with the multi-lane processing.
>
> May-08 20:25:43.067 [Actor Thread 130] ERROR nextflow.processor.TaskProcessor - Error executing process > 'multiqc (1)' Caused by: Process `multiqc` input file name collision -- There are multiple input files for each of the following file names: fastp/EX1_fastp.json
>
> Looks to be due to library_id being used for the naming of the json output without referencing the lane.
> https://github.com/nf-core/eager/blob/bc7c58efc72a2466cffd80c6802e6d6c3cee4d3f/main.nf#L892
Hopefully fixed in: https://github.com/nf-core/eager/commit/10693e765f2b04691cd64d96f01da4a1ebc30c32. Let me know if that works
> There also seems to be something possibly non-deterministic that causes the lanemerge and fastqc_after_clipping to often re-run on resume. Haven't dug in too far yet though.
I've also noticed this occasionally in the past with FASTQC, but I could never identify the issue. If you find out I'd be grateful to know why this is!
@jfy133 Thanks! Haven't had a chance yet to try the fixes, but I think I may have tracked down the bug causing the issues on resume. When looking at the command.sh file for one of the new fastqc_after_clipping processes, it seems r1 and r2 have been swapped:
```bash
#!/bin/bash -euo pipefail
fastqc -t 2 -q EX1-noqc_S1_L005_R1_001.fastq.pG.fq.pe.pair2.truncated.gz EX1-noqc_S1_L005_R1_001.fastq.pG.fq.pe.pair1.truncated.gz
```
I believe this is due to the mix here, as the Nextflow documentation seems to say the ordering of the output is random.
https://github.com/nf-core/eager/blob/10693e765f2b04691cd64d96f01da4a1ebc30c32/main.nf#L1024
Seems like the easiest fix might be to just add a sort() to it[7] when assigning r1/r2?
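The effect of such a sort() can be shown in plain Python: asynchronous completion may deliver pair2 before pair1, and a lexicographic sort restores the expected R1/R2 order (filenames are just examples in the style of the log above):

```python
# Filenames as they might arrive from an asynchronous channel (pair2 first).
arrived = [
    "EX1_S1_L005.pe.pair2.truncated.gz",
    "EX1_S1_L005.pe.pair1.truncated.gz",
]

# Sorting makes the assignment deterministic: 'pair1' sorts before 'pair2'.
r1, r2 = sorted(arrived)
```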
Good catch! Something I wouldn't have caught as personally never skip merging. Appreciate the testing!
You're correct about the Nextflow docs - not random per se, just async: once something completes it gets pushed to the next process straight away; it doesn't wait for any related 'pairs' (unless combined into a single object, e.g. a tuple). So yeah, I guess random based on millisecond process differences...
Feel free to make a PR (into the tsv-input branch), otherwise I'll try it out tomorrow.
@lordkev Added here, as you suggested. https://github.com/nf-core/eager/commit/e3477b4f8b6ad710c06694fdd193dadc124afe15
Could you have a look when you have time?
All done in https://github.com/nf-core/eager/pull/396
For any other problems, please make a new issue.
I'd like to have an optional way to input a custom design via a TSV-formatted metadata sheet. The idea is that this will be used in all cases: when you don't supply this TSV file, it will in future be created based on your regular --reads input - of course this means with missing values (thus skipping e.g. downstream popgen tools analysis).

Idea for metadata, basically based on what Sarek does:
Comments ?