nf-core / eager

A fully reproducible and state-of-the-art ancient DNA analysis pipeline
https://nf-co.re/eager
MIT License
148 stars 82 forks source link

Metadata Sheet Discussion & Multilane merging #209

Closed apeltzer closed 4 years ago

apeltzer commented 5 years ago

I'd like to have an optional way to input a custom design via a TSV formatted metadata sheet. The idea is, that this will be used in all cases. When you don't supply this TSV file in the future, it will be created based on your regular --reads input, of course, this means with missing values (thus skipping e.g. downstream popgen tools analysis).

Idea for metadata, basically based on what Sarek does:

ID  LabID   SeqType Organism    Treatment    R1  R2  Population (ISO3)   Group   
ABRAKADABRA JK1945  paired  UDG+ path/to/R1  path/to/R2  GER LBK
ABRAKADABRA JK1946  single  UDG- path/to/R1      GER LBK
HOKUSPOKUS  JK1948  paired  UDG0.5 path/to/R1  path/to/R2  TRY Anatolijan_Farmer

Comments ?

maxulysse commented 5 years ago

here's Sarek docs about our the TSV file input metadata: https://github.com/SciLifeLab/Sarek/blob/master/docs/INPUT.md

apeltzer commented 5 years ago

Okay, lets add BAM and BAI as well so we can start from that as well 👍

jfy133 commented 5 years ago

From my experience, I think we need to have 'grouping' columns for:

  1. Grouping per read pair (merging read pairs together)
  2. Grouping per lane (concatenating lane-wise after adaptor removal, post-AR FASTQC running prior concatenation)
  3. Grouping per sample (concatenating mapped BAMs of the same sample that has multiple libraries)

Does that make sense?

jfy133 commented 5 years ago

An alternative would be to replace point 2 with sequencer - and we concatenate post-AR FASTQs only if the grouped sample is assigned to e.g. NextSeq (where you have a single injection onto the flowcell), and if assigned to HiSeq then merging of BAMs happen after mapping.

apeltzer commented 5 years ago

All three points are possible.

If you want to have 1., simply add a different ID for each line - then each line is treat indepdently :-) If you want to have 2, simply add the same ID for each line and multiple internal samples. These will then be merged together after mapping individually to have all indiivdual lanes in one sample :-) Is also possible via a join - Sarek does that as well :-)

apeltzer commented 5 years ago
ID  LabID   SeqType Organism    Treatment    R1  R2  Population (ISO3)   Group   BAM    BAI 
ABRAKADABRA JK1945  paired Human UDG+ path/to/R1  path/to/R2  GER LBK path/to/bam path/to/bai
ABRAKADABRA JK1946  single Human UDG- path/to/R1      GER LBK path/to/bam path/to/bai
HOKUSPOKUS  JK1948  paired Human UDG0.5 path/to/R1  path/to/R2  TRY Anatolijan_Farmer path/to/bam path/to/bai

ID = Identifier, basically the sample identifier to which multiple LabIDs might belong to. Samples (see JK1945, JK1946) are for example a paired end sequencing sample and a single end sequencing sample that have been sequenced from the same bone sample "ABRAKADABRA" and can be merged together. If users want to keep this separate, they can specify different IDs - if they use the same ID, the samples are merged together after initial mapping etc pp.

LabID = Internal sample id, e.g. sequencing library id. Multiple LabIDs can belong to the same "bone" or similar sample.

SeqType = Paired or Single end

Organism = might help in choosing a genome once people use iGenomes or similar resources at some point. Might not be required atm.

Treatment = UDG treatment Type

R1 = Forward read path R2 = reverse Read path

Populatino ISO 3 = Population ISO3 Code, to choose for population genetics analysis (optional). Group = Arbitrary free text grouping information, can be used to group a set of samples in pop gen (optional) BAM = Path to BAM file, will ignore R1/R2 then and use this for popgen or downstream analysis from scratch (remap the file) BAI = index required for doing the same

jfy133 commented 5 years ago

What about this?

Sample_Name Library_ID Lane SeqType Sequencer Organism Strandedness UDG_Treatment R1 R2 BAM BAI Custom_Col_1 Custom_Col_2
Sample_1 Sample_1_Lib_1 1 pairedend nextseq NA double full /path/to/ /path/to/ /path/to/ /path/to/ Population_1 Group_1
Sample_1 Sample_1_Lib_1 2 pairedend nextseq NA double full /path/to/ /path/to/ /path/to/ /path/to/ Population_1 Group_1
Sample_1 Sample_1_Lib_1 3 pairedend nextseq NA double full /path/to/ /path/to/ /path/to/ /path/to/ Population_1 Group_1
Sample_1 Sample_1_Lib_1 4 pairedend nextseq NA double full /path/to/ /path/to/ /path/to/ /path/to/ Population_1 Group_1
Sample_1 Sample_1_Lib_2 1 pairedend nextseq NA double full /path/to/ /path/to/ /path/to/ /path/to/ Population_1 Group_1
Sample_1 Sample_1_Lib_2 2 pairedend nextseq NA double full /path/to/ /path/to/ /path/to/ /path/to/ Population_1 Group_1
Sample_1 Sample_1_Lib_2 3 pairedend nextseq NA double full /path/to/ /path/to/ /path/to/ /path/to/ Population_1 Group_1
Sample_1 Sample_1_Lib_2 4 pairedend nextseq NA double full /path/to/ /path/to/ /path/to/ /path/to/ Population_1 Group_1
Sample_2 Sample_2_Lib_1 1 pairedend hiseq NA double half /path/to/ /path/to/ /path/to/ /path/to/ Population_1 Group_1
Sample_2 Sample_2_Lib_2 2 singleend hiseq NA double half /path/to/ NA /path/to/ /path/to/ Population_1 Group_1

Where each row of the table is adapter removal'ed together (i.e. anything with the same Sample_ID, Library ID and Lane).

Then:

The justification of the splitting by machine is that a given library that is split across multiple lanes comes from a single injection (i.e. single pipetting from the library aliquot) and represents a single library 'sampling', whereas for HiSeq, each 'lane' is from a different injection (i.e. different pipetting of the library aliquot) thus represents different 'samplings' from the library aliquot.

However all final (collapsed/concatenated/merged) files and MultiQC report info's are named by the 'Sample_Name'.

Custom columns are defined as you like (e.g. the population/ISO3 stuff), which EAGER2 pick ups if they are required for specific modules

apeltzer commented 5 years ago

I don't see the usefulness for sequencer but other than that I'm fine with the proposal 👍

Custom Fields are fine too - but I'll have to define some that can be used by users and not allow arbitrary ones ;-)

drpatelh commented 5 years ago

As you already know I am a big advocate of design files as input because they are able to capture all of the information required for the pipeline. 😉 However, I think we should also maybe have a general discussion as to how to define the fields in tsv files. This is something that could become quite disparate across pipelines, and it would be good to try and keep the convention for standard fields/headings the same across as many pipelines as possible.

Heres one I am planning for the chipseq pipeline: https://github.com/drpatelh/chipseq/blob/master/docs/usage.md#--design

and the one I use for the atacseq pipeline: https://github.com/nf-core/atacseq/blob/master/docs/usage.md#--design

This works if you sequence the same library more than once because you can merge based on the replicate column.

Also, I was wondering whether its absolutely necessary to have a .bai column because these normally reside with the .bam files anyway? Im assuming you are trying to account for this not being the case?

apeltzer commented 5 years ago

I agree that having this kind of streamlined is a good idea. The ID/group information is fine to be used, but the group here would be for adding some downstream metadata, e.g. which population group a sample belonged to. So we'd need that as an extra column and not like it's used in chipseq/atacseq. Still, the approach would be very similar and should work 👍

With respect to having lane IDs, I don't think that's necessary - if people want to keep these separately they can do that by having different IDs in the front.

jfy133 commented 5 years ago

@apeltzer not necessarily. I realise I wasn't clear before - the lane library/lane/sequencer information is important for DeDup.

Lets say one is doing deep sequencing of a sample. You have multiple extracts and libraries from the same sample, one of these is sequenced on a NextSeq, another is sequenced on one lane of a HiSeq, and another on a different lane. As the DNA is from a single sample/individual you want the final output to be named with that (as indicated in ID in your original post).

However, to perform Dedup correctly (as intended i.e. as a PCR duplicate removal) - this should only be performed on a single library, not on different libraries/extracts of the same sample - because then this is not actually a PCR duplicate but a biological one.

Therefore, you need to indicate which libraries are OK to concatenate after adapterRemoval (with NextSeq), then you need to indicate which libraries (in this case lanes or Lab_ID) to perform DeDup on. but how this is performed depends on what type of 'lane' your flowcell has.

Obviously one could argue if it's still duplicated it's not 'useful' but it can be useful if the non-PCR related duplicate could be derived from two related species (e.g. in metagenomics), which you may want to utilise for abundance calculations etc.

drpatelh commented 5 years ago

Thats a valid point @jfy133 . However, if you want to make the most of this information you would need to populate the relevant fields in the Read Group for that sample. The simplest solution is to map each "run" of the sample individually adding the correct Read Group, merge libraries from the same sample, and then to re-perform duplicate marking and removal: https://software.broadinstitute.org/gatk/documentation/article?id=6472

This will require quite a bit more to be implemented because the current set-up takes --reads as input which I assume means that multiple fastq files from the same sample are merged.

I have a similar implementation for the atacseq pipeline but I took a more crude approach where multiple runs of the same sample are given a unique read group ID, and subsequently merged.

jfy133 commented 5 years ago

I see. I guess maybe Alex can take over discussion with the technical details here - if I understand it aDNA work does a slightly unusual series of steps compared to modern DNA because we often have such low coverage (but I'm an archaeologist who accidentally fell into bioinformatics, so what do I know ;) - thus I also may be getting confused in terminology at points).

I don't think the read group information is necessarily that important for the standard aDNA tools in EAGER (like in DeDup - correct me if I'm wrong @apeltzer). But if we are looking for a standardised 'design file' system then maybe that should be developed separately, then we 'hack' around it for aDNA contexts?

apeltzer commented 5 years ago

Hi both,

there are quite some different steps for aDNA, also with respect to duplicate removal which we typically don't do with MarkDuplicates but utilize DeDup (my own tool) that takes both start and end positions into account (read Peltzer et al 2016 for details on that). The problem is indeed, that DeDup cannot take read groups into account at the moment - so merging all based on read groups into a single BAM and then dedup based on read group information there isn't possible at the moment. The only prospect I see there is to dedup first and then merge the subsequent BAM files in all cases - which will be correct for both hiSeq and NextSeq output in my opinion.

I would like to be as similar as possible to other design files but if there are some specific parts missing for aDNA, we'd have to add these - chipseq/atacseq might have to at some point too 👍

jfy133 commented 5 years ago

Ehh, but not if the same library (where both will contain the same PCR duplicates) was sequenced on a HiSeq and NextSeq separately, if I understand correctly?

However, I think we are also going into too much detail for a widely applicable template for all nf-core pipelines.

I think the discussion should be pushed somewhere else and more generic (or dynamic even) terms defined there.

jfy133 commented 4 years ago

Additional note: if we keep --reads functionality with a dummy internal map creation, we should also revert to the the correct behaviour of using --paired_end as a Boolean flag. I.e. if supplied run merging, if not supplied assume single end and don't run merging.

Pointed out by @drpatelh as it doesn't follow the common nf-core practise.

jfy133 commented 4 years ago

Note to self - tests to add to tsv-input branch

Need to throw warning when trying to merge lanes or libraries that are a combination of PE and SE libraries that PE DeDuping will not be performed correctly!

Update: 2020-04-15

Lane and library merging now working correctly (grouping by all metadata categories but lane; and all but library ID respectively)

Configurations validated:

Additional thoughts:

lordkev commented 4 years ago

Thanks for all of your work on this! I'm a new eager user and have been testing this branch as I have some multi-lane samples I need to run. Wouldn't it make sense in most cases to leave the lane merging until after mapping for performance reasons? ie. AdapterRemoval -> Mapping -> LaneMerge -> DeDup/MarkDuplicates

I suppose an alternative may be to just bump up the resource label on the mapping steps, but I think it would offer less flexibility.

jfy133 commented 4 years ago

Hey Kevin,

Thanks for testing it out and looking a bit deeper. We appreciate the feedback and input.

There are no major reasons why, but I guess they would be

1) legacy reasons - EAGER 1 actually merged lanes prior read clipping. So we didn't want to change too much from that, but at the same time maximise the amount of sequencing QC checks that can be performed.

2) Based on our experience(s), even when mapping with all the lanes merged Vs separated, the efficiency improvement is very minor in the major of cases - at least for us where most libraries are sequenced 5-10 million reads. It also reduces the flooding of schedulers with thousands of jobs.

3) a lot of people in our field run downloaded public data from repositories that are already lane merged. So for a run as a whole, the efficiency improvement is also not much, as this lane information is lost and run merged anyway. So it is more consistent with that.

In terms of resources during mapping - if you are regularly going over the limits of the default parameters, we can investigate if others hit the same issue and we can consider bumping it. On the Otherhand, you can always make your own profile (see nf-core/configs) in which you can customise the resource settings for each process for yourself (and thus adapt to your own routine data).

Furthermore If you have strong reasons why lane merging should happen after mapping, or we get more reports that it is an issue, we can consider changing it. But for this we sort of need more user trying it out and that will come when this functionality is released and a paper published.

Cheers

lordkev commented 4 years ago

Thanks James, that makes sense. I ended up just bumping up the resources on bwamem and with the built-in multithreading it performed well. I've been running everything on Google Life Sciences and thought I might end up IO-bound, but even with 32 vCPUs I was able to utilize them all.

My normal use cases are admittedly probably quite different than what the pipeline was intended for. I am often working with samples from unidentified human remains and other forensic samples. They are modern, but often quite degraded, heavily contaminated, and often low input with a fairly high duplicate rate. I am sometimes working with 1 billion or more reads across four lanes on a NovaSeq for a single sample. So for example, I just had DeDup run out of memory even after bumping it to 16G. :)

A lot of the tweaks I've had to make probably aren't generally applicable, but I have run into some issues with file staging using GLS (mainly with the optional poststat and r2 files). I've kind of hacked around it for now, but if I come up with an elegant solution I'll submit a PR in case it helps anyone else.

apeltzer commented 4 years ago

That would be highly appreciated - thanks for getting back with info to us on also running this on the Google Life Sciences API! We couldn't test this so far, so are quite glad it worked apparently quite fine? :-)

Note that we/you could also have specific configuration files for adjusting parameters of the pipelines with defaults for your specific infrastructure via nf-core/configs:

https://github.com/nf-core/configs/blob/master/pipeline/eager.config

(e.g. the institute that James works with and I have been working with are having their own set of defaults for the pipeline - you could also have your own ;-)).

jfy133 commented 4 years ago

Thanks James, that makes sense. I ended up just bumping up the resources on bwamem and with the built-in multithreading it performed well. I've been running everything on Google Life Sciences and thought I might end up IO-bound, but even with 32 vCPUs I was able to utilize them all.

My normal use cases are admittedly probably quite different than what the pipeline was intended for. I am often working with samples from unidentified human remains and other forensic samples. They are modern, but often quite degraded, heavily contaminated, and often low input with a fairly high duplicate rate. I am sometimes working with 1 billion or more reads across four lanes on a NovaSeq for a single sample. So for example, I just had DeDup run out of memory even after bumping it to 16G. :)

A lot of the tweaks I've had to make probably aren't generally applicable, but I have run into some issues with file staging using GLS (mainly with the optional poststat and r2 files). I've kind of hacked around it for now, but if I come up with an elegant solution I'll submit a PR in case it helps anyone else.

Like Alex said- that's really cool to hear it works on GLS. WE've not had any one testing on cloud systems yet.

That sounds great! Glad it (sort of) worked. Out of curiousity, why dd you have to dump DeDup, did the auto-resubmission not work? Or are you trying to avoid that for runtime/cost issues?

Also, what staging issues were you having with the poststatfiles? If there is anything we can quickly fix we can integrate this before release of 2.2 (with the TSV input).

But of course PRs always gratefully welcomed.

Maybe you could add a new issue called 'GLS compatibility' or something and we can continue discussion there to keep on topic here.

lordkev commented 4 years ago

Thanks for pointing out the institution-specific config files @apeltzer. I'm fairly new to Nextflow and the nf-core pipelines so I've been modifying base.conf but I'll have to go back and clean things up by moving everything into my own config.

I'm still trying to use DeDup as I'd prefer it for the better handling of the merged reads, but I've been having some issues with speed on larger datasets. I'm running a BAM now with ~787M reads and after 45 minutes it's only treated about 1.6M. I'll open an issue in the DeDup repo to see if maybe there's anything that can be worked out there.

@jfy133 I'll open a separate issue now for the GLS compatibility!

apeltzer commented 4 years ago

Posted in the https://github.com/apeltzer/DeDup/issues/9 issue that it might make sense to have DeDup parallelized on a per-chromosome level before the next bigger release. That would make it feasible to split BAMs per Chromosome, then run on each chunk and remerge the BAMs again :-)

30

lordkev commented 4 years ago

Came across another bug with the multi-lane processing.

May-08 20:25:43.067 [Actor Thread 130] ERROR nextflow.processor.TaskProcessor - Error executing process > 'multiqc (1)'

Caused by:
  Process `multiqc` input file name collision -- There are multiple input files for each of the following file names: fastp/EX1_fastp.json

Looks to be due to library_id being used for the naming of the json output without referencing the lane.

https://github.com/nf-core/eager/blob/bc7c58efc72a2466cffd80c6802e6d6c3cee4d3f/main.nf#L892

lordkev commented 4 years ago

There also seems to be something possibly non-deterministic that causes the lanemerge and fastqc_after_clipping to often re-run on resume. Haven't dug in too far yet though.

jfy133 commented 4 years ago

Came across another bug with the multi-lane processing.

May-08 20:25:43.067 [Actor Thread 130] ERROR nextflow.processor.TaskProcessor - Error executing process > 'multiqc (1)'

Caused by:
  Process `multiqc` input file name collision -- There are multiple input files for each of the following file names: fastp/EX1_fastp.json

Looks to be due to library_id being used for the naming of the json output without referencing the lane.

https://github.com/nf-core/eager/blob/bc7c58efc72a2466cffd80c6802e6d6c3cee4d3f/main.nf#L892

Hopefully fixed in: https://github.com/nf-core/eager/commit/10693e765f2b04691cd64d96f01da4a1ebc30c32. Let me know if that works

There also seems to be something possibly non-deterministic that causes the lanemerge and fastqc_after_clipping to often re-run on resume. Haven't dug in too far yet though.

I've also noticed this occasionally in the past with FASTQC, but I could never identify the issue. If you find out I'd be grateful to know why this is!

lordkev commented 4 years ago

@jfy133 Thanks! Haven't had a chance yet to try the fixes, but I think I may have tracked down the bug causing the issues on resume. When looking at the command.sh file for one of the new fastqc_after_clipping processes, it seems r1 and r2 have been swapped:

#!/bin/bash -euo pipefail
fastqc -t 2 -q EX1-noqc_S1_L005_R1_001.fastq.pG.fq.pe.pair2.truncated.gz EX1-noqc_S1_L005_R1_001.fastq.pG.fq.pe.pair1.truncated.gz

I believe this is due to the mix here, as the Nextflow documentation seems to say the ordering of the output is random.

https://github.com/nf-core/eager/blob/10693e765f2b04691cd64d96f01da4a1ebc30c32/main.nf#L1024

lordkev commented 4 years ago

Seems like the easiest fix might be to just add a sort() to it[7] when assigning r1/r2?

jfy133 commented 4 years ago

Good catch! Something I wouldn't have caught as personally never skip merging. Appreciate the testing!

You're correct about the Nextflow Docs, not random per se, just async, once something completes it gets pushed to the next process straight away, it doesn't wait for any related 'pairs' (unless combined into a single object, e.g. a tuple). So yeah I guess random based on milisecond process differences...

Feel free to make a PR (into branch tsv-input), otherwise I'll try it out tomorrow.

jfy133 commented 4 years ago

@lordkev Added here, as you suggested. https://github.com/nf-core/eager/commit/e3477b4f8b6ad710c06694fdd193dadc124afe15

Could you have a look when you have time?

jfy133 commented 4 years ago

All done in https://github.com/nf-core/eager/pull/396

For any other problems, please make a new issue.