nf-core / modules

Repository to host tool-specific module files for the Nextflow DSL2 community!
https://nf-co.re/modules
MIT License

new module: dragen #4026

Open marrip opened 9 months ago

marrip commented 9 months ago

Is there an existing module for this?

Is there an open PR for this?

Is there an open issue for this?

Are you going to work on this?

DNA

### Tasks
- [x] Provide ped for [trio](https://github.com/genome-in-a-bottle/giab_data_indexes) @marrip
- [x] Run trio analysis on [HG002, HG003, HG004](https://github.com/genome-in-a-bottle/giab_data_indexes) @xuyangyuio
- [x] Add remaining output files to Dragen module @marrip

RNA

### Tasks
- [x] Check if we already have a list of output files @asr081
- [ ] Check if `.gtf` file is present in Dragen references folder @xuyangyuio
- [x] Find RNA samples @marrip
- [ ] Get access to Dragen @marrip
- [ ] Run test data @marrip

Methylation

### Tasks
- [x] Evaluate [test data](https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR24827403&display=download) @asr081
- [ ] Run test data @marrip

Amplicon

### Tasks

Single Cell

### Tasks

marrip commented 9 months ago

Options we might not need:

  --fastq-list arg                                        CSV file specifying list of FASTQs for input
  --fastq-list-sample-id arg                              Only process entries whose 'RGSM' entry matches the given Sample ID parameter (for fastq-list.csv input)
  --fastq-list-all-samples arg                            Process all samples in fastq-list file, even when there are multiple 'RGSM' (Sample ID) values
  --tumor-fastq-list arg                                  CSV file specifying list of tumor FASTQs for input
  --tumor-fastq-list-sample-id arg                        Only process entries in the tumor fastq list input whose 'RGSM' entry matches the given Sample ID parameter (for fastq-list.csv input)
  --tumor-fastq-list-all-samples arg                      Process all samples in tumor-fastq-list file, even when there are multiple 'RGSM' (Sample ID) values
  --variant-list arg                                      file specifying list of Variants for input
  --output-file-prefix arg                                Output filename prefix
  --run-info arg                                          Path to RunInfo.xml file (default root of BCL input directory)
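
To illustrate what `--fastq-list-sample-id` describes (only entries whose 'RGSM' matches are processed), here is a small sketch that filters a fastq list by `RGSM`. The column names other than `RGSM` and the file names are assumptions for the example, and `select_rows` is a hypothetical helper, not part of Dragen.

```python
import csv
import io

# Example fastq list. Only the RGSM column is named in the help text above;
# the other columns and all values are assumptions for illustration.
EXAMPLE = """RGID,RGSM,RGLB,Lane,Read1File,Read2File
rg1,sampleA,lib1,1,a_L001_R1.fastq.gz,a_L001_R2.fastq.gz
rg2,sampleA,lib1,2,a_L002_R1.fastq.gz,a_L002_R2.fastq.gz
rg3,sampleB,lib1,1,b_L001_R1.fastq.gz,b_L001_R2.fastq.gz
"""


def select_rows(csv_text, sample_id=None):
    """Keep rows whose RGSM equals sample_id; keep all rows if it is None."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if sample_id is None:
        return rows
    return [row for row in rows if row["RGSM"] == sample_id]


if __name__ == "__main__":
    for row in select_rows(EXAMPLE, "sampleA"):
        print(row["RGID"], row["Read1File"], row["Read2File"])
```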

marrip commented 9 months ago

Input:

  -1 [ --fastq-file1 ] arg                                FASTQ file to send to card (may be gzipped)
  -2 [ --fastq-file2 ] arg                                Second FASTQ file with paired-end reads (may be gzipped)
  --tumor-fastq1 arg                                      FASTQ file of tumor reads for somatic mode
  --tumor-fastq2 arg                                      Second FASTQ file of tumor reads for somatic mode
  -b [ --bam-input ] arg                                  Input BAM file to send to card
  --tumor-bam-input arg                                   Input BAM file of tumor reads for somatic mode
  --ml-recalibration-input-vcf arg                        A VCF or gVCF file containing small variant calls to recalibrate
  --bcl-input-directory arg                               Input BCL directory for BCL conversion (must be specified for BCL input)
  --sample-sheet arg                                      For BCL input, path to SampleSheet.csv file (default searched for in --bcl-input-directory)
  -a [ --annotation-file ] arg                            Transcript annotation file (RNA)
  --enable-rna-gene-fusion arg                            Enable the RNA gene fusion detection algorithm
  --rna-gf-input-file arg                                 Input chimeric junctions file, for standalone RNA gene fusion
  --rna-gf-restrict-genes arg                             Ignore genes with biotype other than protein coding or lncRNA for gene fusions
  --amplicon-target-bed arg                               The DNA amplicon target regions in bed format (required 4th column is amplicon name, optional 5th column is GeneID)
  --repeat-genotype-specs arg                             Repeat variant catalog file
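
A rough sketch of how a module might assemble a command line from the input options, using only flags that appear in the help output in this thread. The executable name `dragen`, the helper `build_dragen_cmd`, and all paths are assumptions, and a real run will likely need further options that are not shown here.

```python
import shlex


def build_dragen_cmd(ref_dir, fastq1, fastq2=None, rgid=None, rgsm=None, prefix=None):
    """Assemble an argument list; nothing is executed here."""
    cmd = ["dragen", "--ref-dir", ref_dir, "--fastq-file1", fastq1]
    if fastq2 is not None:  # paired-end input
        cmd += ["--fastq-file2", fastq2]
    if rgid is not None:
        cmd += ["--RGID", rgid]
    if rgsm is not None:
        cmd += ["--RGSM", rgsm]
    if prefix is not None:
        cmd += ["--output-file-prefix", prefix]
    return cmd


if __name__ == "__main__":
    cmd = build_dragen_cmd(
        "/ref", "r1.fastq.gz", "r2.fastq.gz", rgid="rg1", rgsm="s1", prefix="s1"
    )
    print(shlex.join(cmd))
```

Keeping the command as a list until the last moment avoids quoting problems with unusual file names.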

marrip commented 9 months ago

params:

  -p [ --pair-by-name ] arg                               Whether to use read names to identify read pairs. Valid only for BAM input.
  --append-read-index-to-name arg                         Whether to append /1 or /2 to read names for paired-end
  --pair-suffix-delimiter arg                             Character that delimits paired-end suffixes, e.g. / for /1 and /2
  --aws-s3-region arg                                     Specify the geographical region of AWS S3 buckets
  --strip-input-qname-suffixes arg                        Whether to strip /1 or /2 from input read names
  --enable-vcf-compression arg                            Enable compression of VCF output files (Default=true)
  --RGID arg                                              Read group ID
  --RGLB arg                                              Read group library
  --RGPL arg                                              Read group sequencing technology
  --RGPU arg                                              Read group platform unit
  --RGSM arg                                              Read group sample name
  --RGCN arg                                              Read group sequencing center name
  --RGDS arg                                              Read group description
  --RGDT arg                                              Read group run date
  --RGPI arg                                              Read group predicted insert size
  --RGID-tumor arg                                        Read group ID for tumor input
  --RGLB-tumor arg                                        Read group library for tumor input
  --RGPL-tumor arg                                        Read group sequencing technology for tumor input
  --RGPU-tumor arg                                        Read group platform unit for tumor input
  --RGSM-tumor arg                                        Read group sample name for tumor input
  --RGCN-tumor arg                                        Read group sequencing center name for tumor input
  --RGDS-tumor arg                                        Read group description for tumor input
  --RGDT-tumor arg                                        Read group run date for tumor input
  --RGPI-tumor arg                                        Read group predicted insert size for tumor input
  --prepend-filename-to-rgid arg                          Internally prepend the file name to the RGID tag in cases of having the same RGID for different read groups across multiple bams
  --bcl-only-lane arg                                     For BCL input, convert only specified lane number (default all lanes)
  --strict-mode arg                                       For BCL input, abort if any files are missing (false by default)
  --first-tile-only arg                                   For BCL conversion, only convert first tile of input (for testing & debugging)
  --tiles arg                                             For BCL conversion, process only a subset of tiles by a regular expression
  --exclude-tiles arg                                     For BCL conversion, exclude set of tiles by a regular expression
  --bcl-sampleproject-subdirectories arg                  For BCL conversion, output to subdirectories based upon sample sheet 'Sample_Project' column
  --sample-name-column-enabled arg                        Use sample sheet 'Sample_Name' column when naming fastq files & subdirectories
  --fastq-gzip-compression-level arg                      For BCL input, set fastq output compression level 0-9 (default 1)
  --shared-thread-odirect-output arg                      Use linux native asynchronous io (io_submit) for file output (Default=false)
  --bcl-num-parallel-tiles arg                            For pure BCL conversion to FASTQ, # of tiles to process in parallel (default 1)
  --bcl-num-conversion-threads arg                        For pure BCL conversion to FASTQ, # of threads for conversion (per tile, default # cpu threads)
  --bcl-num-compression-threads arg                       For pure BCL conversion to FASTQ, # of threads for fastq.gz output compression (per tile, default # cpu threads, or HW+12)
  --bcl-num-decompression-threads arg                     For pure BCL conversion to FASTQ, # of threads for bcl/cbcl input decompression (per tile, default half # cpu threads, or HW+8. Only applies when preloading files)
  --bcl-only-matched-reads arg                            For pure BCL conversion, do not output files for 'Undetermined' [unmatched] reads (output by default)
  --no-lane-splitting arg                                 For pure BCL conversion to FASTQ, do not split FASTQ file by lane (false by default)
  --num-unknown-barcodes-reported arg                     For pure BCL conversion to FASTQ, # of Top Unknown Barcodes to output (1000 by default)
  --bcl-validate-sample-sheet-only arg                    For BCL conversion, only validate RunInfo.xml & SampleSheet files
  --bcl-num-ora-compression-threads-per-file arg          # of threads for ora compression per file (default 10)
  --bcl-num-ora-compression-parallel-files arg            # of files to process in parallel for ora compression (default 6)
  --output-legacy-stats arg                               For BCL conversion, also output stats in legacy (bcl2fastq2) format (false by default)
  --no-sample-sheet arg                                   BCL: Enable legacy no-sample-sheet operation (No demux or trimming. No settings supported. False by default, not recommended)
  --enable-map-align arg                                  Enable the mapper/aligner (Default=true)
  --enable-map-align-output arg                           Enable the output from mapper/aligner
  --enable-rna arg                                        Enable the mapper/aligner RNA pipeline
  --rna-gf-restrict-genes arg                             Ignore genes with biotype other than protein coding or lncRNA for gene fusions
  --enable-auto-multifile arg                             Import subsequent segments of *_001.fastq files (Default=true)
  --combine-samples-by-name arg                           Import all fastq files with same sample name as given file (even across lanes) (Default=false)
  --enable-bam-indexing arg                               Output a .bai index file along with the output .bam
  --enable-sort arg                                       Enable sorting after mapping/alignment   (Default=true)
  --enable-duplicate-marking arg                          Enable marking or removal of duplicate alignment records (Default=false)
  --remove-duplicates arg                                 Remove duplicates instead of marking them with flag 0x400 (Default=false)
  --fastq-offset arg                                      FASTQ quality offset value. Set to 33 or 64 (Default=33)
  --fastq-n-quality arg                                   FASTQ quality to output for N base calls
  --ref-sequence-filter arg                               Output only reads mapping to this reference sequence
  --generate-md-tags arg                                  Whether to generate MD tags for alignment output records
  --generate-zs-tags arg                                  Whether to generate ZS tags for alignment output records
  --generate-xq-tags arg                                  Whether to generate xq:i tags (extended MAPQ) for alignment output records
  --preserve-bqsr-tags arg                                If true, pass through BI/BD tags (default=true)
  --methylation-protocol arg                              Library protocol for methylation analysis. (none|directional|non-directional|directional-complement|pbat)
  --methylation-match-bismark arg                         When running methyl-seq analysis, try to match Bismark output
  --methylation-TAPS arg                                  Set to true if input assays are generated by TAPS, rather than typical bisulfite-conversion-based methylation assays.
  --methylation-keep-ref-cytosine arg                     Set to true to keep all reference cytosines in the CX_report, even if they don't appear in the input reads. (Default=False)
  --methylation-compress-cx-report arg                    Set to true to enable compression of the CX_report. (Default=False)
  --enable-methylation-calling arg                        If true, merge methyl-seq runs and add tags. If false, methyl-seq just writes a BAM file per aligner run
  --methylation-generate-cytosine-report arg              Whether to generate a genome-wide cytosine methylation report
  --methylation-generate-mbias-report arg                 Whether to generate a per-sequencer-cycle methylation bias report
  --methylation-reports-only arg                          Skip methylation analysis and generates reports. Requires dragen methylated BAM input
  --methylation-mapping-implementation arg                What implementation to use during methylation mapping. (single-pass|multi-pass)
  --preserve-map-align-order arg                          Preserve the order of mapper/aligner output to produce deterministic results.  Impacts performance
  --filter-flags-from-output arg                          Filter output alignments with any bits set in 'val' present in the flags field.  Hex & decimal values accepted
  --umi-library-type arg                                  Batch option for read collapsing [random-duplex, random-simplex, nonrandom-duplex, non-umi]
  --umi-enable arg                                        Enable UMI-based read processing
  --umi-min-supporting-reads arg                          Minimum number of supporting reads required for a family. Applied independently to read1 and read2
  --umi-emit-multiplicity arg                             Consensus read output type: both or duplex only or simplex only: [both, duplex, simplex], Default: both
  --enable-positional-collapsing arg                      Enable positional collapsing. (Default = false)
  --enable-pgx arg                                        Batch option for enabling all PGx callers (e.g. Star Allele, CYP2D6, CYP2B6). VC will be enabled.
  --enable-dna-amplicon arg                               Enable DNA amplicon mode for alignment and variant calling (Default=false)
  --enable-rna-amplicon arg                               Enable RNA amplicon mode (Default=false)
  --repeat-genotype-enable arg                            Enable calling of repeat-expansion variants
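
As a note on `--filter-flags-from-output` and the `0x400` duplicate flag mentioned under `--remove-duplicates`: the sketch below shows how I read the description, namely that an alignment is dropped when any bit of the given value is set in its flags field. This is an illustration of the bitmask logic only, not Dragen's implementation; `parse_mask` and `is_filtered` are hypothetical names.

```python
DUPLICATE_FLAG = 0x400  # flag bit used to mark duplicate alignments


def parse_mask(value):
    """Parse a decimal ("1024") or hex ("0x400") string into an int."""
    return int(value, 0)


def is_filtered(flag, mask):
    """True if any bit of mask is set in the alignment's flags field."""
    return (flag & mask) != 0


if __name__ == "__main__":
    mask = parse_mask("0x400")
    for flag in (99, 147, 1123):
        print(flag, "filtered" if is_filtered(flag, mask) else "kept")
```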

marrip commented 9 months ago

uncertain:

  -r [ --ref-dir ] arg                                    Directory with reference and hash tables
  -c [ --config-file ] arg                                Configuration file

marrip commented 9 months ago

license stuff:

  --sse-key arg                                           Set server-side encryption [AES256]
  --lic-server arg                                        set license server for cloud sites: http://<base64_user>:<base64_pass>@<path>
  --lic-credentials arg                                   License configuration file.
  --lic-instance-id-location arg                          set cloud instance ID location
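
A small sketch of how the `--lic-server` value could be put together from the pattern `http://<base64_user>:<base64_pass>@<path>`. It assumes plain standard base64 of the UTF-8 user name and password, which the help text does not spell out, so treat the encoding choice as a guess. `lic_server_url` and the host name are hypothetical.

```python
import base64


def _b64(text):
    """Standard base64 of the UTF-8 bytes (assumed encoding, see note above)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def lic_server_url(user, password, path):
    """Fill the http://<base64_user>:<base64_pass>@<path> pattern."""
    return f"http://{_b64(user)}:{_b64(password)}@{path}"


if __name__ == "__main__":
    # Placeholder credentials and host, for illustration only.
    print(lic_server_url("user", "pass", "license.example.com"))
```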

marrip commented 9 months ago

Running `docker run -v /var/run/docker.sock:/var/run/docker.sock --rm alpine/dfimage -sV=1.36 etycksen/dragen4:4.2.4` gives:

CMD ["/bin/bash"]
RUN RUN yum install unzip wget -y # buildkit
RUN RUN yum install which -y # buildkit
RUN RUN yum config-manager --enable ol8_codeready_builder # buildkit
RUN RUN yum install oracle-epel-release-el8 -y # buildkit
RUN RUN yum install git -y # buildkit
RUN RUN yum install perl -y # buildkit
RUN RUN yum install R -y # buildkit
RUN RUN yum install bc dkms gdb rsync smartmontools sos time -y # buildkit
RUN RUN yum install kernel kernel-devel -y # buildkit
RUN RUN yum install hostname -y # buildkit
COPY ./uname.sh /usr/bin/uname # buildkit
        usr/
        usr/bin/
        usr/bin/uname

COPY dragen-4.2.4-9.el8.x86_64.run . # buildkit
        dragen-4.2.4-9.el8.x86_64.run

RUN RUN /bin/sh dragen-4.2.4-9.el8.x86_64.run; rm -rf dragen-4.2.4-9.el8.x86_64.run  \
        && rm -rf /dragen_software # buildkit
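
The history above is a bit noisy (doubled `RUN RUN`, `# buildkit` suffixes, file listings under each `COPY`). A minimal sketch for tidying it into something closer to a Dockerfile, assuming the layout shown above: indented lines are file listings unless the previous line ended with a backslash. `clean_history` is a hypothetical helper and `RAW` is an excerpt of the output, not the full history.

```python
import re

# Excerpt of the dfimage output above, used as sample input.
RAW = """CMD ["/bin/bash"]
RUN RUN yum install unzip wget -y # buildkit
COPY ./uname.sh /usr/bin/uname # buildkit
        usr/
        usr/bin/
        usr/bin/uname

RUN RUN /bin/sh dragen-4.2.4-9.el8.x86_64.run; rm -rf dragen-4.2.4-9.el8.x86_64.run  \\
        && rm -rf /dragen_software # buildkit
"""


def clean_history(text):
    """Drop file listings and buildkit noise, keep instructions."""
    out = []
    continuing = False  # True while inside a backslash-continued instruction
    for line in text.splitlines():
        if not line.strip():
            continuing = False
            continue
        if line[0].isspace():
            if not continuing:
                continue  # file listing printed under a COPY
            line = "    " + line.strip()  # continuation of the previous RUN
        line = re.sub(r"^RUN RUN ", "RUN ", line)
        line = re.sub(r"\s*# buildkit$", "", line)
        continuing = line.rstrip().endswith("\\")
        out.append(line)
    return out


if __name__ == "__main__":
    print("\n".join(clean_history(RAW)))
```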