
# nf-fastq-plus

Generate IGO fastqs, bams, stats and fingerprinting

## Run

There are two options for running the modules in this pipeline - Demultiplex and Stats (end-to-end) and Stats Only (stats on an existing demultiplexing output). Links for developers are collected in the For Development section below.

### Demultiplex and Stats

Description: Runs the end-to-end pipeline of demultiplexing and stats. Its input is the name of the sequencing run.

```
# Basic
nextflow main.nf --run ${RUN}

# Skip demultiplexing
nextflow main.nf --run ${RUN} --force true

# Run demux and stats only on one request
nextflow main.nf --run ${RUN} --filter ${REQUEST_ID}

# Run in background
nohup nextflow main.nf --run ${RUN} --force true -bg

# Push pipeline updates to nf-dashboard
nohup nextflow main.nf --run ${RUN} --force true -with-weblog 'http://dlviigoweb1:4500/api/nextflow/receive-nextflow-event' -bg
```

#### Arguments (`--arg`)

* `--run`: name of the sequencing run to process
* `--force`: proceed without re-running demultiplexing (used with `true` in the "Skip demultiplexing" example above)
* `--filter`: run demux and stats only on the given request ID

#### Options (`-opt`)

* `-bg`: run Nextflow in the background
* `-with-weblog`: POST pipeline events to the given URL (here, nf-dashboard)

### Stats Only

Description: Runs stats on an existing demultiplexing output. Its inputs are the sample sheet and the demultiplexing directory.

```
# Basic
nextflow samplesheet_stats_main.nf --ss ${SAMPLE_SHEET} --dir ${DEMULTIPLEX_DIRECTORY}

# Run stats only on one request
nextflow samplesheet_stats_main.nf --ss ${SAMPLE_SHEET} --dir ${DEMULTIPLEX_DIRECTORY} --filter ${REQUEST_ID}

# Run in background
nohup nextflow samplesheet_stats_main.nf --ss ${SAMPLE_SHEET} --dir ${DEMULTIPLEX_DIRECTORY} -bg
```

#### Arguments (`--arg`)

* `--ss`: path to the sample sheet of the demultiplexed run
* `--dir`: path to the demultiplexing output directory
* `--filter`: run stats only on the given request ID

#### Options (`-opt`)

* `-bg`: run Nextflow in the background

### Re-running Pipeline
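Re-runs can typically lean on Nextflow's built-in `-resume` option, which reuses cached task results; whether every process in this pipeline resumes cleanly is an assumption. A minimal sketch:

```
# Re-run a sequencing run, reusing cached results where inputs are unchanged
nextflow main.nf --run ${RUN} -resume
```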

## For Development

Please read:

* Project Structure
* Adding a new workflow

### Steps for Adding a New Module

1) Add module

```
├── modules
│   └── process.nf
```

```
process {PROCESS_NAME} {
  [ directives ]

  output:
  ...
  stdout()

  shell:
  template '{PROCESS_SCRIPT}'
}
```

2) Add template

```
└── templates
    └── process.sh
```
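Since the skeleton above uses a `shell:` block, Nextflow variables in the template are referenced with `!{ }`. A minimal sketch - this template body and the `SAMPLE` variable are illustrative, not taken from the repo:

```
#!/bin/bash
# templates/process.sh - hypothetical template body
# !{ } interpolates variables from the calling process's shell: block
echo "Processing sample !{SAMPLE}"   # SAMPLE is an assumed process input
```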

3) Emit PARAMS (only if downstream processes depend on the output)

```
workflow wkflw {
  take:
  PARAMS
  INPUT

  main:
  task( PARAMS, INPUT )

  emit:
  PARAMS = task.out.PARAMS   // Assign PARAMS so that it's available in the main.nf
  VALUE = task.out.VALUE
}
```

* **Why?** Nextflow channels emit asynchronously. This means that upstream processes will emit and pass to the next 
available process and not necessarily the expected one. For instance, if process A emits parameters used by all 
downstream processes and process B emits the value that will be transformed by those parameters, process C will not 
necessarily receive the process A parameters that apply to the value emitted by process B, because each process has an 
independent, asynchronous channel. Re-emitting `PARAMS` lets `main.nf` wire the pairing explicitly, as in the sketch below.
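A minimal sketch of that explicit wiring (`wkflw_A`, `wkflw_B`, and `task_C` are hypothetical names, not processes from this pipeline):

```
// main.nf - hypothetical wiring that keeps params and values paired
workflow {
  wkflw_A()                                         // emits PARAMS
  wkflw_B( wkflw_A.out.PARAMS )                     // consumes PARAMS, re-emits it alongside VALUE
  task_C( wkflw_B.out.PARAMS, wkflw_B.out.VALUE )   // receives the matching pair
}
```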

4) (Optional) Add logging

In the modules, convert the exported member to a workflow that calls an included `log_out` process to log everything 
sent to stdout by the process. See below:

```
include log_out as out from './log_out'

process task {
  output:
  stdout()   // Add this to your outputs
  ...

  shell:
  '''
  echo "Hello World"   # Example sent to STDOUT
  ...
  '''
}

workflow task_wkflw {   // This is what will actually be exported
  main:
  task | out
}
```


#### Logging
There are three files that log information - 
* `LOG_FILE`: All output is logged here (except commands)
* `CMD_FILE`: All stat commands are logged to this file
* `DEMUX_LOG_FILE`: All demultiplexing commands are logged here

### Testing 
Docker Container Actions run our integration tests on GitHub. To test changes, please build the Dockerfile from the repository root and verify no errors are generated from the `samplesheet_stats_main_test_hwg.sh` and `cellranger_demux_stats.sh` scripts.

```
docker image build -t nf-fastq-plus-playground .
```

Test stats-only workflow

```
docker run --entrypoint /nf-fastq-plus/testPipeline/e2e/samplesheet_stats_main_test_hwg.sh -v $(pwd)/../nf-fastq-plus:/nf-fastq-plus nf-fastq-plus-playground
```

Test e2e (demux & stats)

```
docker run --entrypoint /nf-fastq-plus/testPipeline/e2e/cellranger_demux_stats.sh -v $(pwd)/../nf-fastq-plus:/nf-fastq-plus nf-fastq-plus-playground
```


## Nextflow Config
Modify directory locations, binaries, etc. in the `nextflow.config` file. A sketch of how these values might be declared appears after the lists below.

### Important Files

```
LOG_FILE         # Logs all output from the pipeline
CMD_FILE         # Logs all commands from the pipeline (e.g. was bcl2fastq run w/ 1 or 0 mismatches?)
DEMUX_LOG_FILE   # Logs output of bcl2fastq commands
```


### Important Directories

```
STATS_DIR                   # Where final BAMs are written to
STATSDONEDIR                # Where stat (.txt) files & cellranger output is written to
PROCESSED_SAMPLE_SHEET_DIR  # Where split samplesheets go (these are used for demuxing and stats)
LAB_SAMPLE_SHEET_DIR        # Original source of samplesheets
COPIED_SAMPLE_SHEET_DIR     # Where original samplesheets are copied to
CROSSCHECK_DIR              # Directory used for fingerprinting
SHARED_SINGLE_CELL_DIR      # Directory used by DLP process to create metadata.yaml (should happen automatically)
```


### Other

```
LOCAL_MEM   # GB of memory to give a process (e.g. demultiplexing) if executor=local
```
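A minimal sketch of how these might look in `nextflow.config` - the `params` scope and every path/value here are illustrative assumptions, not taken from the repo:

```
// nextflow.config - illustrative values only
params {
  LOG_FILE       = "/path/to/logs/nf-fastq-plus.log"   // hypothetical path
  CMD_FILE       = "/path/to/logs/commands.log"        // hypothetical path
  DEMUX_LOG_FILE = "/path/to/logs/demux.log"           // hypothetical path
  STATS_DIR      = "/path/to/stats"                    // hypothetical path
  STATSDONEDIR   = "/path/to/stats/DONE"               // hypothetical path
  LOCAL_MEM      = 4                                   // GB per process when executor=local
}
```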


## Crontab Setup
The pipeline can be kicked off automatically by the `crontab/detect_copied_sequencers.sh` script. Add the following
to enable the crontab:

```
crontab -e
```

Inside the crontab:

```
SHELL=/bin/bash

# Add path to bsub executable
PATH=${PATH}:/igoadmin/lsfigo/lsf10/10.1/linux3.10-glibc2.17-x86_64/bin
```

Load the LSF profile prior to running the command.
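A sketch of what the entry itself might look like - the schedule, the LSF profile path, and the install location are assumptions:

```
# Hypothetical entry: every 15 minutes, load the LSF profile, then run the detector
*/15 * * * * source /path/to/lsf/conf/profile.lsf; /path/to/nf-fastq-plus/crontab/detect_copied_sequencers.sh
```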

## Docker Container

Note - Changes to the GitHub Actions using the Dockerfile need to be tagged, e.g.

```
VERSION_NUMBER=...      # e.g. "v2"
git add ...
git commit -m "My change"
git tag -a -m "very important change" ${VERSION_NUMBER}
git push --follow-tags
```
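This matters because an action referenced by tag only picks up changes when the tag moves. A sketch of such a reference - the workflow file and the exact `uses:` line are assumptions about how the tests are wired:

```
# .github/workflows/test.yml - hypothetical snippet
- uses: mskcc/nf-fastq-plus@v2   # runs the action at tag v2, not at HEAD
```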


## Debug

### Demultiplexing

I) Samplesheet index length doesn't match RunInfo.xml length

**BCL2FASTQ**

**DRAGEN**

    ERROR: Sample Sheet Error: SampleSheet.csv sample #1 (index 'GGTGAACC') has an index of length 8 bases, 
    but a length of 10 was expected based upon RunInfo.xml or the OverrideCycles setting.

Solution: Mask the extra index cycles by adding the OverrideCycles setting to the SampleSheet. In the example below, `Y151;I8N2;I8N2;Y151` keeps two 151-cycle reads and treats each 10-cycle index as 8 index bases plus 2 masked (`N`) cycles:

```
[Settings],,,,,,,,
OverrideCycles,Y151;I8N2;I8N2;Y151
,,,,,,,,
```

II) Only the ___MD.txt file is available in the STATSDONEDIR for a sample

III) My GitHub Action change isn't reflected in the integration tests

Solution: See the Docker Container section above - action changes need to be pushed with a new tag.