rki-mf1 / clean

A nextflow pipeline for decontamination of short reads, long reads and contigs
BSD 3-Clause "New" or "Revised" License
30 stars 3 forks source link

Reorganize the results directory #67

Closed matthuska closed 10 months ago

matthuska commented 11 months ago

Reorganize the results/ directory to make it easier to understand. The current goal is:

results/
    clean/
        <sample_name>.fastq.gz
    removed/
        <sample_name>.fastq.gz
    intermedate/
        to-remove/
            mapped.fastq.gz
            unmapped.fastq.gz
            mapped.bam
            unmapped.bam
            dcs-strict/
                no-dcs.bam
                true-dcs.bam
                false-dcs.bam
            soft-clipped/
                soft-clipped.bam
                passed-clipped.bam
        to-keep/
            mapped.fastq.gz
            unmapped.fastq.gz
            mapped.bam
            unmapped.bam
            dcs-strict/
                no-dcs.bam
                true-dcs.bam
                false-dcs.bam
            soft-clipped/
                soft-clipped.bam
                passed-clipped.bam
    logs/\*.html
    qc/multiqc_report.html

(intermediate files will be prefixed with the sample name as well)

The initial PR disables all publishDir directives and only adds back publishing of clean files when the --keep directive is used, renaming files as required. I'll implement the rest of the file publishing in later commits on this same branch/PR.

matthuska commented 11 months ago

Current status:

results/
├── clean
│   ├── ERR10008447.fastq.gz
│   └── ERR10035961.fastq.gz
├── intermediate
│   ├── map-to-keep
│   │   ├── ERR10008447.mapped.fastq.gz
│   │   ├── ERR10008447.unmapped.fastq.gz
│   │   ├── ERR10035961.mapped.fastq.gz
│   │   └── ERR10035961.unmapped.fastq.gz
│   └── map-to-remove
│       ├── ERR10008447.mapped.fastq.gz
│       ├── ERR10008447.unmapped.fastq.gz
│       ├── ERR10035961.mapped.fastq.gz
│       └── ERR10035961.unmapped.fastq.gz
├── logs
│   ├── execution_report_2023-11-16_13-08-20.html
│   └── execution_timeline_2023-11-16_13-08-20.html
└── removed
    ├── ERR10008447.fastq.gz
    └── ERR10035961.fastq.gz

7 directories, 14 files
matthuska commented 11 months ago

Still need to check if Illumina data (single and paired end) is organized properly, but for now ONT data seems to be fine.

Current output:

results
├── clean
│   ├── ERR10008447.fastq.gz
│   └── ERR10035961.fastq.gz
├── intermediate
│   ├── map-to-keep
│   │   ├── ERR10008447.mapped.fastq.gz
│   │   ├── ERR10008447.sorted.bam
│   │   ├── ERR10008447.sorted.bam.bai
│   │   ├── ERR10008447.sorted.flagstats.txt
│   │   ├── ERR10008447.sorted.idxstats.tsv
│   │   ├── ERR10008447.unmapped.fastq.gz
│   │   ├── ERR10035961.mapped.fastq.gz
│   │   ├── ERR10035961.sorted.bam
│   │   ├── ERR10035961.sorted.bam.bai
│   │   ├── ERR10035961.sorted.flagstats.txt
│   │   ├── ERR10035961.sorted.idxstats.tsv
│   │   ├── ERR10035961.unmapped.fastq.gz
│   │   ├── soft-clipped
│   │   │   ├── ERR10008447.passed-clipped.bam
│   │   │   ├── ERR10008447.passed-clipped.bam.bai
│   │   │   ├── ERR10008447.soft-clipped.bam
│   │   │   ├── ERR10008447.soft-clipped.bam.bai
│   │   │   ├── ERR10035961.passed-clipped.bam
│   │   │   ├── ERR10035961.passed-clipped.bam.bai
│   │   │   ├── ERR10035961.soft-clipped.bam
│   │   │   └── ERR10035961.soft-clipped.bam.bai
│   │   └── strict-dcs
│   │       ├── ERR10035961.false-dcs.bam
│   │       ├── ERR10035961.false-dcs.bam.bai
│   │       ├── ERR10035961.no-dcs.bam
│   │       ├── ERR10035961.no-dcs.bam.bai
│   │       ├── ERR10035961.true-dcs.bam
│   │       └── ERR10035961.true-dcs.bam.bai
│   └── map-to-remove
│       ├── ERR10008447.mapped.fastq.gz
│       ├── ERR10008447.sorted.bam
│       ├── ERR10008447.sorted.bam.bai
│       ├── ERR10008447.sorted.flagstats.txt
│       ├── ERR10008447.sorted.idxstats.tsv
│       ├── ERR10008447.unmapped.fastq.gz
│       ├── ERR10035961.mapped.fastq.gz
│       ├── ERR10035961.sorted.bam
│       ├── ERR10035961.sorted.bam.bai
│       ├── ERR10035961.sorted.flagstats.txt
│       ├── ERR10035961.sorted.idxstats.tsv
│       ├── ERR10035961.unmapped.fastq.gz
│       ├── soft-clipped
│       │   ├── ERR10008447.passed-clipped.bam
│       │   ├── ERR10008447.passed-clipped.bam.bai
│       │   ├── ERR10008447.soft-clipped.bam
│       │   ├── ERR10008447.soft-clipped.bam.bai
│       │   ├── ERR10035961.passed-clipped.bam
│       │   ├── ERR10035961.passed-clipped.bam.bai
│       │   ├── ERR10035961.soft-clipped.bam
│       │   └── ERR10035961.soft-clipped.bam.bai
│       └── strict-dcs
│           ├── ERR10035961.false-dcs.bam
│           ├── ERR10035961.false-dcs.bam.bai
│           ├── ERR10035961.no-dcs.bam
│           ├── ERR10035961.no-dcs.bam.bai
│           ├── ERR10035961.true-dcs.bam
│           └── ERR10035961.true-dcs.bam.bai
├── logs
│   ├── execution_report_2023-11-17_14-40-24.html
│   └── execution_timeline_2023-11-17_14-40-24.html
├── qc
│   └── multiqc_report.html
└── removed
    ├── ERR10008447.fastq.gz
    └── ERR10035961.fastq.gz

12 directories, 59 files
hoelzer commented 11 months ago

I have a problem with --keep:

nextflow run clean.nf --input_type nano --input test-data/SARSCoV2-nanopore.fastq.gz --host hsa --control dcs -profile local,docker --keep test-data/wuhan.fasta

results in

ERROR ~ Error executing process > 'prepare_keep (1)'

Caused by:
  Process `prepare_keep (1)` terminated with an error exit status (4)

Command executed:

  # -L for following a symbolic link
  if ! ( file -L wuhan.fasta | grep -q 'BGZF; gzip compatible\|gzip compressed' ); then
    sed -i -e '$a\' wuhan.fasta
    bgzip -@ 8 < wuhan.fasta > wuhan.fasta.gz
    # now wuhan.fasta'.gz'
  else
    mv wuhan.fasta wuhan.fasta.tmp
    zcat wuhan.fasta.tmp | sed -e '$a\' | bgzip -@ 8 -c > wuhan.fasta.gz
  fi

Command exit status:
  4

Command output:
  (empty)

Command error:
  .command.sh: line 3: file: command not found
  sed: couldn't open temporary file ./sedUnb1iU: Permission denied

which I think is a problem with this line:

https://github.com/hoelzer/clean/pull/67/files#diff-4d4d28fbcc3862963d2c27d9710b1069e779fdeed9defb8b25d57d5bfe9e1acdL64

I am not exactly sure what's happening here? You are checking if the file is gz compressed and if not, you are doing an in-place sed on the linked file... which should not work, or? This would also change the original file?

Would it not be better to do a

    sed -e '\$a\\' ${fasta} > ${fasta}.tmp
    bgzip -@ ${task.cpus} < ${fasta}.tmp > ${fasta}.gz

instead of the current code?

    sed -i -e '\$a\\' ${fasta}
    bgzip -@ ${task.cpus} < ${fasta} > ${fasta}.gz

If I change that, this works for me.

Note: I am testing on a Macbook Air M2... where sed might do strange stuff. On the other hand, I am using the minimap2 Docker container which includes the Linux sed

hoelzer commented 11 months ago

Another thing when testing Illumina data:

nextflow run clean.nf --input_type illumina --input 'test-data/SARSCoV2-illumina.R{1,2}.fastq.gz' --host hsa --control phix  -profile local,docker

results in

ERROR ~ No such variable: nanoControlBedChannel

 -- Check script 'clean.nf' at line: 197 or see '.nextflow.log' file for more details

Ah maybe there is just a nanoControlBedChannel = [] missing in

// load control fasta sequence
if ( params.control ) {
  if ( 'phix' in params.control.split(',') ) {
    illuminaControlFastaChannel = Channel.fromPath( workflow.projectDir + '/data/controls/phix.fa.gz' , checkIfExists: true )
  } else { illuminaControlFastaChannel = Channel.empty() }
...

EDIT: yep. I commit the change.

hoelzer commented 11 months ago

Attention: I also submitted now my change to the sed command which was not working otherwise.

hoelzer commented 11 months ago

hm, but now there is another problem w/ Illumina after I fixed the empty BED channel thing:

nextflow run clean.nf --input_type illumina --input 'test/illumina{_1,_2}.fastq.gz' --host hsa --control phix  -profile local,docker
executor >  local (3)
[skipped  ] process > prepare_contamination:prepare_auto_host:download_host (1) [100%] 1 of 1, stored: 1 ✔
[21/73e43d] process > prepare_contamination:concat_contamination                [100%] 1 of 1 ✔
[9f/7bc61a] process > clean:minimap2 (1)                                        [100%] 1 of 1, failed: 1 ✘
[-        ] process > clean:sort_bam                                            -
[-        ] process > clean:index_bam                                           -
[-        ] process > clean:idxstats_from_bam                                   -
[-        ] process > clean:flagstats_from_bam                                  -
[-        ] process > clean:split_bam                                           -
[-        ] process > clean:index_bam2                                          -
[-        ] process > clean:fastq_from_bam                                      -
[18/01ff03] process > qc:fastqc (1)                                             [100%] 1 of 1 ✔
[-        ] process > qc:multiqc                                                [  0%] 0 of 1
ERROR ~ Error executing process > 'clean:minimap2 (1)'

Caused by:
  Process `clean:minimap2 (1)` terminated with an error exit status (1)

Command executed:

  minimap2 -ax sr -N 5 --split-prefix tmp --secondary=no -t 8 db.fa.gz illumina_1.fastq.gz illumina_2.fastq.gz | samtools view -bhS -@ 8 > illumina.bam

Command exit status:
  1

Command output:
  (empty)

Command error:
  [M::mm_idx_gen::118.418*1.88] collected minimizers
  [main_samview] fail to read the header from "-".

Maybe we have to move some of these errors as issues to finalize this reorganization PR.

matthuska commented 11 months ago

@hoelzer : regarding the sed issues in the prepare_keep process. This should just be replaced with seqkit I think. I'll prepare and then push something, then we can debate it here.

hoelzer commented 11 months ago

@hoelzer : regarding the sed issues in the prepare_keep process. This should just be replaced with seqkit I think. I'll prepare and then push something, then we can debate it here.

ok thx!

I am also not sure why

nextflow run clean.nf --input_type illumina --input 'test/illumina{_1,_2}.fastq.gz' --host hsa -profile local,docker

ERROR ~ Error executing process > 'clean:minimap2 (1)'

Caused by:
  Process `clean:minimap2 (1)` terminated with an error exit status (1)

Command executed:

  minimap2 -ax sr -N 5 --split-prefix tmp --secondary=no -t 8 db.fa.gz illumina_1.fastq.gz illumina_2.fastq.gz | samtools view -bhS -@ 8 > illumina.bam

Command exit status:
  1

Command output:
  (empty)

Command error:
  [M::mm_idx_gen::117.876*1.85] collected minimizers
  [main_samview] fail to read the header from "-".

is happening. But looking into this right now

hoelzer commented 11 months ago

alright, forget this problem. I think this is a RAM issue. Everything fine when I am choosing a smaller --host such as eco

hoelzer commented 11 months ago

ok, I also checked the Illumina branch. It works for me, but the output still needs the same structuring as the ONT branch. Currently, I get for the command:

nextflow run clean.nf --input_type illumina --input 'test/illumina{_1,_2}.fastq.gz' --host eco --control phix -profile local,docker
❯ tree results
results
├── illumina.mapped_1.fastq.gz
├── illumina.mapped_2.fastq.gz
├── illumina.mapped_singleton.fastq.gz
├── illumina.unmapped_1.fastq.gz
├── illumina.unmapped_2.fastq.gz
├── illumina.unmapped_singleton.fastq.gz
├── intermediate
│   ├── host.fa.fai
│   ├── host.fa.gz
│   └── map-to-remove
│       ├── illumina.mapped_1.fastq.gz
│       ├── illumina.mapped_2.fastq.gz
│       ├── illumina.mapped_singleton.fastq.gz
│       ├── illumina.sorted.bam
│       ├── illumina.sorted.bam.bai
│       ├── illumina.sorted.flagstats.txt
│       ├── illumina.sorted.idxstats.tsv
│       ├── illumina.unmapped_1.fastq.gz
│       ├── illumina.unmapped_2.fastq.gz
│       └── illumina.unmapped_singleton.fastq.gz
├── logs
│   ├── execution_report_2023-12-11_17-33-43.html
│   └── execution_timeline_2023-12-11_17-33-43.html
└── qc
    └── multiqc_report.html

so the clean and removed folders are missing and the output still has the potentially misleading mapped/unmapped labeling. Can this be easily changed? thx!

matthuska commented 11 months ago

Hmm I thought Illumina was working, I guess not. I'll fix that tomorrow.

hoelzer commented 11 months ago

Hmm I thought Illumina was working, I guess not. I'll fix that tomorrow.

thx! Besides, all looks good from my side. (and I can then also test the seqkit replacement for the sed command, let me know if the container needs to be updated as well in that context... ah but probably you can just switch from minimap2 to the seqkit label using the container that is already in the config)

matthuska commented 11 months ago

I had to add samtools and tabix (for bgzip) to the seqkit conda env, since all of those tools are necessary to create the block gzipped and indexed reference genome you want. You'll see the changes in the env file, and then I guess you'll need to create a matching container.

hoelzer commented 11 months ago

Switch to seqkit for check_own and concat_contamination

Alright thx! Yes that looks much cleaner ;)

I generated a matching container for seqkit and the dependencies and also tested it.

nextflow run clean.nf --input_type illumina --input 'test/illumina{_1,_2}.fastq.gz' --host eco --control phix -profile local,docker

works for me. I push the updated config container file.

hoelzer commented 11 months ago

Now, from my pov, currently only the correct Illumina output (folders for clean and removed) is missing