Closed matthuska closed 10 months ago
Current status:
results/
├── clean
│ ├── ERR10008447.fastq.gz
│ └── ERR10035961.fastq.gz
├── intermediate
│ ├── map-to-keep
│ │ ├── ERR10008447.mapped.fastq.gz
│ │ ├── ERR10008447.unmapped.fastq.gz
│ │ ├── ERR10035961.mapped.fastq.gz
│ │ └── ERR10035961.unmapped.fastq.gz
│ └── map-to-remove
│ ├── ERR10008447.mapped.fastq.gz
│ ├── ERR10008447.unmapped.fastq.gz
│ ├── ERR10035961.mapped.fastq.gz
│ └── ERR10035961.unmapped.fastq.gz
├── logs
│ ├── execution_report_2023-11-16_13-08-20.html
│ └── execution_timeline_2023-11-16_13-08-20.html
└── removed
├── ERR10008447.fastq.gz
└── ERR10035961.fastq.gz
7 directories, 14 files
Still need to check if Illumina data (single and paired end) is organized properly, but for now ONT data seems to be fine.
Current output:
results
├── clean
│ ├── ERR10008447.fastq.gz
│ └── ERR10035961.fastq.gz
├── intermediate
│ ├── map-to-keep
│ │ ├── ERR10008447.mapped.fastq.gz
│ │ ├── ERR10008447.sorted.bam
│ │ ├── ERR10008447.sorted.bam.bai
│ │ ├── ERR10008447.sorted.flagstats.txt
│ │ ├── ERR10008447.sorted.idxstats.tsv
│ │ ├── ERR10008447.unmapped.fastq.gz
│ │ ├── ERR10035961.mapped.fastq.gz
│ │ ├── ERR10035961.sorted.bam
│ │ ├── ERR10035961.sorted.bam.bai
│ │ ├── ERR10035961.sorted.flagstats.txt
│ │ ├── ERR10035961.sorted.idxstats.tsv
│ │ ├── ERR10035961.unmapped.fastq.gz
│ │ ├── soft-clipped
│ │ │ ├── ERR10008447.passed-clipped.bam
│ │ │ ├── ERR10008447.passed-clipped.bam.bai
│ │ │ ├── ERR10008447.soft-clipped.bam
│ │ │ ├── ERR10008447.soft-clipped.bam.bai
│ │ │ ├── ERR10035961.passed-clipped.bam
│ │ │ ├── ERR10035961.passed-clipped.bam.bai
│ │ │ ├── ERR10035961.soft-clipped.bam
│ │ │ └── ERR10035961.soft-clipped.bam.bai
│ │ └── strict-dcs
│ │ ├── ERR10035961.false-dcs.bam
│ │ ├── ERR10035961.false-dcs.bam.bai
│ │ ├── ERR10035961.no-dcs.bam
│ │ ├── ERR10035961.no-dcs.bam.bai
│ │ ├── ERR10035961.true-dcs.bam
│ │ └── ERR10035961.true-dcs.bam.bai
│ └── map-to-remove
│ ├── ERR10008447.mapped.fastq.gz
│ ├── ERR10008447.sorted.bam
│ ├── ERR10008447.sorted.bam.bai
│ ├── ERR10008447.sorted.flagstats.txt
│ ├── ERR10008447.sorted.idxstats.tsv
│ ├── ERR10008447.unmapped.fastq.gz
│ ├── ERR10035961.mapped.fastq.gz
│ ├── ERR10035961.sorted.bam
│ ├── ERR10035961.sorted.bam.bai
│ ├── ERR10035961.sorted.flagstats.txt
│ ├── ERR10035961.sorted.idxstats.tsv
│ ├── ERR10035961.unmapped.fastq.gz
│ ├── soft-clipped
│ │ ├── ERR10008447.passed-clipped.bam
│ │ ├── ERR10008447.passed-clipped.bam.bai
│ │ ├── ERR10008447.soft-clipped.bam
│ │ ├── ERR10008447.soft-clipped.bam.bai
│ │ ├── ERR10035961.passed-clipped.bam
│ │ ├── ERR10035961.passed-clipped.bam.bai
│ │ ├── ERR10035961.soft-clipped.bam
│ │ └── ERR10035961.soft-clipped.bam.bai
│ └── strict-dcs
│ ├── ERR10035961.false-dcs.bam
│ ├── ERR10035961.false-dcs.bam.bai
│ ├── ERR10035961.no-dcs.bam
│ ├── ERR10035961.no-dcs.bam.bai
│ ├── ERR10035961.true-dcs.bam
│ └── ERR10035961.true-dcs.bam.bai
├── logs
│ ├── execution_report_2023-11-17_14-40-24.html
│ └── execution_timeline_2023-11-17_14-40-24.html
├── qc
│ └── multiqc_report.html
└── removed
├── ERR10008447.fastq.gz
└── ERR10035961.fastq.gz
12 directories, 59 files
I have a problem with --keep
:
nextflow run clean.nf --input_type nano --input test-data/SARSCoV2-nanopore.fastq.gz --host hsa --control dcs -profile local,docker --keep test-data/wuhan.fasta
results in
ERROR ~ Error executing process > 'prepare_keep (1)'
Caused by:
Process `prepare_keep (1)` terminated with an error exit status (4)
Command executed:
# -L for following a symbolic link
if ! ( file -L wuhan.fasta | grep -q 'BGZF; gzip compatible\|gzip compressed' ); then
sed -i -e '$a\' wuhan.fasta
bgzip -@ 8 < wuhan.fasta > wuhan.fasta.gz
# now wuhan.fasta'.gz'
else
mv wuhan.fasta wuhan.fasta.tmp
zcat wuhan.fasta.tmp | sed -e '$a\' | bgzip -@ 8 -c > wuhan.fasta.gz
fi
Command exit status:
4
Command output:
(empty)
Command error:
.command.sh: line 3: file: command not found
sed: couldn't open temporary file ./sedUnb1iU: Permission denied
which I think is a problem with this line:
I am not exactly sure what's happening here? You are checking if the file is gz compressed and if not, you are doing an in-place sed
on the linked file... which should not work, or? This would also change the original file?
Would it not be better to do a
sed -e '\$a\\' ${fasta} > ${fasta}.tmp
bgzip -@ ${task.cpus} < ${fasta}.tmp > ${fasta}.gz
instead of the current code?
sed -i -e '\$a\\' ${fasta}
bgzip -@ ${task.cpus} < ${fasta} > ${fasta}.gz
If I change that, this works for me.
Note: I am testing on a Macbook Air M2... where sed
might do strange stuff. On the other hand, I am using the minimap2
Docker container which includes the Linux sed
Another thing when testing Illumina data:
nextflow run clean.nf --input_type illumina --input 'test-data/SARSCoV2-illumina.R{1,2}.fastq.gz' --host hsa --control phix -profile local,docker
results in
ERROR ~ No such variable: nanoControlBedChannel
-- Check script 'clean.nf' at line: 197 or see '.nextflow.log' file for more details
Ah maybe there is just a nanoControlBedChannel = []
missing in
// load control fasta sequence
if ( params.control ) {
if ( 'phix' in params.control.split(',') ) {
illuminaControlFastaChannel = Channel.fromPath( workflow.projectDir + '/data/controls/phix.fa.gz' , checkIfExists: true )
} else { illuminaControlFastaChannel = Channel.empty() }
...
EDIT: yep. I commit the change.
Attention: I also submitted now my change to the sed
command which was not working otherwise.
hm, but now there is another problem w/ Illumina after I fixed the empty BED channel thing:
nextflow run clean.nf --input_type illumina --input 'test/illumina{_1,_2}.fastq.gz' --host hsa --control phix -profile local,docker
executor > local (3)
[skipped ] process > prepare_contamination:prepare_auto_host:download_host (1) [100%] 1 of 1, stored: 1 ✔
[21/73e43d] process > prepare_contamination:concat_contamination [100%] 1 of 1 ✔
[9f/7bc61a] process > clean:minimap2 (1) [100%] 1 of 1, failed: 1 ✘
[- ] process > clean:sort_bam -
[- ] process > clean:index_bam -
[- ] process > clean:idxstats_from_bam -
[- ] process > clean:flagstats_from_bam -
[- ] process > clean:split_bam -
[- ] process > clean:index_bam2 -
[- ] process > clean:fastq_from_bam -
[18/01ff03] process > qc:fastqc (1) [100%] 1 of 1 ✔
[- ] process > qc:multiqc [ 0%] 0 of 1
ERROR ~ Error executing process > 'clean:minimap2 (1)'
Caused by:
Process `clean:minimap2 (1)` terminated with an error exit status (1)
Command executed:
minimap2 -ax sr -N 5 --split-prefix tmp --secondary=no -t 8 db.fa.gz illumina_1.fastq.gz illumina_2.fastq.gz | samtools view -bhS -@ 8 > illumina.bam
Command exit status:
1
Command output:
(empty)
Command error:
[M::mm_idx_gen::118.418*1.88] collected minimizers
[main_samview] fail to read the header from "-".
Maybe we have to move some of these errors as issues to finalize this reorganization PR.
@hoelzer : regarding the sed issues in the prepare_keep process. This should just be replaced with seqkit I think. I'll prepare and then push something, then we can debate it here.
@hoelzer : regarding the sed issues in the prepare_keep process. This should just be replaced with seqkit I think. I'll prepare and then push something, then we can debate it here.
ok thx!
I am also not sure why
nextflow run clean.nf --input_type illumina --input 'test/illumina{_1,_2}.fastq.gz' --host hsa -profile local,docker
ERROR ~ Error executing process > 'clean:minimap2 (1)'
Caused by:
Process `clean:minimap2 (1)` terminated with an error exit status (1)
Command executed:
minimap2 -ax sr -N 5 --split-prefix tmp --secondary=no -t 8 db.fa.gz illumina_1.fastq.gz illumina_2.fastq.gz | samtools view -bhS -@ 8 > illumina.bam
Command exit status:
1
Command output:
(empty)
Command error:
[M::mm_idx_gen::117.876*1.85] collected minimizers
[main_samview] fail to read the header from "-".
is happening. But looking into this right now
alright, forget this problem. I think this is a RAM issue. Everything fine when I am choosing a smaller --host such as eco
ok, I also checked the Illumina branch. It works for me, but the output still needs the same structuring as the ONT branch. Currently, I get for the command:
nextflow run clean.nf --input_type illumina --input 'test/illumina{_1,_2}.fastq.gz' --host eco --control phix -profile local,docker
❯ tree results
results
├── illumina.mapped_1.fastq.gz
├── illumina.mapped_2.fastq.gz
├── illumina.mapped_singleton.fastq.gz
├── illumina.unmapped_1.fastq.gz
├── illumina.unmapped_2.fastq.gz
├── illumina.unmapped_singleton.fastq.gz
├── intermediate
│ ├── host.fa.fai
│ ├── host.fa.gz
│ └── map-to-remove
│ ├── illumina.mapped_1.fastq.gz
│ ├── illumina.mapped_2.fastq.gz
│ ├── illumina.mapped_singleton.fastq.gz
│ ├── illumina.sorted.bam
│ ├── illumina.sorted.bam.bai
│ ├── illumina.sorted.flagstats.txt
│ ├── illumina.sorted.idxstats.tsv
│ ├── illumina.unmapped_1.fastq.gz
│ ├── illumina.unmapped_2.fastq.gz
│ └── illumina.unmapped_singleton.fastq.gz
├── logs
│ ├── execution_report_2023-12-11_17-33-43.html
│ └── execution_timeline_2023-12-11_17-33-43.html
└── qc
└── multiqc_report.html
so the clean
and removed
folders are missing and the output still has the potentially misleading mapped
/unmapped
labeling. Can this be easily changed? thx!
Hmm I thought Illumina was working, I guess not. I'll fix that tomorrow.
Hmm I thought Illumina was working, I guess not. I'll fix that tomorrow.
thx! Besides, all looks good from my side. (and I can then also test the seqkit
replacement for the sed
command, let me know if the container needs to be updated as well in that context... ah but probably you can just switch from minimap2
to the seqkit
label using the container that is already in the config)
I had to add samtools and tabix (for bgzip) to the seqkit conda env, since all of those tools are necessary to create the block gzipped and indexed reference genome you want. You'll see the changes in the env file, and then I guess you'll need to create a matching container.
Switch to seqkit for check_own and concat_contamination
Alright thx! Yes that looks much cleaner ;)
I generated a matching container for seqkit and the dependencies and also tested it.
nextflow run clean.nf --input_type illumina --input 'test/illumina{_1,_2}.fastq.gz' --host eco --control phix -profile local,docker
works for me. I push the updated config container file.
Now, from my pov, currently only the correct Illumina output (folders for clean and removed) is missing
Reorganize the
results/
directory to make it easier to understand. The current goal is:(intermediate files will be prefixed with the sample name as well)
The initial PR disables all publishDir directives and only adds back publishing of clean files when the
--keep
directive is used, renaming files as required. I'll implement the rest of the file publishing in later commits on this same branch/PR.