ptrebert / project-diploid-assembly

Pipeline code for creating a fully haplotype-resolved assembly from a combination of PacBio/ONT long reads and Illumina Strand-seq data
MIT License
15 stars 3 forks source link

Error in rule minimap_contig_to_known_reference_alignment #13

Open wjj666 opened 3 years ago

wjj666 commented 3 years ago

Hi, I run the pipeline to with ONT and Strand-seq reads of HG002. It leads to this error:

Error in rule minimap_contig_to_known_reference_alignment: jobid: 97 output: output/alignments/contigs_to_reference/reference_assembly/clustered/NA24385_project_nextseq-npe_sseq/NA24385_project_pbrs2-ccs_1000_scV12-flye_map-to_hg38_GCA_p13.psort.sam.bam log: log/output/alignments/contigs_to_reference/reference_assembly/clustered/NA24385_project_nextseq-npe_sseq/NA24385_project_pbrs2-ccs_1000_scV12-flye_map-to_hg38_GCA_p13.minimap.log, log/output/alignments/contigs_to_reference/reference_assembly/clustered/NA24385_project_nextseq-npe_sseq/NA24385_project_pbrs2-ccs_1000_scV12-flye_map-to_hg38_GCA_p13.st-sort.log, log/output/alignments/contigs_to_reference/reference_assembly/clustered/NA24385_project_nextseq-npe_sseq/NA24385_project_pbrs2-ccs_1000_scV12-flye_map-to_hg38_GCA_p13.st-view.log (check log file(s) for error message) conda-env: /tmp/local/ericluzhang/csjjwang/data/diploid/work_dir/run_csr38/.snakemake/conda/c3cfe377 shell: rm -rfd temp/minimap/reference_assembly/clustered/NA24385_project_nextseq-npe_sseq/NA24385_project_pbrs2-ccs_1000_scV12-flye/hg38_GCA_p13 ; mkdir -p temp/minimap/reference_assembly/clustered/NA24385_project_nextseq-npe_sseq/NA24385_project_pbrs2-ccs_1000_scV12-flye/hg38_GCA_p13 && minimap2 -t 24 --secondary=no --eqx -Y -ax asm20 -m 10000 -z 10000,50 -r 50000 --end-bonus=100 -O 5,56 -E 4,1 -B 5 -R "@RG\tID:NA24385_project_pbrs2-ccs_1000_scV12-flye\tSM:NA24385" references/assemblies/hg38_GCA_p13.fasta output/reference_assembly/clustered/NA24385_project_nextseq-npe_sseq/NA24385_project_pbrs2-ccs_1000_scV12-flye.fasta 2> log/output/alignments/contigs_to_reference/reference_assembly/clustered/NA24385_project_nextseq-npe_sseq/NA24385_project_pbrs2-ccs_1000_scV12-flye_map-to_hg38_GCA_p13.minimap.log | samtools sort -m 8192M -T temp/minimap/reference_assembly/clustered/NA24385_project_nextseq-npe_sseq/NA24385_project_pbrs2-ccs_1000_scV12-flye/hg38_GCA_p13 2> log/output/alignments/contigs_to_reference/reference_assembly/clustered/NA24385_project_nextseq-npe_sseq/NA24385_project_pbrs2-ccs_1000_scV12-flye_map-to_hg38_GCA_p13.st-sort.log | samtools view -b -F 260 /dev/stdin > output/alignments/contigs_to_reference/reference_assembly/clustered/NA24385_project_nextseq-npe_sseq/NA24385_project_pbrs2-ccs_1000_scV12-flye_map-to_hg38_GCA_p13.psort.sam.bam 2> log/output/alignments/contigs_to_reference/reference_assembly/clustered/NA24385_project_nextseq-npe_sseq/NA24385_project_pbrs2-ccs_1000_scV12-flye_map-to_hg38_GCA_p13.st-view.log (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!) Job failed, going on with independent jobs.

And this is the error in the log file: NA24385_project_pbrs2-ccs_1000_scV12-flye_map-to_hg38_GCA_p13.st-sort.log

[E::sam_parse1] CIGAR and query sequence are of different length [W::sam_read1] Parse error at line 495 samtools sort: truncated file. Aborting

minimap2 --version

2.17-r941

samtools --version

samtools 1.10 Using htslib 1.10.2 Copyright (C) 2019 Genome Research Ltd.

The error shows it's related to CIGAR length issue, could you please help me solve this problem? Thanks.

ptrebert commented 3 years ago

Hi wjj666... sometimes SaaRclust tends to create huge sequence clusters (>300 Mbp) that create problems during the contig-to-reference alignment step. Can you please check the size of the assembly clusters?

This is a problem we are actively working on.

wjj666 commented 3 years ago

Hi wjj666... sometimes SaaRclust tends to create huge sequence clusters (>300 Mbp) that create problems during the contig-to-reference alignment step. Can you please check the size of the assembly clusters?

This is a problem we are actively working on.

Hi, ptrebert, thanks for your quick reply. This is the content of output/reference_assembly/clustered/NA24385_project_nextseq-npe_sseq/NA24385_project_pbrs2-ccs_1000_scV12-flye.fasta.fai:

cluster1 236706150 10 80 81 cluster10 154255392 239664998 80 81 cluster11 76780925 395848594 80 81 cluster12 75079269 473589292 80 81 cluster13 56008344 549607063 80 81 cluster14 187434834 606315523 80 81 cluster15 87561934 796093304 80 81 cluster16 738423 884749774 80 81 cluster17 116077159 885497439 80 81 cluster18 500218 1003025574 80 81 cluster19 82732890 1003532056 80 81 cluster2 130859884 1087299118 80 81 cluster20 131974599 1219794762 80 81 cluster21 2577592 1353419055 80 81 cluster22 38496522 1356028878 80 81 cluster23 141733432 1395006618 80 81 cluster24 35216108 1538511729 80 81 cluster25 61983643 1574168050 80 81 cluster26 5054252 1636926500 80 81 cluster27 424429 1642043942 80 81 cluster28 711670 1642473688 80 81 cluster29 481538 1643194265 80 81 cluster3 535034644 1643681833 80 81 cluster30 1908690 2185404422 80 81 cluster4 1188269 2187336981 80 81 cluster5 78847034 2188540114 80 81 cluster6 97862871 2268372746 80 81 cluster7 168797174 2367458913 80 81 cluster8 222646057 2538366062 80 81 cluster9 177166527 2763795205 80 81

ptrebert commented 3 years ago

I am somewhat surprised by the number of sequence clusters you got there - what Strand-seq data are you using?

Apart from that, cluster3 is larger than 500 Mbp, which would explain the error you observed. In the current state, this could only be fixed manually by breaking the sequence into 3 parts.

wjj666 commented 3 years ago

I am somewhat surprised by the number of sequence clusters you got there - what Strand-seq data are you using?

Apart from that, cluster3 is larger than 500 Mbp, which would explain the error you observed. In the current state, this could only be fixed manually by breaking the sequence into 3 parts.

The Strand-seq data was downloaded from: https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/Strand-Seq_BCCRC/

This is configure information about the clusters (generated with ):

# SaaRclust parameter sets # goal is to obtain approx. # 24 clusters (for human) min_contig_size: 100000 min_region_to_order: 500000 bin_size: 200000 step_size: 200000 prob_threshold: 0.25 init_clusters: 100 desired_clusters: 24 min_mapq: 10

# this solves a known HET inversion located on chr8 sample_non_default_parameters: HG00733: use_only_in: write_saarclust_config_file: init_clusters: 150 desired_clusters: 25 NA24385: use_only_in: write_saarclust_config_file: init_clusters: 150 desired_clusters: 25 NA20847: use_only_in: write_saarclust_config_file: desired_clusters: 23 HG00864: use_only_in: write_saarclust_config_file: desired_clusters: 23

Are there problems with the data or config file I used?

ptrebert commented 3 years ago

The config is fine, but it seems the Strand-seq signal is not sufficient for a good clustering of the assembled sequences. Do you have the bandwidth to run the assembly with a different set of Strand-seq libraries for HG002 as a comparison? There is a publicly available set of libraries as part of the HPG data sources: https://github.com/human-pangenomics/HG002_Data_Freeze_v1.0

wjj666 commented 3 years ago

Thanks. I will download the Strand-seq data and run the assembly as a comparison.

ptrebert commented 3 years ago

One thing I forgot: unless you are doing your own quality selection of Strand-seq libraries, for the HPG HG002 data, we used the following list of libraries for the assembly (CSV is in this repo): annotation/NA24385_selected_libraries_sseq.csv

wjj666 commented 3 years ago

Thank you very much. I will use this list of libraries for assembly.

wjj666 commented 3 years ago

Hi, after breaking the cluster3 into 3 parts, it run well until this error occurred:

[Sat Jun 26 12:44:20 2021] rule racon_contig_polishing_pass1: input: output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/draft/haploid_fastq/NA24385_project_pbrs2-ccs_1000.h1-un.cluster29.fastq.gz, output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/draft/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h1-un.cluster29.fasta, output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/draft/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h1-un.cluster29.fasta.fai, output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/alignments/NA24385_project_pbrs2-ccs_1000_map-to_NA24385_project_pbrs2-ccs_1000-flye.h1-un.cluster29.racon-p1.psort.sam output: output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/NA24385_project_pbrs2-ccs_1000/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h1-un.cluster29.racon-p1.fasta log: log/output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/NA24385_project_pbrs2-ccs_1000/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h1-un.cluster29.racon-p1.log jobid: 496 benchmark: run/output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/NA24385_project_pbrs2-ccs_1000/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h1-un.cluster29.racon-p1.t50.rsrc wildcards: var_caller=freebayes, qual=10, gq=100, reference=NA24385_project_pbrs2-ccs_1000_scV12-flye, vc_reads=NA24385_project_pbrs2-ccs_1000, sseq_reads=NA24385_project_nextseq-npe_sseq, pol_reads=NA24385_project_pbrs2-ccs_1000, hap_reads=NA24385_project_pbrs2-ccs_1000, assembler=flye, hap=h1-un, sequence=cluster29 threads: 24 resources: mem_per_cpu_mb=245, mem_total_mb=12288, runtime_hrs=1, runtime_min=59

Activating conda environment: //work_dir/run_csr38/.snakemake/conda/c3cfe377 [Sat Jun 26 12:44:24 2021] Error in rule racon_contig_polishing_pass1: jobid: 496 output: output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/NA24385_project_pbrs2-ccs_1000/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h1-un.cluster29.racon-p1.fasta log: log/output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/NA24385_project_pbrs2-ccs_1000/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h1-un.cluster29.racon-p1.log (check log file(s) for error message) conda-env: //work_dir/run_csr38/.snakemake/conda/c3cfe377 shell: racon --threads 24 --include-unpolished output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/draft/haploid_fastq/NA24385_project_pbrs2-ccs_1000.h1-un.cluster29.fastq.gz output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/alignments/NA24385_project_pbrs2-ccs_1000_map-to_NA24385_project_pbrs2-ccs_1000-flye.h1-un.cluster29.racon-p1.psort.sam output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/draft/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h1-un.cluster29.fasta > output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/NA24385_project_pbrs2-ccs_1000/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h1-un.cluster29.racon-p1.fasta 2> log/output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/NA24385_project_pbrs2-ccs_1000/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h1-un.cluster29.racon-p1.log (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Job failed, going on with independent jobs.

[Sat Jun 26 12:44:24 2021] rule racon_contig_polishing_pass1: input: output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/draft/haploid_fastq/NA24385_project_pbrs2-ccs_1000.h2-un.cluster29.fastq.gz, output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/draft/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h2-un.cluster29.fasta, output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/draft/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h2-un.cluster29.fasta.fai, output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/alignments/NA24385_project_pbrs2-ccs_1000_map-to_NA24385_project_pbrs2-ccs_1000-flye.h2-un.cluster29.racon-p1.psort.sam output: output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/NA24385_project_pbrs2-ccs_1000/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h2-un.cluster29.racon-p1.fasta log: log/output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/NA24385_project_pbrs2-ccs_1000/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h2-un.cluster29.racon-p1.log jobid: 592 benchmark: run/output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/NA24385_project_pbrs2-ccs_1000/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h2-un.cluster29.racon-p1.t50.rsrc wildcards: var_caller=freebayes, qual=10, gq=100, reference=NA24385_project_pbrs2-ccs_1000_scV12-flye, vc_reads=NA24385_project_pbrs2-ccs_1000, sseq_reads=NA24385_project_nextseq-npe_sseq, pol_reads=NA24385_project_pbrs2-ccs_1000, hap_reads=NA24385_project_pbrs2-ccs_1000, assembler=flye, hap=h2-un, sequence=cluster29 threads: 24 resources: mem_per_cpu_mb=245, mem_total_mb=12288, runtime_hrs=1, runtime_min=59

Activating conda environment: //work_dir/run_csr38/.snakemake/conda/c3cfe377 [Sat Jun 26 12:44:27 2021] Error in rule racon_contig_polishing_pass1: jobid: 592 output: output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/NA24385_project_pbrs2-ccs_1000/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h2-un.cluster29.racon-p1.fasta log: log/output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/NA24385_project_pbrs2-ccs_1000/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h2-un.cluster29.racon-p1.log (check log file(s) for error message) conda-env: //work_dir/run_csr38/.snakemake/conda/c3cfe377 shell: racon --threads 24 --include-unpolished output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/draft/haploid_fastq/NA24385_project_pbrs2-ccs_1000.h2-un.cluster29.fastq.gz output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/alignments/NA24385_project_pbrs2-ccs_1000_map-to_NA24385_project_pbrs2-ccs_1000-flye.h2-un.cluster29.racon-p1.psort.sam output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/draft/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h2-un.cluster29.fasta > output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/NA24385_project_pbrs2-ccs_1000/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h2-un.cluster29.racon-p1.fasta 2> log/output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/NA24385_project_pbrs2-ccs_1000/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h2-un.cluster29.racon-p1.log (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Job failed, going on with independent jobs.

[Sat Jun 26 12:44:27 2021] rule minimap_racon_polish_alignment_pass2: input: output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/draft/haploid_fastq/NA24385_project_pbrs2-ccs_1000.h1-un.cluster29.fastq.gz, output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/draft/haploid_fastq/NA24385_project_pbrs2-ccs_1000.h1-un.cluster29.preset.minimap, output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/NA24385_project_pbrs2-ccs_1000/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h1-un.cluster29.racon-p1.fasta output: output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/alignments/NA24385_project_pbrs2-ccs_1000_map-to_NA24385_project_pbrs2-ccs_1000-flye.h1-un.cluster29.racon-p2.psort.sam log: log/output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/alignments/NA24385_project_pbrs2-ccs_1000_map-to_NA24385_project_pbrs2-ccs_1000-flye.h1-un.cluster29.racon-p2.minimap.log, log/output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/alignments/NA24385_project_pbrs2-ccs_1000_map-to_NA24385_project_pbrs2-ccs_1000-flye.h1-un.cluster29.racon-p2.st-sort.log jobid: 498 benchmark: run/output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/alignments/NA24385_project_pbrs2-ccs_1000_map-to_NA24385_project_pbrs2-ccs_1000-flye.h1-un.cluster29.racon-p2.t100.rsrc wildcards: var_caller=freebayes, qual=10, gq=100, reference=NA24385_project_pbrs2-ccs_1000_scV12-flye, vc_reads=NA24385_project_pbrs2-ccs_1000, sseq_reads=NA24385_project_nextseq-npe_sseq, pol_reads=NA24385_project_pbrs2-ccs_1000, hap_reads=NA24385_project_pbrs2-ccs_1000, assembler=flye, hap=h1-un, sequence=cluster29 threads: 24 resources: mem_per_cpu_mb=122, mem_total_mb=12288, runtime_hrs=0, runtime_min=59, mem_sort_mb=4096

Activating conda environment: /***/work_dir/run_csr38/.snakemake/conda/c3cfe377 [Sat Jun 26 12:44:31 2021] Finished job 498. 3 of 707 steps (0.42%) done

[Sat Jun 26 12:44:31 2021] rule minimap_racon_polish_alignment_pass2: input: output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/draft/haploid_fastq/NA24385_project_pbrs2-ccs_1000.h2-un.cluster29.fastq.gz, output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/draft/haploid_fastq/NA24385_project_pbrs2-ccs_1000.h2-un.cluster29.preset.minimap, output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/NA24385_project_pbrs2-ccs_1000/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h2-un.cluster29.racon-p1.fasta output: output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/alignments/NA24385_project_pbrs2-ccs_1000_map-to_NA24385_project_pbrs2-ccs_1000-flye.h2-un.cluster29.racon-p2.psort.sam log: log/output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/alignments/NA24385_project_pbrs2-ccs_1000_map-to_NA24385_project_pbrs2-ccs_1000-flye.h2-un.cluster29.racon-p2.minimap.log, log/output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/alignments/NA24385_project_pbrs2-ccs_1000_map-to_NA24385_project_pbrs2-ccs_1000-flye.h2-un.cluster29.racon-p2.st-sort.log jobid: 594 benchmark: run/output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/alignments/NA24385_project_pbrs2-ccs_1000_map-to_NA24385_project_pbrs2-ccs_1000-flye.h2-un.cluster29.racon-p2.t100.rsrc wildcards: var_caller=freebayes, qual=10, gq=100, reference=NA24385_project_pbrs2-ccs_1000_scV12-flye, vc_reads=NA24385_project_pbrs2-ccs_1000, sseq_reads=NA24385_project_nextseq-npe_sseq, pol_reads=NA24385_project_pbrs2-ccs_1000, hap_reads=NA24385_project_pbrs2-ccs_1000, assembler=flye, hap=h2-un, sequence=cluster29 threads: 24 resources: mem_per_cpu_mb=122, mem_total_mb=12288, runtime_hrs=0, runtime_min=59, mem_sort_mb=4096

Activating conda environment: /***/work_dir/run_csr38/.snakemake/conda/c3cfe377 [Sat Jun 26 12:44:35 2021] Finished job 594. 4 of 707 steps (0.57%) done Exiting because a job execution failed. Look above for error message

I checked the file log/output/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project_pbrs2-ccs_1000_scV12-flye/NA24385_project_pbrs2-ccs_1000/NA24385_project_nextseq-npe_sseq/polishing/NA24385_project_pbrs2-ccs_1000/haploid_assembly/NA24385_project_pbrs2-ccs_1000-flye.h2-un.cluster29.racon-p1.log:

[racon::Polisher::initialize] loaded target sequences 0.000317 s [racon::Polisher::initialize] loaded sequences 0.426341 s [racon::Polisher::initialize] error: empty overlap set!

And the related sam files are all empty. This is the link about all related fasta and fastq.gz files:

https://drive.google.com/drive/folders/1ZBccc2DoMXL-M16cHOF0D6x-VGd3LcJP?usp=sharing

Could you please help me? Thank you very much.

ptrebert commented 3 years ago

According to the Racon github, an empty overlap error could indicate a name mismatch. Before you broke this huge sequence cluster into three parts, did you run the pipeline till the end (the contig-to-reference alignment step that failed first is just used for evaluation, i.e. producing the phased assembly should work w/o it)? Or did you break the cluster into three parts, and then Snakemake detected a change in file modification time and rerun certain downstream jobs?

The files you uploaded are mixing haplotype partitions (h1, h2, h1-un and h2-un) and are somewhat "incompatible" with each other. However, when I align the cluster29 "h2-un" reads to the respective sequences (these two files match), I do not get any alignments because the preset for the Racon polishing step sets minimap's "-m" parameter to 5000 (~alignment length), and the three contigs in the FASTA file are between ~850 and ~3500 bp only. You could lower this parameter value to get alignments, but again, since you are assembling a human genome, the contig sizes are not reasonable. The contigs are smaller than the expected average read length for any current long-read technology I know. Did you QC your input long reads?

wjj666 commented 3 years ago

Hi, it occurred this error when running the rule plot_saarclust_haploid_assembly_diagnostic_output:

[Tue Jul 6 14:22:40 2021] Error in rule plot_saarclust_haploid_assembly_diagnostic_output: jobid: 20 output: output/plotting/saarclust_diagnostics/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project2_pbsq2-ccs_1000_scV12-flye/NA24385_project2_pbsq2-ccs_1000/NA24385_project2_NextSeq-npe_sseq/polishing/NA24385_project2_pbsq2-ccs_1000/clustering/NA24385_project2_pbsq2-ccs_1000-flye.h1-un.racon-p2.scV12_map-to_hg38_GCA_p13.clustering.pdf, output/plotting/saarclust_diagnostics/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project2_pbsq2-ccs_1000_scV12-flye/NA24385_project2_pbsq2-ccs_1000/NA24385_project2_NextSeq-npe_sseq/polishing/NA24385_project2_pbsq2-ccs_1000/clustering/NA24385_project2_pbsq2-ccs_1000-flye.h1-un.racon-p2.scV12_map-to_hg38_GCA_p13.orienting.pdf, output/plotting/saarclust_diagnostics/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project2_pbsq2-ccs_1000_scV12-flye/NA24385_project2_pbsq2-ccs_1000/NA24385_project2_NextSeq-npe_sseq/polishing/NA24385_project2_pbsq2-ccs_1000/clustering/NA24385_project2_pbsq2-ccs_1000-flye.h1-un.racon-p2.scV12_map-to_hg38_GCA_p13.ordering.pdf log: log/output/plotting/saarclust_diagnostics/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project2_pbsq2-ccs_1000_scV12-flye/NA24385_project2_pbsq2-ccs_1000/NA24385_project2_NextSeq-npe_sseq/polishing/NA24385_project2_pbsq2-ccs_1000/clustering/NA24385_project2_pbsq2-ccs_1000-flye.h1-un.racon-p2.scV12_map-to_hg38_GCA_p13.saarclust-diagnostics.log (check log file(s) for error message) conda-env: /tmp/local/ericluzhang/csjjwang/data/diploid/work_dir/new_work_dir/run_csr38_new/.snakemake/conda/b8d33113 shell: /tmp/local/ericluzhang/csjjwang/data/diploid/work_dir/project-diploid-assembly/scripts/plot_saarclust_diagnostics.R output/alignments/contigs_to_reference/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project2_pbsq2-ccs_1000_scV12-flye/NA24385_project2_pbsq2-ccs_1000/NA24385_project2_NextSeq-npe_sseq/polishing/NA24385_project2_pbsq2-ccs_1000/clustering/NA24385_project2_pbsq2-ccs_1000-flye.h1-un.racon-p2.scV12_map-to_hg38_GCA_p13.fixed.bed hg38 output/plotting/saarclust_diagnostics/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project2_pbsq2-ccs_1000_scV12-flye/NA24385_project2_pbsq2-ccs_1000/NA24385_project2_NextSeq-npe_sseq/polishing/NA24385_project2_pbsq2-ccs_1000/clustering/NA24385_project2_pbsq2-ccs_1000-flye.h1-un.racon-p2.scV12_map-to_hg38_GCA_p13 NA24385_project2_pbsq2-ccs_1000-flye.h1-un.racon-p2.scV12 TRUE &> log/output/plotting/saarclust_diagnostics/diploid_assembly/strandseq_split/freebayes_QUAL10_GQ100/NA24385_project2_pbsq2-ccs_1000_scV12-flye/NA24385_project2_pbsq2-ccs_1000/NA24385_project2_NextSeq-npe_sseq/polishing/NA24385_project2_pbsq2-ccs_1000/clustering/NA24385_project2_pbsq2-ccs_1000-flye.h1-un.racon-p2.scV12_map-to_hg38_GCA_p13.saarclust-diagnostics.log (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

This is the error in the log file:

Error in plt.df$ord.color[chr.idx] <- colors[plt.df$order[chr.idx]] : replacement has length zero Calls: plotClusteredContigs In addition: There were 15 warnings (use warnings() to see them) Execution halted

I am sorry that I could not find plt.df$ord.color[chr.idx] <- colors[plt.df$order[chr.idx]] in the R script: plot_saarclust_diagnostics.R. Could you please help me fix this error? Thank you very much.

ptrebert commented 3 years ago

Yes, you cannot find these lines of code in this R script because it is just the interface that calls the respective functions in the SaaRclust package, where the error is coming from. The rules that are affected by this error require the final phased and polished assemblies to be clustered (and scaffolded) once more using the Strand-seq data. This part of the pipeline has been deprecated in the current development branches b/c our internal evaluation showed that the Strand-seq signal is not generally good enough for a robust scaffolding (we are working on replacing that part using ONT reads). My recommendation would be to drop (not execute) that part of the pipeline and run your downstream analysis on the contig-level phased assemblies.