nanoporetech / pipeline-umi-amplicon

Workflow to prepare high accuracy single molecule consensus sequences from amplicon data using unique molecular identifiers
Other
28 stars 11 forks source link

Cluster loss during rule reformat_filter_clusters #7

Open kathleenabadie opened 3 years ago

kathleenabadie commented 3 years ago

Hello, During the rule_reformat_filter_clusters step, only 36% of the clusters are being written, and I am not sure why. I understand this is happening in parse_clusters.py but have not been able to identify the reason for this large loss of clusters identified. Do you have any tips for troubleshooting this? My complete run output is copied below. See step 6, where only 36% of the initial 49660 clusters are written:

$ snakemake -j 30 reads --configfile config_te_pilot.yml Targets: target_bcl11b_te Building DAG of jobs... Using shell: /bin/bash Provided cores: 30 Rules claiming more threads will be scaled down. Job counts: count jobs 1 cluster 1 cluster_consensus 1 copy_bed 1 detect_umi_consensus_fasta 1 detect_umi_fasta 1 map_1d 2 map_consensus 1 polish_clusters 1 reads 1 reformat_consensus_clusters 1 reformat_filter_clusters 1 seqkit_bam_acc_tsv 1 split_reads 14 Select jobs to execute...

[Wed Dec 23 15:12:18 2020] rule map_1d: input: data/te_pilot/20201214_sequencing_run.fastq.tar.gz, data/te_pilot/bcl11b_te_reference.fasta output: te_pilot/align/1d.bam, te_pilot/align/1d.bam.bai jobid: 11 wildcards: name=te_pilot threads: 30

[M::mm_idx_gen::0.0020.00] collected minimizers [M::mm_idx_gen::0.0073.29] sorted minimizers [M::main::0.0073.27] loaded/built the index for 1 target sequence(s) [M::mm_mapopt_update::0.0083.18] mid_occ = 2 [M::mm_idx_stat] kmer size: 13; skip: 10; is_hpc: 0; #seq: 1 [M::mm_idx_stat::0.0083.11] distinct minimizers: 528 (100.00% are singletons); average occurrences: 1.000; average spacing: 5.225 [M::worker_pipeline::34.1579.68] mapped 213509 sequences [M::worker_pipeline::54.73413.15] mapped 212395 sequences [M::worker_pipeline::64.72414.21] mapped 211398 sequences [M::worker_pipeline::79.61315.89] mapped 214326 sequences [M::worker_pipeline::115.07715.69] mapped 214710 sequences [M::worker_pipeline::119.20515.19] mapped 215066 sequences [M::worker_pipeline::139.77615.51] mapped 216572 sequences [M::worker_pipeline::151.83815.92] mapped 218548 sequences [M::worker_pipeline::183.75116.53] mapped 217083 sequences [M::worker_pipeline::187.47316.23] mapped 209564 sequences [M::worker_pipeline::207.91216.37] mapped 209419 sequences [M::worker_pipeline::218.89416.57] mapped 209530 sequences [M::worker_pipeline::250.81016.79] mapped 208914 sequences [M::worker_pipeline::255.00116.54] mapped 210411 sequences [M::worker_pipeline::275.62916.72] mapped 208463 sequences [M::worker_pipeline::286.47516.80] mapped 209649 sequences [M::worker_pipeline::320.77616.77] mapped 209401 sequences [M::worker_pipeline::324.87016.58] mapped 208663 sequences [M::worker_pipeline::344.30216.64] mapped 208906 sequences [M::worker_pipeline::355.86816.81] mapped 209277 sequences [M::worker_pipeline::385.73816.95] mapped 209489 sequences [M::worker_pipeline::389.32916.81] mapped 208428 sequences [M::worker_pipeline::409.74616.84] mapped 210045 sequences [M::worker_pipeline::420.26416.90] mapped 210041 sequences [M::worker_pipeline::436.15117.33] mapped 210967 sequences [M::worker_pipeline::460.69417.09] mapped 211264 sequences [M::worker_pipeline::462.82017.01] mapped 121987 sequences [M::main] Version: 2.17-r941 [M::main] CMD: minimap2 -ax map-ont -k 13 -t 30 data/te_pilot/bcl11b_te_reference.fasta - [M::main] Real time: 462.893 sec; CPU: 7874.300 sec; Peak RSS: 2.822 GB [bam_sort_core] merging from 30 files and 5 in-memory blocks... [Wed Dec 23 15:24:01 2020] Finished job 11. 1 of 14 steps (7%) done Select jobs to execute...

[Wed Dec 23 15:24:01 2020] rule split_reads: input: te_pilot/align/1d.bam output: te_pilot/fasta_filtered, te_pilot/stats/umi_filter_reads_stats.txt jobid: 10 wildcards: name=te_pilot

[Wed Dec 23 15:24:01 2020] rule copy_bed: input: data/te_pilot/te_target_amplicon.bed output: te_pilot/targets.bed jobid: 1 wildcards: name=te_pilot

[Wed Dec 23 15:24:01 2020] Finished job 1. 2 of 14 steps (14%) done Reads found: 6763079 Reads unmapped: 488026 (0%) target_bcl11b_te Reads found: 3363560 On target: 5129999 (75%) 841189 concatamers - 16% 925250 short - 18% [Wed Dec 23 15:36:15 2020] Finished job 10. 3 of 14 steps (21%) done Select jobs to execute...

[Wed Dec 23 15:36:15 2020] rule detect_umi_fasta: input: te_pilot/fasta_filtered output: te_pilot/fasta_umi/target_bcl11b_te_detected_umis.fasta jobid: 9 wildcards: name=te_pilot, target=target_bcl11b_te

Counting reads in te_pilot/fasta_filtered/target_bcl11b_te.fastq 100%|███████████████████████████████████████████████████████████████| 3363560/3363560 [03:35<00:00, 15598.04it/s] Found 1656884 fwd and 1706676 rev reads (ratio: 0.9708251595499087) 71.71184697166098% of reads contained both UMIs with max 3 mismatches [Wed Dec 23 15:40:11 2020] Finished job 9. 4 of 14 steps (29%) done Select jobs to execute...

[Wed Dec 23 15:40:11 2020] rule cluster: input: te_pilot/fasta_umi/target_bcl11b_te_detected_umis.fasta output: te_pilot/clustering/target_bcl11b_te/clusters_centroid.fasta, te_pilot/clustering/target_bcl11b_te/clusters_consensus.fasta, te_pilot/clustering/target_bcl11b_te/vsearch_clusters jobid: 8 wildcards: name=te_pilot, target=target_bcl11b_te threads: 10

vsearch v2.15.1_linux_x86_64, 125.8GB RAM, 64 cores https://github.com/torognes/vsearch

Reading file te_pilot/fasta_umi/target_bcl11b_te_detected_umis.fasta 100%
134176110 nt in 2410777 seqs, min 50, max 60, avg 56 maxseqlength 60: 1294 sequences discarded. Sorting by length 100% Counting k-mers 100% Clustering 100%
Sorting clusters 100% Writing clusters 100% Clusters: 49660 Size min 1, max 2789, avg 48.5 Singletons: 13422, 0.6% of seqs, 27.0% of clusters Multiple alignments 100% [Wed Dec 23 15:43:31 2020] Finished job 8. 5 of 14 steps (36%) done Select jobs to execute...

[Wed Dec 23 15:43:31 2020] rule reformat_filter_clusters: input: te_pilot/clustering/target_bcl11b_te/clusters_consensus.fasta, te_pilot/clustering/target_bcl11b_te/vsearch_clusters output: te_pilot/clustering/target_bcl11b_te/clusters_fa, te_pilot/stats/target_bcl11b_te_vsearch_cluster_stats.tsv, te_pilot/clustering/target_bcl11b_te/smolecule_clusters.fa jobid: 7 wildcards: name=te_pilot, target=target_bcl11b_te

100%|█████████████████████████████████████████████████████████████████████| 49660/49660 [01:09<00:00, 713.84it/s] Clusters: 36% written (18041) Reads: 2410777 found Reads: 0 removed (0.0%) Reads: 25% written Reads: 95% in written clusters [Wed Dec 23 15:44:42 2020] Finished job 7. 6 of 14 steps (43%) done Select jobs to execute...

[Wed Dec 23 15:44:42 2020] rule polish_clusters: input: te_pilot/clustering/target_bcl11b_te/clusters_fa, te_pilot/clustering/target_bcl11b_te/smolecule_clusters.fa output: te_pilot/fasta/target_bcl11b_te_consensus_tmp, te_pilot/fasta/target_bcl11b_te_consensus.bam, te_pilot/fasta/target_bcl11b_te_consensus.fasta jobid: 6 wildcards: name=te_pilot, target=target_bcl11b_te threads: 30

[Wed Dec 23 21:31:06 2020] Finished job 6. 7 of 14 steps (50%) done Select jobs to execute...

[Wed Dec 23 21:31:06 2020] rule detect_umi_consensus_fasta: input: te_pilot/fasta/target_bcl11b_te_consensus.fasta output: te_pilot/fasta_umi/target_bcl11b_te_detected_umis_final.fasta jobid: 5 wildcards: name=te_pilot, target=target_bcl11b_te

[Wed Dec 23 21:31:06 2020] rule map_consensus: input: te_pilot/fasta/target_bcl11b_te_consensus.fasta, data/te_pilot/bcl11b_te_reference.fasta output: te_pilot/align/target_bcl11b_te_consensus.bam, te_pilot/align/target_bcl11b_te_consensus.bam.bai jobid: 13 wildcards: name=te_pilot, target=target_bcl11b_te, type=consensus threads: 3

[M::mm_idx_gen::0.0030.00] collected minimizers [M::mm_idx_gen::0.0050.84] sorted minimizers [M::main::0.0050.83] loaded/built the index for 1 target sequence(s) [M::mm_mapopt_update::0.0050.80] mid_occ = 2 [M::mm_idx_stat] kmer size: 13; skip: 10; is_hpc: 0; #seq: 1 [M::mm_idx_stat::0.005*0.78] distinct minimizers: 528 (100.00% are singletons); average occurrences: 1.000; average spacing: 5.225 Counting reads in te_pilot/fasta/target_bcl11b_te_consensus.fasta 100%|███████████████████████████████████████████████████████████████████| 18043/18043 [00:01<00:00, 15345.54it/s] Found 18015 fwd and 28 rev reads (ratio: 643.3928571428571) 99.29058360583052% of reads contained both UMIs with max 3 mismatches [Wed Dec 23 21:31:08 2020] Finished job 5. 8 of 14 steps (57%) done Select jobs to execute...

[Wed Dec 23 21:31:08 2020] rule cluster_consensus: input: te_pilot/fasta_umi/target_bcl11b_te_detected_umis_final.fasta output: te_pilot/clustering_consensus/target_bcl11b_te/clusters_centroid.fasta, te_pilot/clustering_consensus/target_bcl11b_te/clusters_consensus.fasta, te_pilot/clustering_consensus/target_bcl11b_te/vsearch_clusters jobid: 4 wildcards: name=te_pilot, target=target_bcl11b_te threads: 10

vsearch v2.15.1_linux_x86_64, 125.8GB RAM, 64 cores https://github.com/torognes/vsearch

Reading file te_pilot/fasta_umi/target_bcl11b_te_detected_umis_final.fasta 100%
1004613 nt in 17914 seqs, min 51, max 60, avg 56 maxseqlength 60: 1 sequence discarded. Sorting by length 100% Counting k-mers 100% Clustering 100%
Sorting clusters 100% Writing clusters 100% Clusters: 5796 Size min 1, max 19, avg 3.1 Singletons: 1756, 9.8% of seqs, 30.3% of clusters Multiple alignments 100% [Wed Dec 23 21:31:09 2020] Finished job 4. 9 of 14 steps (64%) done Select jobs to execute...

[Wed Dec 23 21:31:09 2020] rule reformat_consensus_clusters: input: te_pilot/clustering_consensus/target_bcl11b_te/clusters_consensus.fasta output: te_pilot/fasta/target_bcl11b_te_final.fasta jobid: 3 wildcards: name=te_pilot, target=target_bcl11b_te

[Wed Dec 23 21:31:09 2020] Finished job 3. 10 of 14 steps (71%) done Select jobs to execute...

[Wed Dec 23 21:31:09 2020] rule map_consensus: input: te_pilot/fasta/target_bcl11b_te_final.fasta, data/te_pilot/bcl11b_te_reference.fasta output: te_pilot/align/target_bcl11b_te_final.bam, te_pilot/align/target_bcl11b_te_final.bam.bai jobid: 2 wildcards: name=te_pilot, target=target_bcl11b_te, type=final threads: 3

[M::mm_idx_gen::0.0010.00] collected minimizers [M::mm_idx_gen::0.0020.00] sorted minimizers [M::main::0.0020.00] loaded/built the index for 1 target sequence(s) [M::mm_mapopt_update::0.0020.00] mid_occ = 2 [M::mm_idx_stat] kmer size: 13; skip: 10; is_hpc: 0; #seq: 1 [M::mm_idx_stat::0.0021.73] distinct minimizers: 528 (100.00% are singletons); average occurrences: 1.000; average spacing: 5.225 [M::worker_pipeline::2.1132.86] mapped 5796 sequences [M::main] Version: 2.17-r941 [M::main] CMD: minimap2 -ax map-ont -k 13 -t 3 data/te_pilot/bcl11b_te_reference.fasta te_pilot/fasta/target_bcl11b_te_final.fasta [M::main] Real time: 2.114 sec; CPU: 6.044 sec; Peak RSS: 0.033 GB [bam_sort_core] merging from 0 files and 5 in-memory blocks... [Wed Dec 23 21:31:11 2020] Finished job 2. 11 of 14 steps (79%) done [M::worker_pipeline::6.481*2.88] mapped 18043 sequences [M::main] Version: 2.17-r941 [M::main] CMD: minimap2 -ax map-ont -k 13 -t 3 data/te_pilot/bcl11b_te_reference.fasta te_pilot/fasta/target_bcl11b_te_consensus.fasta [M::main] Real time: 6.485 sec; CPU: 18.676 sec; Peak RSS: 0.069 GB [bam_sort_core] merging from 0 files and 5 in-memory blocks... [Wed Dec 23 21:31:13 2020] Finished job 13. 12 of 14 steps (86%) done Select jobs to execute...

[Wed Dec 23 21:31:13 2020] rule seqkit_bam_acc_tsv: input: te_pilot/align/target_bcl11b_te_consensus.bam output: te_pilot/stats/target_bcl11b_te_consensus_size_vs_acc.tsv jobid: 12 wildcards: name=te_pilot, target=target_bcl11b_te, stage=consensus

[Wed Dec 23 21:31:14 2020] Finished job 12. 13 of 14 steps (93%) done Select jobs to execute...

[Wed Dec 23 21:31:14 2020] localrule reads: input: te_pilot/targets.bed, te_pilot/align/target_bcl11b_te_final.bam.bai, te_pilot/stats/target_bcl11b_te_vsearch_cluster_stats.tsv, te_pilot/stats/target_bcl11b_te_consensus_size_vs_acc.tsv jobid: 0

[Wed Dec 23 21:31:14 2020] Finished job 0. 14 of 14 steps (100%) done

SemiQuant commented 3 years ago

I haven't really familiarized myself with the code yet, but I think some is due to singleton clusters and the rest may be due to the balance_strands flag, which I'm not really sure what it's doing or why but it needs the same forward and reverse reads for each cluster? (or something like that.. which would make sense if its the same read and the UMI on each end, but I'm not sure). Anyway, try adding 'balance_strands: False' to your config file and see if that does anything.