nanoporetech / medaka

Sequence correction provided by ONT Research
https://nanoporetech.com
Other
391 stars 73 forks source link

Medaka consensus error when stitching consensus chunks together #503

Closed mgils4 closed 1 month ago

mgils4 commented 2 months ago

Hi, I am encountering a bug when using Medaka, but I can't figure out how I could solve it. Would you happen to have an idea as to what I might be doing wrong. Thank you in advance.

Describe the bug I am trying to polish a consensus sequence created with Samtools consensus (from a Minimap2 alignment), but I keep getting errors whenever it reaches the stitching of the consensus chunks

The command used:

medaka_consensus -m r1041_e82_400bps_fast_g632 -i Data/q10_filtered/barcode_05_filtered.fastq -d Consensus_sequences/barcode_05_q10_consensus.fasta -o Medaka_polished/B5_Q10_polished/B5_Q10_polished.fasta

Logging Please attach any relevant logging messages. (Use ``` before and after code blocks).

TF_CPP_MIN_LOG_LEVEL is set to '3'
Checking program versions
This is medaka 1.11.3
Program    Version    Required   Pass
bcftools   1.17       1.11       True
bgzip      1.17       1.11       True
minimap2   2.28       2.11       True
samtools   1.18       1.11       True
tabix      1.17       1.11       True
[16:11:49 - MdlStrTF] Successfully removed temporary files from /tmp/tmpykjebj1u.
[16:11:49 - MdlStrTF] Successfully removed temporary files from /tmp/tmpnjs2qkcq.
Aligning basecalls to draft
Creating fai index file /mnt/studentfiles/2024/2024MBI11/Project/Minimapconsensus/Consensus_sequences/barcode_05_q10_consensus.fasta.fai
Creating mmi index file /mnt/studentfiles/2024/2024MBI11/Project/Minimapconsensus/Consensus_sequences/barcode_05_q10_consensus.fasta.map-ont.mmi
[M::mm_idx_gen::0.002*2.33] collected minimizers
[M::mm_idx_gen::0.004*2.58] sorted minimizers
[M::main::0.008*1.93] loaded/built the index for 1 target sequence(s)
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.008*1.91] distinct minimizers: 178 (100.00% are singletons); average occurrences: 1.000; average spacing: 6.612; total length: 1177
[M::main] Version: 2.28-r1209
[M::main] CMD: minimap2 -I 16G -x map-ont -d /mnt/studentfiles/2024/2024MBI11/Project/Minimapconsensus/Consensus_sequences/barcode_05_q10_consensus.fasta.map-ont.mmi /mnt/studentfiles/2024/2024MBI11/Project/Minimapconsensus/Consensus_sequences/barcode_05_q10_consensus.fasta
[M::main] Real time: 0.008 sec; CPU: 0.015 sec; Peak RSS: 0.003 GB
[M::main::0.005*1.49] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.005*1.48] mid_occ = 10
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.005*1.46] distinct minimizers: 178 (100.00% are singletons); average occurrences: 1.000; average spacing: 6.612; total length: 1177
[M::worker_pipeline::43.154*0.95] mapped 159551 sequences
[M::main] Version: 2.28-r1209
[M::main] CMD: minimap2 -x map-ont --secondary=no -L --MD -A 2 -B 4 -O 4,24 -E 2,1 -t 1 -a /mnt/studentfiles/2024/2024MBI11/Project/Minimapconsensus/Consensus_sequences/barcode_05_q10_consensus.fasta.map-ont.mmi /mnt/studentfiles/2024/2024MBI11/Project/Minimapconsensus/Data/q10_filtered/barcode_05_filtered.fastq
[M::main] Real time: 43.170 sec; CPU: 40.993 sec; Peak RSS: 0.435 GB
Running medaka consensus
[16:12:41 - Predict] Setting tensorflow inter/intra-op threads to 1/1.
[16:12:41 - Predict] Processing region(s): barcode_05_q10_SSI_RL_09_(NZ_CP039266.1:164815-166270):0-1177
[16:12:41 - Predict] Using model: /mnt/studentfiles/2024/2024MBI11/miniconda3/envs/Medaka/lib/python3.10/site-packages/medaka/data/r1041_e82_400bps_fast_g615_model.tar.gz.
[16:12:42 - BAMFile] Creating pool of 16 BAM file sets.
[16:12:42 - Predict] Processing 1 short region(s).
[16:12:42 - MdlStrTF] Model <keras.engine.sequential.Sequential object at 0x7f56c81f0a30>
[16:12:42 - MdlStrTF] loading weights from /tmp/tmpuszo4fqy/model/variables/variables (using expect partial)
[16:12:42 - Sampler] Initializing sampler for consensus of region barcode_05_q10_SSI_RL_09_(NZ_CP039266.1:164815-166270):0-1177.
[16:12:42 - PWorker] Running inference for 0.0M draft bases.
[16:12:44 - Feature] Pileup counts do not span requested region, requested barcode_05_q10_SSI_RL_09_(NZ_CP039266.1:164815-166270):0-1177, received 0-1169.
[16:12:45 - Feature] Processed barcode_05_q10_SSI_RL_09_(NZ_CP039266.1:164815-166270):0.0-1169.0 (median depth 8197.0)
[16:12:45 - Sampler] Took 2.78s to make features.
[16:13:00 - PWorker] Batches in cache: 1.
[16:13:00 - PWorker] 82.2% Done (0.0/0.0 Mbases) in 17.9s
[16:13:00 - PWorker] Processed 1 batches
[16:13:00 - PWorker] All done, 0 remainder regions.
[16:13:00 - Predict] Finished processing all regions.
[16:13:00 - MdlStrTF] Successfully removed temporary files from /tmp/tmpuszo4fqy.
[16:13:01 - DataIndx] Loaded 1/1 (100.00%) sample files.
Traceback (most recent call last):
  File "/mnt/studentfiles/2024/2024MBI11/miniconda3/envs/Medaka/bin/medaka", line 11, in <module>
    sys.exit(main())
  File "/mnt/studentfiles/2024/2024MBI11/miniconda3/envs/Medaka/lib/python3.10/site-packages/medaka/medaka.py", line 814, in main
    args.func(args)
  File "/mnt/studentfiles/2024/2024MBI11/miniconda3/envs/Medaka/lib/python3.10/site-packages/medaka/stitch.py", line 209, in stitch
    req_regions = {
  File "/mnt/studentfiles/2024/2024MBI11/miniconda3/envs/Medaka/lib/python3.10/site-packages/medaka/stitch.py", line 210, in <setcomp>
    medaka.common.Region.from_string(r) for r in draft.references}
  File "/mnt/studentfiles/2024/2024MBI11/miniconda3/envs/Medaka/lib/python3.10/site-packages/medaka/common.py", line 643, in from_string
    start, end = [int(b) for b in bounds.split('-')]
  File "/mnt/studentfiles/2024/2024MBI11/miniconda3/envs/Medaka/lib/python3.10/site-packages/medaka/common.py", line 643, in <listcomp>
    start, end = [int(b) for b in bounds.split('-')]
ValueError: invalid literal for int() with base 10: '166270)'
Failed to stitch consensus chunks.

Environment (if you do not have a GPU, write No GPU):

Additional context I am working with demultiplexed data as a starting point, with sequence data of about 1100-1200 bp long, with the end goal of creating a pipeline that creates viable consensus sequences from ONT data. I have also tried using a couple different models (the default model, the r1041_e82_400bps_fast_g615 model and the r1041_e82_400bps_fast_g632 as shown in the command above, but everytime I get the same error.

(PS: I am still quite new to bioinformatics, so appologies if I did something wrong/incorrect)

mgils4 commented 2 months ago

I thought at first hand that it might be due to me not using Racon before Medaka, but seeing as the process is almost finished before the error, I assumed that it might still work? For as far as I understand, the error is trying to process a string as an integer, and I would assume that that is not changed by preprocessing with Racon. Any clue as to what I might be doing wrong?

mgils4 commented 1 month ago

Is this issue being worked on @cjw85 ?

cjw85 commented 1 month ago

Medaka is tripping up over the fact that you're input scaffold sequences have names that look a bit like intermediate names it creates for itself for chunks of data

barcode_05_q10_SSI_RL_09_(NZ_CP039266.1:164815-166270)

The work around here would be to remove the NZ_CP039266.1:164815-166270 from the FASTA file you are providing to medaka.

mgils4 commented 1 month ago

Hi @cjw85, thanks for your response, though i do have a quick question. The removal of part of the header, that would be for the scaffold sequences, correct?

cjw85 commented 1 month ago

Correct

mgils4 commented 1 month ago

Hi @cjw85, that was indeed the issue, thanks for the solution. I do have a small follow-up question: if I want to polish multiple assemblies, would it be better to do this as seperate runs per assembly file, or is it possible to use a concatenated FASTA file with multiple assemblies?

cjw85 commented 1 month ago

It would be best done as separate runs.