nanoporetech / medaka

Sequence correction provided by ONT Research
https://nanoporetech.com

medaka running but not improving assembly #476

Closed imogen-foote closed 6 months ago

imogen-foote commented 7 months ago

I have been trying to run medaka on my flye assembly (avian genome, ~1.3 Gb). The job runs to completion in about 1 hr and uses only ~2 GB of memory. When I check the consensus.fasta output against my draft assembly input, the assembly stats are identical, i.e. the medaka run appears to be doing nothing to improve the assembly. Because the job isn't failing and isn't giving me any error messages (as far as I can tell), I don't know how to troubleshoot this.

Here is the command I used, where -d points to my input assembly from flye and -i points to the reads I fed into flye for the assembly:

medaka_consensus -d $draft -i $out_dir/tmp_fastq/combined.reads.fastq -o $out_dir -m r1041_e82_400bps_sup_v4.1.0

I was using the most recent version of medaka (1.11.1) which I installed in a conda env using conda-forge medaka=1.11.1. I also tried downgrading the version of medaka to 1.11.0 to see if it was a bug in the recent version but I have the same issue.

Another weird thing is that the output log states that the output directory already exists and that it may use old results. However, I know the output directory did not exist before starting the run, and I create that directory at the beginning of my script.

NB. I also tried it with the flag -t ${NPROC}, setting NPROC=$(nproc), and got the error "medaka consensus: error: argument --threads: expected one argument", so I ended up removing this flag from my command. Not sure if this is relevant to the issue or not.
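One possible cause of the "expected one argument" message, not confirmed anywhere in this thread, is that NPROC was empty or unset when the command ran, so the threads option received no value. The sketch below resolves a thread count with a fallback and quotes the expansion. The helper name resolve_threads and the variable $reads are hypothetical and used only for illustration.

```shell
# Resolve a thread count; fall back to 1 if nproc is missing or
# prints something that is not a positive integer.
resolve_threads() {
    n=$(nproc 2>/dev/null || true)
    case $n in
        ''|*[!0-9]*|0) n=1 ;;
    esac
    printf '%s\n' "$n"
}

NPROC=$(resolve_threads)

# Quoting "$NPROC" means -t always receives exactly one argument.
# $reads is a placeholder for the FASTQ path used in the original command:
# medaka_consensus -d "$draft" -i "$reads" -o "$out_dir" -t "$NPROC" -m r1041_e82_400bps_sup_v4.1.0
```

Printing the value before the run (for example with echo "threads: $NPROC") makes an empty variable easy to spot in the job log.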

Output and error logs provided here. medaka.41440331.err.txt medaka.41440331.out.txt

cjw85 commented 7 months ago

Hi @imogen-foote,

The very start of your log suggests to me that something is going awry in the alignment of your reads to your assembly:

Using the existing mmi index file /scale_wlg_nobackup/filesets/nobackup/vuw03922/projects/AntipodeanAlbatross/data/Chapter3/flye/blue-45G_filtered_h50/assembly.fasta.map-ont.mmi
[M::mm_idx_gen::0.001*4.72] collected minimizers
[M::mm_idx_gen::0.003*5.01] sorted minimizers
[M::main::0.003*4.84] loaded/built the index for 0 target sequence(s)
[M::mm_mapopt_update::0.004*4.71] mid_occ = 10
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 0
[M::mm_idx_stat::0.004*4.59] distinct minimizers: 0 (-nan% are singletons); average occurrences: -nan; average spacing: -nan; total length

Could you try copying the assembly out of the flye assembly directory and rerunning? I think the tools are becoming confused by some of the other index files lying in that directory.
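The suggestion above can be sketched roughly as follows: copy the draft into a fresh directory that holds nothing else, then point -d at the copy. The function name stage_draft and the directory name draft_copy are hypothetical; the medaka_consensus options are the ones from the original command.

```shell
# Copy a draft assembly into a new, empty directory so that no
# pre-existing index files sit next to it. Prints the path of the copy.
stage_draft() {
    src=$1
    dest_dir=$2
    # Refuse to reuse a directory that might already hold old index files.
    if [ -e "$dest_dir" ]; then
        echo "refusing to reuse existing path: $dest_dir" >&2
        return 1
    fi
    mkdir -p "$dest_dir" || return 1
    cp "$src" "$dest_dir/" || return 1
    printf '%s\n' "$dest_dir/$(basename "$src")"
}

# Example usage (left commented out; paths come from the original post):
# draft_clean=$(stage_draft "$draft" "$out_dir/draft_copy")
# medaka_consensus -d "$draft_clean" -i "$out_dir/tmp_fastq/combined.reads.fastq" \
#     -o "$out_dir" -m r1041_e82_400bps_sup_v4.1.0
```

After rerunning, the minimap2 lines at the start of the log should report a non-zero number of target sequences rather than "0 target sequence(s)".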

imogen-foote commented 6 months ago

Hi @cjw85 Thanks heaps for your response. That seems to have worked and medaka has successfully finished running with improved assembly stats. Should that be standard practice (copying the assembly file to a different directory), or is that just a bug? Thank you!

cjw85 commented 6 months ago

I've never experienced this issue myself. It's not a bug in medaka per se: what I think is happening is that one of the tools used within the medaka_consensus helper script is reading the assembly file and an index of the reference, but those files are out of sync for some reason.
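One rough way to look for this kind of mismatch is to compare the number of sequences in the FASTA with the number of entries in an index beside it. The sketch below assumes a samtools-style .fai file (one line per sequence) next to the assembly. It does not inspect the minimap2 .mmi file shown in the log, and the function name fai_matches_fasta is hypothetical.

```shell
# Return 0 when the .fai index has one line per FASTA header,
# 1 when the counts differ, and 2 when no .fai file is present.
fai_matches_fasta() {
    fasta=$1
    fai=${2:-$fasta.fai}
    [ -f "$fai" ] || return 2
    # grep -c exits non-zero on zero matches but still prints 0.
    n_fasta=$(grep -c '^>' "$fasta" || true)
    n_fai=$(wc -l < "$fai" | tr -d '[:space:]')
    [ "$n_fasta" -eq "$n_fai" ]
}

# Example usage (commented out; $draft is the assembly path from the post):
# if ! fai_matches_fasta "$draft"; then
#     echo "index missing or out of sync with $draft" >&2
# fi
```

A mismatch here would suggest deleting the stale index, or working from a fresh copy of the assembly, before rerunning.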