[racon::Polisher::initialize] error: duplicate sequence <file_name> with unequal data

isovic / racon

Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads. http://genome.cshlp.org/content/early/2017/01/18/gr.214270.116 Note: This was the original repository which will no longer be officially maintained. Please use the new official repository here:

https://github.com/lbcb-sci/racon

MIT License


[racon::Polisher::initialize] error: duplicate sequence <file_name> with unequal data #97

Open ghost opened 5 years ago

rvaser commented 5 years ago

Hi Duncan, the error means that there are two sequences with the same identifier but with different lengths (i.e. they are not equal). Are you by any chance using paired end sequences where reads in a pair have the same name up to the first white space?

Best regards, Robert
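One way to look for this condition is sketched below. It is only an illustration and not racon's own logic: the helper names `seq_lengths` and `unequal_dups` are made up here, the input is assumed to be uncompressed FASTA or FASTQ with four-line FASTQ records, and names are cut at the first whitespace.

```shell
# Print "name<TAB>length" for every record in the given FASTA/FASTQ files.
# Assumptions: uncompressed input; FASTQ records span exactly four lines.
seq_lengths() {
    awk '
        function emit() { if (name != "") print name "\t" len; name = "" }
        FNR == 1              { emit(); fastq = (substr($0, 1, 1) == "@") }
        fastq && FNR % 4 == 1 { name = substr($1, 2); next }   # header line
        fastq && FNR % 4 == 2 { len = length($0); emit(); next } # sequence line
        fastq                 { next }                          # "+" and quality
        /^>/                  { emit(); name = substr($1, 2); len = 0; next }
                              { len += length($0) }             # FASTA sequence
        END                   { emit() }
    ' "$@"
}

# Print every name that occurs with more than one distinct sequence length.
unequal_dups() {
    seq_lengths "$@" | sort -u | cut -f1 | sort | uniq -d
}

# Example call (file names are placeholders):
# unequal_dups reads.fastq contigs.fasta
```

An empty result means no name is shared by sequences of different lengths in the files checked.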

ghost commented 5 years ago

Hello Robert.

Below is an example of two header lines in my fasta file that I got as output after correcting with Canu:

1b05a25b-07ba-4a45-ac0f-4ae3ead718d9 runid=90e28bee03a6438bda9d3cb74d9e3105aa2ed89a sampleid=Barcodes read=331 ch=222 start_time=2018-10-29T13:51:43Z id=11 clr=0,2148

c50437e9-ec14-46d5-af5c-57981519079d runid=90e28bee03a6438bda9d3cb74d9e3105aa2ed89a sampleid=Barcodes read=1832 ch=390 start_time=2018-10-29T14:11:48Z id=19 clr=0,1058

In the above is the runid causing the error?

rvaser commented 5 years ago

Everything after the first space is not used (i.e. only the string before runid is stored). Make sure that all sequences in both contig and read file have unique names up to the first space.
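A quick way to test that requirement across both files at once might look like the sketch below. The function names `seq_names` and `dup_names` are hypothetical, and the sketch assumes uncompressed FASTA or FASTQ input with four-line FASTQ records.

```shell
# Print the identifier of every record: the header text up to the first
# whitespace, without the leading ">" or "@".
seq_names() {
    awk '
        FNR == 1              { fastq = (substr($0, 1, 1) == "@") }
        fastq && FNR % 4 == 1 { print substr($1, 2) }
        !fastq && /^>/        { print substr($1, 2) }
    ' "$@"
}

# Print each identifier that occurs more than once across all given files.
dup_names() {
    seq_names "$@" | sort | uniq -d
}

# Example call (file names are placeholders):
# dup_names reads.fastq contigs.fasta
```

Passing the read file and the contig file together matters here, because a name that is unique within each file can still appear in both.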

ghost commented 5 years ago

Hi,

The sequences in both the contig and read files have unique names, but I suspect they may have unequal read lengths, since Canu would have trimmed the sequences when it output the corrected reads. Is there any way to make Racon ignore the read lengths and go ahead with the polishing?

rvaser commented 5 years ago

Are you trying to polish contigs or correct reads with each other? Please provide me with your command and descriptions of input parameters.

ghost commented 5 years ago

Hi, I am trying to polish the sequence reads obtained from running the ONT MinION. I only have the original fastq files to work with and don't have the fast5 files, so I cannot use Nanopolish. Hence I am trying to use Racon for polishing, to get a final polished consensus I can use for comparing. Please see my command and input parameters for Racon below:

:~$ racon -t 2 <path/to/fastq_files> <path/to/sam_file> <path/to/correctedReads.fasta.gz> > correctedReads_racon.fasta

Notes:
- The fastq files are the original fastq files from the MinION experiment.
- The corrected reads were obtained from Canu using the following command: ./canu -p Read_prefix -d <path/to/output_for_CorrectedReads> genomeSize=2.1k -nanopore-raw /path/to/Reads.fastq
- The sam file was generated with minimap2 using the following command: ./minimap2 -ax map-ont <path/to/ref.fasta> <path/to/correctedReads.fasta.gz> > <path/to/correctedReads_aln.sam>

After running the command for Racon I get the following:

[racon::Polisher::initialize] loaded target sequences
[racon::Polisher::initialize] error: duplicate sequence 1b05a25b-07ba-4a45-ac0f-4ae3ead718d9 with unequal data

I am stuck and don't know what to do next. The only other alternative I have is to get the fast5 files and try polishing with Nanopolish instead of racon. But if you can assist me with resolving this error, I would greatly appreciate it.

rvaser commented 5 years ago

I have trouble understanding your description. What is in the <path/to/ref.fasta> file?

ghost commented 5 years ago

Hi,

I meant the directory path to access the reference genome fasta file

rvaser commented 5 years ago

What do you need the reference genome for?

ghost commented 5 years ago

It is required as input when generating an alignment in minimap2 mapping Oxford nanopore reads

rvaser commented 5 years ago

Let's start from the beginning. You have some ONT data, is it DNA or RNA? Are you trying to assemble the sequenced genome or just increase the read accuracy?

ghost commented 5 years ago

I hope you can help, and perhaps offer a solution for debugging the error that racon gives.

rvaser commented 5 years ago

If you want to polish your reads with racon you should run the following:

minimap2 -ax ava-ont --dual=yes <reads> <reads> > alignments.sam
racon -f <reads> alignments.sam <reads> > polished_reads.fasta

If you want to assemble your genome with canu and afterwards polish it again with racon, run the following:

canu -p <prefix> -d <directory> genomeSize=<size of the sequenced genome> -nanopore-raw <reads>
minimap2 -ax map-ont <canu contigs> <reads> > alignments.sam
racon <reads> alignments.sam <canu contigs> > polished_contigs.fasta

ghost commented 5 years ago

Thank you for the help. I will run the commands and let you know if all works out.

ghost commented 5 years ago

Thank you for your help. I managed to get polished contigs with your suggested approach.

I now want to compare the polished and unpolished contigs using Mummer to see how effective the polishing with racon was. If you have any suggestions for other scripts that can check the effectiveness of polishing with racon, I would very much appreciate it if you could share them.

rvaser commented 5 years ago

I have been only using dnadiff from the Mummer package.

ghost commented 5 years ago

Could you please send me the command you use with Mummer for dnadiff? I have version 4 beta and it doesn't seem to be giving me the output I expect.

rvaser commented 5 years ago

I am running dnadiff <reference file> <assembly file>, which creates several files. The summary of the comparison is in the *.report file.

ghost commented 5 years ago

I assume the reference file and assembly file are in sam format? Thanks, I will try it out. Thank you for the wonderful program you developed. It is very valuable when one does not have immediate access to the large fast5 files generated by the MinION but still wants a convenient way of polishing the fastq sequence files.

rvaser commented 5 years ago

Both files need to be in fasta format. Thank you for your kind words :)

ghost commented 5 years ago

I am running Mummer v4 beta and I keep getting one of two errors when I run the dnadiff program. It gives the following:

Error :- multiple query file input required in SAM output format

or if I first run the nucmer program to generate the delta file and then use it as input in the dnadiff program I get the following:

Error:- could not parse delta file error- no 400

I know you are not maintaining the Mummer program, but I would appreciate any advice from your experience of running it. Are you using version 4 or another version?

rvaser commented 5 years ago

I have been using this one: https://github.com/marbl/MUMmer3. I have not tried v4 yet. You can download it with git clone https://github.com/marbl/MUMmer3 and run make in the created directory.

RDhoelzle commented 3 years ago

Hi Robert,

I'm digging up a bit of an older thread here. I'm having a very similar problem to Duncan while trying to run a pipeline similar to that described here https://www.biorxiv.org/content/10.1101/645903v3.full.pdf (though I only have amplicon reads, ~4000bp, no UMIs). I've checked through my fastq read identifiers, and as far as I can tell they're all unique. The racon manual says to input commands in this order:

racon [options ...] <sequences> <overlaps> <target sequences>

but as per the discussion above, I also attempted:

racon [options ...] <target sequences> <overlaps> <sequences>

and got the same [racon::Polisher::initialize] error: duplicate sequence <read identifier> with unequal data error. So I'm a bit stumped.

Am I missing something?

Kind regards, Robert H

Here's my pipeline in short, starting from my base-called reads in fastq format:

I first quality and length trimmed my reads with NanoFilt (>3000bp, >qual 13):

NanoFilt -l 3000 -q 13 raw.fastq > trimmed.fastq

I next generated reference consensus reads with usearch (0.75 id, double stranded):

usearch -cluster_fast trimmed.fastq -id 0.75 -strand both -centroids reference.fa

Then I mapped the trimmed reads to the reference sequences with minimap2:

minimap2 -ax map-ont -t 5 reference.fa trimmed.fastq > mapped.sam

And finally attempted to polish the trimmed reads in racon (tried both configurations, got the same error):

racon -m 8 -x -6 -g -8 -w 500 -t 5 reference.fa mapped.sam trimmed.fastq > polished.fa
racon -m 8 -x -6 -g -8 -w 500 -t 5 trimmed.fastq mapped.sam reference.fa > polished.fa

I checked for repeat names with the following, but all names were unique:

grep '^@[a-z|0-9]*-' raw.fastq | sort | uniq -c
grep '^@[a-z|0-9]*-' trimmed.fastq | sort | uniq -c

rvaser commented 3 years ago

Hi Robert, the problem is that you have a sequence in trimmed.fastq and reference.fa that share a name. Try renaming your reference reads, rerun the minimap2 command and run racon as racon trimmed.fastq mapped.sam reference.fa.

Best regards, Robert V
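The renaming step could look like the sketch below. It is one possible approach and is not taken from the racon documentation: the function name `prefix_fasta_names`, the prefix `ref_`, and the output file name `reference.renamed.fa` are all placeholders chosen for illustration.

```shell
# Prepend a prefix to every FASTA identifier so that reference names can no
# longer collide with read names. Sequence lines are passed through unchanged.
prefix_fasta_names() {
    prefix=$1
    fasta=$2
    awk -v p="$prefix" '
        /^>/ { print ">" p substr($0, 2); next }
             { print }
    ' "$fasta"
}

# Possible usage with the files from this thread (output name is hypothetical):
# prefix_fasta_names ref_ reference.fa > reference.renamed.fa
# minimap2 -ax map-ont -t 5 reference.renamed.fa trimmed.fastq > mapped.sam
# racon -t 5 trimmed.fastq mapped.sam reference.renamed.fa > polished.fa
```

The mapping has to be redone after renaming, since the SAM file refers to the target sequences by name.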

RDhoelzle commented 3 years ago

Thanks Robert, this solved it