isovic / racon

Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads. http://genome.cshlp.org/content/early/2017/01/18/gr.214270.116 Note: This was the original repository which will no longer be officially maintained. Please use the new official repository here:
https://github.com/lbcb-sci/racon
MIT License

error: overlap is not transmuted! #215

Open plnspineda opened 2 years ago

plnspineda commented 2 years ago

I've seen some similar issues here (209, 203, 132, 103, 77), but most of them were caused by files being passed in the wrong order or by sequence names that did not match.

However, I keep encountering this error even after checking my commands and sequences. I am using Racon v1.4.20 via bioconda.

These are my commands:

    minimap2 haplotype-dam.fasta haplotype-dam.fasta.gz -o haplotype-dam.paf

    racon_wrapper -t 32 --split 97723193 --subsample 2636330455 20 haplotype-dam.fasta.gz haplotype-dam.paf haplotype-dam.fasta > rw-polished_haplotype-Dam.fasta

I've also checked if there are duplicates in my reads and found nothing:

    $ gunzip -c haplotype-dam.fasta.gz | uniq -c > countreads.list
    $ wc -l countreads.list
    23394555 countreads.list

    $ awk '{sum+=$1} END {print sum}' countreads.list
    23394555

The read sequence names are also the same as those in the PAF file.

I'm using PacBio CLR reads, which were haplotype-separated using Canu. Could using these processed reads be the cause of this error?
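A note on the duplicate check: `uniq -c` only collapses identical lines that are adjacent, and it looks at sequence lines as well as headers, so two reads with the same name that are far apart in the file would not show up in the counts. A minimal sketch of a name-based check is given here. It assumes the name is the header text up to the first whitespace (an assumption about how the names are compared downstream), and `duplicate_fasta_names` is a hypothetical helper, not part of Racon.

```python
import gzip
from collections import Counter


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def duplicate_fasta_names(path):
    """Return {name: count} for every FASTA record name seen more than once.

    Assumption: the name is the header text up to the first whitespace.
    """
    counts = Counter()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                counts[fields[0] if fields else ""] += 1
    return {name: n for name, n in counts.items() if n > 1}


if __name__ == "__main__":
    import sys

    # Usage: python check_duplicates.py haplotype-dam.fasta.gz
    for name, n in sorted(duplicate_fasta_names(sys.argv[1]).items()):
        print(f"{name}\t{n}")
```

An empty result means no read name occurs twice, regardless of where the records sit in the file.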

The error:

    [RaconWrapper::run] preparing data with rampler
    [RaconWrapper::run] total number of splits: 23
    [RaconWrapper::run] processing data with racon
    [racon::Polisher::initialize] loaded target sequences 0.619989 s
    [racon::Polisher::initialize] loaded sequences 8.035806 s
    [racon::Polisher::initialize] loaded overlaps 50.450574 s
    [racon::Overlap::find_breaking_points] error: overlap is not transmuted!

I saw someone commented that this issue would come up with 4 out of more than 200 assemblies using racon. What could be the reason why this error is happening?

Thank you!

plnspineda commented 2 years ago

I tried racon with a subsample of the reads and a subsample of the assembly FASTA. I also renamed the contigs so that the names contain no white space (just a number), and it worked.

So it works for the subsamples but not with the whole data set?

    $ racon test2.reads.fasta test2.overlap.paf test2.fasta > test2.racon.fasta
    [racon::Polisher::initialize] loaded target sequences 1.130559 s
    [racon::Polisher::initialize] loaded sequences 0.000698 s
    [racon::Polisher::initialize] loaded overlaps 0.000076 s
    [racon::Polisher::initialize] aligned overlaps 0.052002 s
    [racon::Polisher::initialize] transformed data into windows 0.116028 s
    [racon::Polisher::polish] generating consensus [====================] 1.039283 s
    [racon::Polisher::] total = 2.832840 s

rvaser commented 2 years ago

Hello, can you please try the latest version from https://github.com/lbcb-sci/racon?

Best regards, Robert

rvaser commented 2 years ago

Actually, bioconda has v1.5.0 as well so please update and see if the error persists.

plnspineda commented 2 years ago

Hi Robert,

Thank you for replying.

This is the error I got with bioconda Racon v1.5.0:

    [RaconWrapper::run] preparing data with rampler
    [RaconWrapper::run] error: unable to run rampler!

I do not get this error with v1.4.20.

(I'll just mention issue 81 here, since it's the same.)

plnspineda commented 2 years ago

Just an update in case others have the same problem.

I ran Racon with four read sets (two PacBio CLR and two ONT) to correct two ONT-assembled, haplotype-resolved FASTA files. The run succeeded for one of the ONT read sets but not for the other three.

Andy-B-123 commented 1 year ago

Hi, mentioning this in case someone else comes across it. I had the same error and found that it was caused by a mismatch between the FASTA headers used in the alignment and the headers in the FASTA I provided as input.

I was splitting the input FASTA file based on BED coordinates (basically a manual parallel process, as I can't install the wrapper script on my compute cluster) and found that extraction tools using BED format (e.g. seqtk subseq or bedtools getfasta) change the FASTA header to include the coordinates when they are provided in a BED file.

E.g., the input FASTA:

    >contig1
    ATCGACG...
    >contig2
    AGTCAGC...

BED file:

    contig1 1 60000

Running

    seqtk subseq $fasta $bed > output.fasta

gives:

    >contig1:1-60000
    ATCGACG...

An error message that specifically identifies the mismatched headers might be useful.
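Until such a message exists, the mismatch described above can be checked before running Racon. The following is a minimal sketch, not part of Racon: it assumes standard PAF columns (query name in column 1, target name in column 6) and that FASTA names are the header text up to the first whitespace. The function names are hypothetical.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def fasta_names(path):
    """Collect record names (header text up to the first whitespace)."""
    names = set()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                names.add(fields[0] if fields else "")
    return names


def paf_names_missing_from_fasta(paf_path, reads_path, targets_path):
    """Return (missing_queries, missing_targets): names used in the PAF
    that have no matching record in the reads / target FASTA files."""
    reads = fasta_names(reads_path)
    targets = fasta_names(targets_path)
    missing_queries, missing_targets = set(), set()
    with open_text(paf_path) as handle:
        for line in handle:
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 12:  # PAF has 12 mandatory columns
                continue
            if cols[0] not in reads:
                missing_queries.add(cols[0])
            if cols[5] not in targets:
                missing_targets.add(cols[5])
    return missing_queries, missing_targets


if __name__ == "__main__":
    import sys

    # Usage: python check_names.py overlaps.paf reads.fasta targets.fasta
    queries, targets = paf_names_missing_from_fasta(*sys.argv[1:4])
    print("query names not in reads:", sorted(queries)[:10])
    print("target names not in targets:", sorted(targets)[:10])
```

In the example above, a PAF that refers to `contig1` checked against a FASTA whose header is now `contig1:1-60000` would report `contig1` as a missing target.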