isovic / racon

Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads. http://genome.cshlp.org/content/early/2017/01/18/gr.214270.116 Note: This was the original repository which will no longer be officially maintained. Please use the new official repository here:
https://github.com/lbcb-sci/racon
MIT License
271 stars 49 forks source link

Further explanation of 'error: unequal lengths in sequence and overlap file for sequence' #68

Closed BenjaminJPerry closed 6 years ago

BenjaminJPerry commented 6 years ago

Hi,

Attempting to implement racon in nanopore polishing pipeline.

Using bowtie2 for alignment of concatenated fastq illumina read files, passing this and the concatenated read file to racon.

I get the following error code during execution:

Begining racon polish iteration: 1... [racon::Polisher::initialize] loaded target sequences [racon::Polisher::initialize] loaded sequences [racon::overlap::find_breaking_points] error: unequal lengths in sequence and overlap file for sequence M03542:170:000000000-BL6WR:1:1101:22727:1756!

Could you elaborate on the error message so I can troubleshoot?

Greatly appreciate you time!

Cheers, Ben

P.S. Here is code from pipeline:

RACONBASEDIR=racoon0"$I"
mkdir $RACONBASEDIR
INDX="$RACONBASEDIR"/"$GENOMEBASE""$I"
printf "Building bowtie2 index: $INDX\n"
printf "Buidling...\n"
sleep 1
bowtie2-build $ITERGENOME $INDX
printf "Completed buidling bowtie2 index.\n\n"

#   Align Illmuina fastq reads
printf "Concatenating Illumina reads for alignment...\n"
CATREADS="$RACONBASEDIR"/temp.fastq.gz
zcat $ILLREAD1 $ILLREAD2 | gzip -c > $CATREADS
printf "Completed concatenation.\n\n"

ALIGNDREADS="$RACONBASEDIR"/"$GENOMEBASE"."$I".sam
printf "Aligning $CATREADS to $INDX with bowtie2...\n"

bowtie2 -p 12 --fast -x $INDX -U $CATREADS -S $ALIGNDREADS
printf  "Completer bowtie2 alignment.\n"
printf "Aligned reads ouput file: $ALIGNDREADS\n\n"

#   Pass genome.iteration, illumina.reads, and iteration.alignment to racon
printf "Begining racon polish iteration: $I...\n"
POLISHEDGENOME="$RACONBASEDIR"/"$GENOMEBASE".polish."$I".fasta
racon $CATREADS $ALIGNDREADS $ITERGENOME > $POLISHEDGENOME
rvaser commented 6 years ago

Hello Ben, I think that your paired-end reads have equal names up to the first white space. Please verify that.

Best regards, Robert

BenjaminJPerry commented 6 years ago

Hi Robert,

Yes read 1 and read 2 for each paired set have equal read names up to the first white space. Is this the discrepancy with racon? Different lengths for each of the paired reads that have identical read names.

Is there a more appropriate way to handle paired-end reads using racon?

Cheers, Ben

$ zcat 24_S24_L001_R1_001.fastq.gz | grep -e @M | head @M03542:170:000000000-BL6WR:1:1101:22603:1684 1:N:0:ATCTCATG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:15599:1718 1:N:0:ATCTCAGG+TTTCCTCT @M03542:170:000000000-BL6WR:1:1101:14294:1748 1:N:0:ATCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:21219:1750 1:N:0:ATCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:22727:1756 1:N:0:ATCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:16018:1761 1:N:0:CTCTCAGG+TATCCTTT @M03542:170:000000000-BL6WR:1:1101:8108:1784 1:N:0:CTCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:19095:1785 1:N:0:CTCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:19081:1806 1:N:0:ATCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:15653:1812 1:N:0:ATCTCAGT+TATCCTCT $ zcat 24_S24_L001_R2_001.fastq.gz | grep -e @M | head @M03542:170:000000000-BL6WR:1:1101:22603:1684 2:N:0:ATCTCATG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:15599:1718 2:N:0:ATCTCAGG+TTTCCTCT @M03542:170:000000000-BL6WR:1:1101:14294:1748 2:N:0:ATCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:21219:1750 2:N:0:ATCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:22727:1756 2:N:0:ATCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:16018:1761 2:N:0:CTCTCAGG+TATCCTTT @M03542:170:000000000-BL6WR:1:1101:8108:1784 2:N:0:CTCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:19095:1785 2:N:0:CTCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:19081:1806 2:N:0:ATCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:15653:1812 2:N:0:ATCTCAGT+TATCCTCT

rvaser commented 6 years ago

Hi Ben, racon treats all reads as single ends so no two reads should have the same identifier up to the first white space. You can add 1 and 2 before the white space or remove all white spaces from read names. Don't forget to update your overlap file or recreate it.

Best regards, Robert

sarah872 commented 6 years ago

I have a similar issue: how do I deal with multi-mapping reads?

rvaser commented 6 years ago

Hello, could you please clarify what are multi-mapping reads? Do you have the same case where Illumina reads from a pair have equal names up to the first white space? Or something else?

Best regards, Robert

sarah872 commented 6 years ago

For instance the read 'HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547'

this one is twice in the read fastq because they are paired end:

$ zgrep -A 3 "HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547" reads.fastq.gz 
@HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547 
CCGCTGATCACCGCGCAAACACACAGCAGCAAAATTTCCTCCAGATCATAAAGCTTGTTCTTGTCCGAGCGGGGATCTTTGATCGTCGAAAAATGCGAGA
+
CCCFFFFFHHFHHJJJIJJJGIJJJJJJJJIJIGIJJJJIJIJJIIIIIJHIJJHGHEHFFFFFFEDDBDDDDDDBDDDDDDCDD<89>BBB<ACDDDD#
--
@HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547
TCGATATGATAGTCGTGGCCCGCCTTCTTGTTGAGCGCGATGGTAGCGCCGCCGTGCTTGGGTTGTTTGGTCTTTGCCATCGTGGAAGTGTAACCCAAGA
+
CCCFFFFFHHHHHJJJJJJJJJIJIJJIIIJJIFGIIIJGIJJGIJJHGFACDDBDDDDDDD?@DBBDCDACDDD@CCDCDDBBD?CDCCDDEDDDDDDB

and they are therefore 4 times in the alignment file:

$ grep HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547 mapped_ill-pe_hybr-genome.bwamem.sam 
HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547   0   NODE_1_length_96688_cov_60.6554 1   60  5S95M   *   0   0       CCGCTGATCACCGCGCAAACACACAGCAGCAAAATTTCCTCCAGATCATAAAGCTTGTTCTTGTCCGAGCGGGGATCTTTGATCGTCGAAAAATGCGAGCCCFFFFFHHFHHJJJIJJJGIJJJJJJJJIJIGIJJJJIJIJJIIIIIJHIJJHGHEHFFFFFFEDDBDDDDDDBDDDDDDCDD<89>BBB<ACDDDD# NM:i:0  MD:Z:95 AS:i:95 XS:i:60
HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547   0   NODE_15_length_49256_cov_64.9048    49063   60  100M    *   0   0   TCGATATGATAGTCGTGGCCCGCCTTCTTGTTGAGCGCGATGGTAGCGCCGCCGTGCTTGGGTTGTTTGGTCTTTGCCATCGTGGAAGTGTAACCCAAGA    CCCFFFFFHHHHHJJJJJJJJJIJIJJIIIJJIFGIIIJGIJJGIJJHGFACDDBDDDDDDD?@DBBDCDACDDD@CCDCDDBBD?CDCCDDEDDDDDDB    NM:i:0  MD:Z:100    AS:i:100    XS:i:0
rvaser commented 6 years ago

@sarah872, racon treats all reads as single ends so no two reads should have the same identifier up to the first white space. You should add 1 and 2 at the end of each read in a pair (e.g. @HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547:1 and @HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547:2). Don't forget to recreate the SAM file!

sarah872 commented 6 years ago

Thanks. Do you know of any available script that executes that for me?

rvaser commented 6 years ago

@sarah872, I wrote you a simple python script. You can run it with python script.py input.fastq or python script.py input_1.fastq input2.fastq (the files must be uncompressed).

I am not sure what format of read names BWA-mem expects (the input is 2 files, maybe reads from a pair need to have the same name), so if this won't work properly you can run minimap2 instead with option -ax sr.

EDIT: I edited the script so please use the latest version (there were some minor bugs).

#!/usr/bin/env python

from __future__ import print_function
import sys

def eprint(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)

def parse_file(file_name, read_set):
    line_id = 0
    name = ''
    data = ''
    qual = ''
    valid = False
    with (open(file_name)) as f:
        for line in f:
            if (line_id == 0):
                if (valid):
                    if (len(name) == 0 or len(data) == 0 or len(data) != len(qual)):
                        eprint('File is not in FASTQ format')
                        sys.exit(1)
                    valid = False
                    if (name in read_set):
                        print(name + '2')
                    else:
                        read_set.add(name)
                        print(name + '1')
                    print(data)
                    print('+')
                    print(qual)
                name = line.rstrip().split(' ')[0]
                data = ''
                qual = ''
                line_id = 1
            elif (line_id == 1):
                if (line[0] == '+'):
                    line_id = 2
                else:
                    data += line.rstrip()
            elif (line_id == 2):
                qual += line.rstrip()
                if (len(qual) >= len(data)):
                    valid = True
                    line_id = 0

    if (valid):
        if (len(name) == 0 or len(data) == 0 or len(data) != len(qual)):
            eprint(len(name), len(data), len(qual))
            eprint('File is not in FASTQ format')
            sys.exit(1)
        if (name in read_set):
           print(name + '2')
        else:
           read_set.add(name)
           print(name + '1')
        print(data)
        print('+')
        print(qual)

if __name__ == '__main__':

    read_set = set()

    if (len(sys.argv) > 1):
        parse_file(sys.argv[1], read_set)
    if (len(sys.argv) > 2):
        parse_file(sys.argv[2], read_set)
sarah872 commented 6 years ago

Very nice of you!

unfortunately I am getting the following error:

  File "rename_pe_reads.py", line 64, in <module>
    read_set = sets.set()
AttributeError: module 'sets' has no attribute 'set'
rvaser commented 6 years ago

It should be sets.Set() as above.

sarah872 commented 6 years ago

still the same error:

AttributeError: module 'sets' has no attribute 'Set'
rvaser commented 6 years ago

Are you using python3.x? If so replace the line with read_set = set() and remove import sets from the top of the script.

sarah872 commented 6 years ago

the script is running, but it is running without finishing. Also the file size increases stadily, from the original 1.7 GB file (7 Mio reads) to 34GB (although I stopped the script then...)

rvaser commented 6 years ago

Ugh my bad, I updated the above script with everything so far (I forgot to clear variables between reads :D).

sarah872 commented 6 years ago

perfect, works! thanks!

thsyd commented 5 years ago

@rvaser, thank you for sharing the script. For newbies like me: usage: Reads supplied as .fastq, either as one file or as two.

python <scriptname.py> <input.fastq> <input2.fastq> > output.fastq

mictadlo commented 5 years ago

Hi all, Would it not be enough just to consider only R1 reads?

Thank you in advance,

Michal

rvaser commented 5 years ago

If you have decent coverage, sure.

Best regards, Robert

mictadlo commented 5 years ago

I have two types of FASTQ files:

This one is a new one with sanger style quality score:

> zcat 1740D-43-06_S0_L001_R1_001.fastq.gz | head -n 1
@E00526:39:HNMN5CCXY:6:1101:3792:993 1:N:0:NTTCAGAA+NTTCGCCT
> zcat 1740D-43-06_S0_L001_R2_001.fastq.gz | head -n 1
@E00526:39:HNMN5CCXY:6:1101:3792:993 2:N:0:NTTCAGAA+NTTCGCCT

the below dataset (old style Phred score Illumina 1.3+) is for a different organism:

> zcat OZBenth6_R1.fq.gz | head -n 1
@HWI-ST945_0069:6:1101:1694:1991#NNNNNN/1
> zcat OZBenth6_R2.fq.gz | head -n 1
@HWI-ST945_0069:6:1101:1694:1991#NNNNNN/2

Will you script be able to handle both types of headers?

Thank you in advance.

MIchal

rvaser commented 5 years ago

I think it should.

Best regards, Robert

mictadlo commented 5 years ago

First dataset example:

> zcat 1740D-43-06_S0_L001_R1_001.fastq.gz | head -n 1
@E00526:39:HNMN5CCXY:6:1101:3792:993 1:N:0:NTTCAGAA+NTTCGCCT
> zcat 1740D-43-06_S0_L001_R2_001.fastq.gz | head -n 1
@E00526:39:HNMN5CCXY:6:1101:3792:993 2:N:0:NTTCAGAA+NTTCGCCT

You script generated this:

@E00526:39:HNMN5CCXY:6:1101:3792:9931
NTCATAACCATAGAATATGTAAATCTCTAAGTAGTAGTCTAAGACACCATATCCCTTTCGGGATCCGCCTATTCATTTATGCCCTACTACTTGAATACTTCTAATTATATTCCATTACTTTGCCTATCCAGTTCCGCTTGCATCGATATTT
+
#A<AFFFJJJ7FF7-A-J<FA-AF<FFFJA<JAF-A7-<-JFJJJ<<<JJJ--F7F<AJF-FFJFFFFF-J---7----<----A--77JFJA-----77-7---A----7----7--7--7----7---7---------7--7-------
@E00526:39:HNMN5CCXY:6:1101:3792:9932
NTGGCGCTATTATTATCTCTAAGGGAAGTTAAGGATGAACTAGAAGCAGATAGATTGGTTAGACATGGTTGAACCGGCATAGCGATGGCGATGCATGCGGGGCTCGATGGCCGTCGTAATTGATTCTATTTTGAAGTATTCAAGTATTAGG
+
#<<A<A---FAFA<-FA7FFJ7AJFJ<F-FF-F<<-F-<FAF<-FJ7--<A-<-7<FJ-7-F-A-<AAFJJJ<-JFA--FF<-77AFF7<-AFJAF<F7AAA--7-7A-F-AF77A7-7-<<-7F--7<<J-A-7-AAJF7A<-<-7<A-<

Second dataset example

> zcat OZBenth6_R1.fq.gz | head -n 1
@HWI-ST945_0069:6:1101:1694:1991#NNNNNN/1
> zcat OZBenth6_R2.fq.gz | head -n 1
@HWI-ST945_0069:6:1101:1694:1991#NNNNNN/2

You script generated this:

@HWI-ST945_0069:2:1101:1483:1976#GNNNNN/11
NAGGCTCATTCTTAAAACCTATGCCATGCTTGGTTGACTTCAAATTAACAAGGACCTCATTCAANNTNTTGNNNNNAGANTTAAACTNNNNNAAGGTACC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-ST945_0069:2:1101:1483:1976#GNNNNN/21
NCANATTGCCCAAATATCAAAAAGGAGATGAAACCTTTTAAGAAGCAACGTGCATACATTTCTTGGGGAGGAGACTTCGGAGATGAGTCAAGCGAAGAAG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

Is it correct?

Thank you in advance,

Michal

rvaser commented 5 years ago

Looks fine :)

Best regards, Robert

Tdanis commented 5 years ago

@sarah872, racon treats all reads as single ends so no two reads should have the same identifier up to the first white space. You should add 1 and 2 at the end of each read in a pair (e.g. @HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547:1 and @HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547:2). Don't forget to recreate the SAM file!

Hi, for my help, what do you mean 'Don't forget to recreate the SAM file' .? Thank you

rvaser commented 5 years ago

If you have created the SAM file from a FASTQ file which has sequences with identical names up to the first white space, and then change the header to distinguish them, you have to update the existing SAM file with new sequence headers or rerun the chosen tool to obtain a new SAM file.

Best regards, Robert

naahraissa commented 2 years ago

Thanks Rvaser,

The scripts works perfectly.