Closed BenjaminJPerry closed 6 years ago
Hello Ben, I think that your paired-end reads have equal names up to the first white space. Please verify that.
Best regards, Robert
Hi Robert,
Yes read 1 and read 2 for each paired set have equal read names up to the first white space. Is this the discrepancy with racon? Different lengths for each of the paired reads that have identical read names.
Is there a more appropriate way to handle paired-end reads using racon?
Cheers, Ben
$ zcat 24_S24_L001_R1_001.fastq.gz | grep -e @M | head @M03542:170:000000000-BL6WR:1:1101:22603:1684 1:N:0:ATCTCATG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:15599:1718 1:N:0:ATCTCAGG+TTTCCTCT @M03542:170:000000000-BL6WR:1:1101:14294:1748 1:N:0:ATCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:21219:1750 1:N:0:ATCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:22727:1756 1:N:0:ATCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:16018:1761 1:N:0:CTCTCAGG+TATCCTTT @M03542:170:000000000-BL6WR:1:1101:8108:1784 1:N:0:CTCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:19095:1785 1:N:0:CTCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:19081:1806 1:N:0:ATCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:15653:1812 1:N:0:ATCTCAGT+TATCCTCT $ zcat 24_S24_L001_R2_001.fastq.gz | grep -e @M | head @M03542:170:000000000-BL6WR:1:1101:22603:1684 2:N:0:ATCTCATG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:15599:1718 2:N:0:ATCTCAGG+TTTCCTCT @M03542:170:000000000-BL6WR:1:1101:14294:1748 2:N:0:ATCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:21219:1750 2:N:0:ATCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:22727:1756 2:N:0:ATCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:16018:1761 2:N:0:CTCTCAGG+TATCCTTT @M03542:170:000000000-BL6WR:1:1101:8108:1784 2:N:0:CTCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:19095:1785 2:N:0:CTCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:19081:1806 2:N:0:ATCTCAGG+TATCCTCT @M03542:170:000000000-BL6WR:1:1101:15653:1812 2:N:0:ATCTCAGT+TATCCTCT
Hi Ben,
racon treats all reads as single ends so no two reads should have the same identifier up to the first white space. You can add 1
and 2
before the white space or remove all white spaces from read names. Don't forget to update your overlap file or recreate it.
Best regards, Robert
I have a similar issue: how do I deal with multi-mapping reads?
Hello, could you please clarify what are multi-mapping reads? Do you have the same case where Illumina reads from a pair have equal names up to the first white space? Or something else?
Best regards, Robert
For instance the read 'HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547'
this one is twice in the read fastq because they are paired end:
$ zgrep -A 3 "HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547" reads.fastq.gz
@HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547
CCGCTGATCACCGCGCAAACACACAGCAGCAAAATTTCCTCCAGATCATAAAGCTTGTTCTTGTCCGAGCGGGGATCTTTGATCGTCGAAAAATGCGAGA
+
CCCFFFFFHHFHHJJJIJJJGIJJJJJJJJIJIGIJJJJIJIJJIIIIIJHIJJHGHEHFFFFFFEDDBDDDDDDBDDDDDDCDD<89>BBB<ACDDDD#
--
@HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547
TCGATATGATAGTCGTGGCCCGCCTTCTTGTTGAGCGCGATGGTAGCGCCGCCGTGCTTGGGTTGTTTGGTCTTTGCCATCGTGGAAGTGTAACCCAAGA
+
CCCFFFFFHHHHHJJJJJJJJJIJIJJIIIJJIFGIIIJGIJJGIJJHGFACDDBDDDDDDD?@DBBDCDACDDD@CCDCDDBBD?CDCCDDEDDDDDDB
and they are therefore 4 times in the alignment file:
$ grep HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547 mapped_ill-pe_hybr-genome.bwamem.sam
HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547 0 NODE_1_length_96688_cov_60.6554 1 60 5S95M * 0 0 CCGCTGATCACCGCGCAAACACACAGCAGCAAAATTTCCTCCAGATCATAAAGCTTGTTCTTGTCCGAGCGGGGATCTTTGATCGTCGAAAAATGCGAGCCCFFFFFHHFHHJJJIJJJGIJJJJJJJJIJIGIJJJJIJIJJIIIIIJHIJJHGHEHFFFFFFEDDBDDDDDDBDDDDDDCDD<89>BBB<ACDDDD# NM:i:0 MD:Z:95 AS:i:95 XS:i:60
HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547 0 NODE_15_length_49256_cov_64.9048 49063 60 100M * 0 0 TCGATATGATAGTCGTGGCCCGCCTTCTTGTTGAGCGCGATGGTAGCGCCGCCGTGCTTGGGTTGTTTGGTCTTTGCCATCGTGGAAGTGTAACCCAAGA CCCFFFFFHHHHHJJJJJJJJJIJIJJIIIJJIFGIIIJGIJJGIJJHGFACDDBDDDDDDD?@DBBDCDACDDD@CCDCDDBBD?CDCCDDEDDDDDDB NM:i:0 MD:Z:100 AS:i:100 XS:i:0
@sarah872, racon treats all reads as single ends so no two reads should have the same identifier up to the first white space. You should add 1 and 2 at the end of each read in a pair (e.g. @HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547:1 and @HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547:2). Don't forget to recreate the SAM file!
Thanks. Do you know of any available script that executes that for me?
@sarah872, I wrote you a simple python script. You can run it with python script.py input.fastq
or python script.py input_1.fastq input2.fastq
(the files must be uncompressed).
I am not sure what format of read names BWA-mem expects (the input is 2 files, maybe reads from a pair need to have the same name), so if this won't work properly you can run minimap2 instead with option -ax sr
.
EDIT: I edited the script so please use the latest version (there were some minor bugs).
#!/usr/bin/env python
from __future__ import print_function
import sys
def eprint(*args, **kwargs):
print(*args, file=sys.stderr, **kwargs)
def parse_file(file_name, read_set):
line_id = 0
name = ''
data = ''
qual = ''
valid = False
with (open(file_name)) as f:
for line in f:
if (line_id == 0):
if (valid):
if (len(name) == 0 or len(data) == 0 or len(data) != len(qual)):
eprint('File is not in FASTQ format')
sys.exit(1)
valid = False
if (name in read_set):
print(name + '2')
else:
read_set.add(name)
print(name + '1')
print(data)
print('+')
print(qual)
name = line.rstrip().split(' ')[0]
data = ''
qual = ''
line_id = 1
elif (line_id == 1):
if (line[0] == '+'):
line_id = 2
else:
data += line.rstrip()
elif (line_id == 2):
qual += line.rstrip()
if (len(qual) >= len(data)):
valid = True
line_id = 0
if (valid):
if (len(name) == 0 or len(data) == 0 or len(data) != len(qual)):
eprint(len(name), len(data), len(qual))
eprint('File is not in FASTQ format')
sys.exit(1)
if (name in read_set):
print(name + '2')
else:
read_set.add(name)
print(name + '1')
print(data)
print('+')
print(qual)
if __name__ == '__main__':
read_set = set()
if (len(sys.argv) > 1):
parse_file(sys.argv[1], read_set)
if (len(sys.argv) > 2):
parse_file(sys.argv[2], read_set)
Very nice of you!
unfortunately I am getting the following error:
File "rename_pe_reads.py", line 64, in <module>
read_set = sets.set()
AttributeError: module 'sets' has no attribute 'set'
It should be sets.Set()
as above.
still the same error:
AttributeError: module 'sets' has no attribute 'Set'
Are you using python3.x? If so replace the line with read_set = set()
and remove import sets
from the top of the script.
the script is running, but it is running without finishing. Also the file size increases stadily, from the original 1.7 GB file (7 Mio reads) to 34GB (although I stopped the script then...)
Ugh my bad, I updated the above script with everything so far (I forgot to clear variables between reads :D).
perfect, works! thanks!
@rvaser, thank you for sharing the script. For newbies like me: usage: Reads supplied as .fastq, either as one file or as two.
python <scriptname.py> <input.fastq> <input2.fastq> > output.fastq
Hi all, Would it not be enough just to consider only R1 reads?
Thank you in advance,
Michal
If you have decent coverage, sure.
Best regards, Robert
I have two types of FASTQ files:
This one is a new one with sanger style quality score:
> zcat 1740D-43-06_S0_L001_R1_001.fastq.gz | head -n 1
@E00526:39:HNMN5CCXY:6:1101:3792:993 1:N:0:NTTCAGAA+NTTCGCCT
> zcat 1740D-43-06_S0_L001_R2_001.fastq.gz | head -n 1
@E00526:39:HNMN5CCXY:6:1101:3792:993 2:N:0:NTTCAGAA+NTTCGCCT
the below dataset (old style Phred score Illumina 1.3+) is for a different organism:
> zcat OZBenth6_R1.fq.gz | head -n 1
@HWI-ST945_0069:6:1101:1694:1991#NNNNNN/1
> zcat OZBenth6_R2.fq.gz | head -n 1
@HWI-ST945_0069:6:1101:1694:1991#NNNNNN/2
Will you script be able to handle both types of headers?
Thank you in advance.
MIchal
I think it should.
Best regards, Robert
First dataset example:
> zcat 1740D-43-06_S0_L001_R1_001.fastq.gz | head -n 1
@E00526:39:HNMN5CCXY:6:1101:3792:993 1:N:0:NTTCAGAA+NTTCGCCT
> zcat 1740D-43-06_S0_L001_R2_001.fastq.gz | head -n 1
@E00526:39:HNMN5CCXY:6:1101:3792:993 2:N:0:NTTCAGAA+NTTCGCCT
You script generated this:
@E00526:39:HNMN5CCXY:6:1101:3792:9931
NTCATAACCATAGAATATGTAAATCTCTAAGTAGTAGTCTAAGACACCATATCCCTTTCGGGATCCGCCTATTCATTTATGCCCTACTACTTGAATACTTCTAATTATATTCCATTACTTTGCCTATCCAGTTCCGCTTGCATCGATATTT
+
#A<AFFFJJJ7FF7-A-J<FA-AF<FFFJA<JAF-A7-<-JFJJJ<<<JJJ--F7F<AJF-FFJFFFFF-J---7----<----A--77JFJA-----77-7---A----7----7--7--7----7---7---------7--7-------
@E00526:39:HNMN5CCXY:6:1101:3792:9932
NTGGCGCTATTATTATCTCTAAGGGAAGTTAAGGATGAACTAGAAGCAGATAGATTGGTTAGACATGGTTGAACCGGCATAGCGATGGCGATGCATGCGGGGCTCGATGGCCGTCGTAATTGATTCTATTTTGAAGTATTCAAGTATTAGG
+
#<<A<A---FAFA<-FA7FFJ7AJFJ<F-FF-F<<-F-<FAF<-FJ7--<A-<-7<FJ-7-F-A-<AAFJJJ<-JFA--FF<-77AFF7<-AFJAF<F7AAA--7-7A-F-AF77A7-7-<<-7F--7<<J-A-7-AAJF7A<-<-7<A-<
Second dataset example
> zcat OZBenth6_R1.fq.gz | head -n 1
@HWI-ST945_0069:6:1101:1694:1991#NNNNNN/1
> zcat OZBenth6_R2.fq.gz | head -n 1
@HWI-ST945_0069:6:1101:1694:1991#NNNNNN/2
You script generated this:
@HWI-ST945_0069:2:1101:1483:1976#GNNNNN/11
NAGGCTCATTCTTAAAACCTATGCCATGCTTGGTTGACTTCAAATTAACAAGGACCTCATTCAANNTNTTGNNNNNAGANTTAAACTNNNNNAAGGTACC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-ST945_0069:2:1101:1483:1976#GNNNNN/21
NCANATTGCCCAAATATCAAAAAGGAGATGAAACCTTTTAAGAAGCAACGTGCATACATTTCTTGGGGAGGAGACTTCGGAGATGAGTCAAGCGAAGAAG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
Is it correct?
Thank you in advance,
Michal
Looks fine :)
Best regards, Robert
@sarah872, racon treats all reads as single ends so no two reads should have the same identifier up to the first white space. You should add 1 and 2 at the end of each read in a pair (e.g. @HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547:1 and @HWI-ST863:386:C5Y8UACXX:3:1107:14282:8547:2). Don't forget to recreate the SAM file!
Hi, for my help, what do you mean 'Don't forget to recreate the SAM file' .? Thank you
If you have created the SAM file from a FASTQ file which has sequences with identical names up to the first white space, and then change the header to distinguish them, you have to update the existing SAM file with new sequence headers or rerun the chosen tool to obtain a new SAM file.
Best regards, Robert
Thanks Rvaser,
The scripts works perfectly.
Hi,
Attempting to implement racon in nanopore polishing pipeline.
Using bowtie2 for alignment of concatenated fastq illumina read files, passing this and the concatenated read file to racon.
I get the following error code during execution:
Begining racon polish iteration: 1... [racon::Polisher::initialize] loaded target sequences [racon::Polisher::initialize] loaded sequences [racon::overlap::find_breaking_points] error: unequal lengths in sequence and overlap file for sequence M03542:170:000000000-BL6WR:1:1101:22727:1756!
Could you elaborate on the error message so I can troubleshoot?
Greatly appreciate you time!
Cheers, Ben
P.S. Here is code from pipeline: