neufeld / pandaseq

PAired-eND Assembler for DNA sequences
GNU General Public License v3.0

pandaseq error "BADID" #7

Closed kdeangelis closed 11 years ago

kdeangelis commented 12 years ago

Pandaseq seems to not like the IDs in my fastq files. It will not assemble my PEs, and my run output looks like this:

ubuntu@ip-10-29-191-5:~$ pandaseq -f R1-20.pandID.fastq -r R3-20.pandID.fastq -F > panda_test.fastq
INFO VER pandaseq 2.0 andre@masella.name
ERR BADID @VA_1101_19100_2205
STAT TIME Wed Sep 5 15:47:30 2012
STAT ELAPSED 0
STAT READS 0
STAT NOALGN 0
STAT LOWQ 0
STAT OK 0
INFO API 1

These are MiSeq runs, but I can't figure out what it doesn't like. The originals looked like this:

@VARITEK:9:000000000-A1VLD:1:1101:19100:2205 1:N:0:
TNCGAAGGGGGCTAGCGTTGCTCGGAATCACTGGGCGTAAAGCGCACGTAGGCGGCTTTTTAAGTCAGAGGTGAAATCCTGGAGCTCAACTCCAGAACTGCCTTTGATACTGAGAAGCTCGAGTACGGGAGAGGTGAGTGGAACTGCGAG
+
?#55<??@DDDDDBDDFEEDEFHHHEFFHHHHHHHHHHHHHHHHHHHHHHFEHHHHHEFHHFFFFFFHDFFFFFFBFFDBDDEEFFFEEFAABCB4=CEE=A,=BEEFEEE=,,5:AE*_88??1).4A?)08::A:_?#########

but I also tried these IDs @VARITEK_1101_19100_2205 and @VARITEK-1101-19100-2205

Am I interpreting this error correctly?

Jimothyh commented 12 years ago

I appear to have duplicated this error, also with MiSeq data. I got the following output:

INFO VER pandaseq 2.0 andre@masella.name
ERR BADID @MISEQ:9:000000000-A1BL8:1:1101:17377:1320 1:N:0:
STAT TIME Tue Sep 18 11:30:14 2012
STAT ELAPSED 0
STAT READS 0
STAT NOALGN 0
STAT LOWQ 0
STAT OK 0
INFO API 1

Program works fine with the sample data. It seems there's something about MiSeq fastq files it somehow doesn't like?

kdeangelis commented 12 years ago

I solved the problem by using flash instead: http://bioinformatics.oxfordjournals.org/content/early/2011/09/07/bioinformatics.btr507.full http://genomics.jhu.edu/software/FLASH/index.shtml

I know this does not help with pandaseq, but it might help you with your assembly.

apmasell commented 12 years ago

The sequence IDs shown do not contain Illumina index reads. Have they been through the barcoding process?

apmasell commented 12 years ago

I've updated PANDAseq to include a -B switch that ignores the barcode. Please try it out and let me know if it works.

Jimothyh commented 12 years ago

@kdeangelis thanks for the suggestion!

@apmasell I'm not able to test it out right now, but hopefully I can let you know in a few days whether it works.

sathishzam commented 12 years ago

@apmasell I have also encountered the same issue with MiSeq data and the -B flag did not fix the issue (again there was a separate index read that is not incorporated into forward and reverse reads I am trying to align with PANDAseq):

0x11e3850 ERR BADID HWI-M00181:9:000000000-A1H1G:1:1101:13186:1697: @HWI-M00181:9:000000000-A1H1G:1:1101:13186:1697 1:N:0:

@HWI-M00181:9:000000000-A1H1G:1:1101:13186:1697 1:N:0: TTCGGACTACCAGGGTATCTAATCCACAGCGTATCTCGTATGCCGTCTTCTGCTTGCACGTCAGAACTCCAGTCAATAATCCACAGATCTCGTTTTTCGTCTTCTTATTTCTACTTTTTTTTTCTTTTTTTTTTTTTTTTTTCTCTTCTTCTTTACTCTCTCTTCTCATTGTTCTATATGTATTCTCTTTCATGTGTGTCGTCGATGTACAGAGTTATTGCTAATGTATTAGGTCTCTTCATGATGTTTT + 5==9,++5-5--@@@@ECEE>.AEAB8-6++@CEEFGDEEEDGFDEDEEDFFFFFF;F################################################################################################################################################################################################

apmasell commented 12 years ago

I've made another change that should make that work properly. Also, in the AXIOME repository, there is a program aq-marry-illumina-index to combine the 3-part forward, index, and reverse reads into forward and reverse reads with barcodes in the sequence names. Why some sequencing centres fail to do this is beyond my comprehension.
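To illustrate the idea of combining a separate index read with the forward and reverse reads, here is a minimal Python sketch. It is not the AXIOME aq-marry-illumina-index program and makes no claim about how that program works. The function names and the example index sequence are hypothetical, and it assumes the read and index files list their records in the same order.

```python
def read_id(header):
    """Return the part of a FASTQ header before the first whitespace."""
    return header.strip().split()[0]


def add_barcode(read_header, index_header, index_sequence):
    """Append the index read's sequence to a header whose tag field is empty.

    Assumption: an empty tag shows up as a header ending in ':' (for example
    '... 1:N:0:'), as in the headers quoted in this thread.
    """
    if read_id(read_header) != read_id(index_header):
        raise ValueError("read and index records are out of step")
    header = read_header.strip()
    if not header.endswith(":"):
        # The header already carries a tag, so leave it untouched.
        return header
    return header + index_sequence.strip()


forward = "@HWI-M00181:9:000000000-A1H1G:1:1101:13186:1697 1:N:0:"
index = "@HWI-M00181:9:000000000-A1H1G:1:1101:13186:1697 2:N:0:"
married = add_barcode(forward, index, "ACGTAC")  # "ACGTAC" is a made-up index
```

The identifier check is there because silently pairing records from files that have drifted out of order would attach the wrong barcode to a read.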

sathishzam commented 12 years ago

Thanks, now works with the -B flag.

sparky114 commented 11 years ago

Greetings,

I'm having the same issue with BADID, even with running -B. Output below. I need the index reads separate from the forward and reverse reads so I can run my data through QIIME. Any suggestions on merging PANDAseq and QIIME? I found the following QIIME forum discussing how to do this, but again, the index reads are an issue.

https://groups.google.com/forum/#!msg/qiime-forum/CO9EmR4FH58/vNuaaOyAv-cJ

pandaseq -f cat_R1.fastq -r cat_R2.fastq -B -F -o 75 > pandassembled.fastq
INFO VER pandaseq 2.4 andre@masella.name
INFO ARG[0] pandaseq
INFO ARG[1] -f
INFO ARG[2] cat_R1.fastq
INFO ARG[3] -r
INFO ARG[4] cat_R2.fastq
INFO ARG[5] -B
INFO ARG[6] -F
INFO ARG[7] -o
INFO ARG[8] 75
0x694890 ERR BADID HWI-ST1360:0::38:0:0:0: @HWI-ST1360:38:C17HNACXX:5:1101:5055:3999#0/1
STAT ELAPSED 0
STAT READS 0
STAT NOALGN 0
STAT LOWQ 0
STAT BADR 0
STAT OK 0

apmasell commented 11 years ago

I have never seen a sequence identifier like this. Do you know what version of the Illumina CASAVA pipeline produced it?

As for formatting data, we have AXIOME that runs PANDAseq, imports other FASTA files and runs various QIIME analyses on them, automatically.

sparky114 commented 11 years ago

It's 1.8.2, but my sequencing center modified the run so it matches the requirements for QIIME's split_libraries_fastq.py (link below).

http://qiime.org/tutorials/processing_illumina_data.html

I want to use PANDAseq because I sequenced V4 of 16S. With our HiSeq 2500, the forward and reverse reads completely overlap, so I want to error-correct the paired-end reads.

So I can use AXIOME to add the index reads to the headers to run PANDAseq, then reformat the output to remove the headers and use with QIIME? Is it actually the missing index read that is causing the error? I used -B.

apmasell commented 11 years ago

Can you get the unaltered sequence? The missing index is not causing the error; it is the unusual ordering of the data. Normally, headers for CASAVA 1.8 look like HWI-ST822:85:C05C3ACXX:1:1101:1171:2104 3:N:0:TAGACA. Note the lack of # and /.

AXIOME will modify the sequence headers for QIIME compatibility.

sparky114 commented 11 years ago

I used sed to fix the headers and it appears to be working. Thank you!
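The sed command itself was not posted. As a rough sketch of the kind of rewrite involved, the Python below turns a header ending in #index/read into the CASAVA 1.8 layout described above. The function name is hypothetical, and the filtered and control values ("N" and "0") are placeholder assumptions, because the older header layout does not carry them.

```python
import re

# Header layout from the error above: coordinates, then '#index/read'.
LEGACY_HEADER = re.compile(
    r"^@(?P<coords>[^\s#]+)#(?P<index>[^/\s]+)/(?P<read>[12])$"
)


def to_casava18(header, filtered="N", control="0"):
    """Rewrite '@coords#INDEX/READ' as '@coords READ:FILTERED:CONTROL:INDEX'.

    Headers that do not match the legacy layout are returned unchanged.
    'filtered' and 'control' are assumed placeholder values.
    """
    text = header.strip()
    match = LEGACY_HEADER.match(text)
    if match is None:
        return text
    return (
        f"@{match.group('coords')} "
        f"{match.group('read')}:{filtered}:{control}:{match.group('index')}"
    )


fixed = to_casava18("@HWI-ST1360:38:C17HNACXX:5:1101:5055:3999#0/1")
```

When applying something like this to a whole file, it should only touch the first line of each four-line FASTQ record, since quality lines can also begin with @.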

richrr commented 10 years ago

With MiSeq data, I get the following error. Primers and barcodes have been previously trimmed. -B does not help. Kindly help. Thanks

P.S. In header of sequence, I inserted spaces before and after the 100 (run number) to avoid auto-formatting on github.

$ pandaseq −f fwd.fastq −r rev.fastq -B
Ignoring extra arguments passed. You must supply both forward and reverse reads. Too confused to continue. Try -h for help.
$ pandaseq −f fwd.fastq −r rev.fastq -B -w out.fasta
Ignoring extra arguments passed. You must supply both forward and reverse reads. Too confused to continue. Try -h for help.
$ head -10 fwd.fastq rev.fastq
==> fwd.fastq <==
@M00720: 100 :000000000-A7YE1:1:1101:14230:2979 1:N:0:49
GGCACAAACGAGAGCTCGATGGCACTCTTCAAAAATCCATATCCACCTTGTGTGCAATGTTTGTTGGGAAAGTCTTTTCTTTCCCTTCATAAATATCAACCTATATCTTTAACAACATTCGTCTGATAACATATTATGAATATACTTAATTCAAAATATAACTTTCAACAACGGATCTCTTGGCTCTC
+

1>>1B111>111A1AF0E0B0AA1A0BD1D1BFG01BBEGFDDFGGFGH1FBFHFHFGG2GFBFFECGFAGDFGGFHHHHHHHHHHHFBHHHGHHHHFHHHHHHHEHHHHFHHGHEBFHHGGHEHGHGGHHHFGHHHHHHHHHHHHHHHHHFGHHEHHHHHHHHGGFHEGGGGGGGHHHHGGHHHEH @M00720: 100 :000000000-A7YE1:1:1101:14946:3013 1:N:0:49 GGCAAAATAAGAGTCTCATGGCACGTCTTAAACCCATATCCACCTTGTGTGCAATGTCAGTCGATCTTCTTCATGGAGATCGACCAAACATCAACCTTTATTTTTTAACTCTTTGTCTGAAAAATATTATGAATAAACAATTCAAAATACGACTTTCAACAACGGATCTCTTGGCTCTC + 11>>1B1BB1B11BB1A3311BBBABBBF3A11BEABBGHH2BBBBEFGFFFHHHFGHFFFGEFFHGHHHHHGHGFHHHGHHFGGGHGHFHHFHHHHHHHHHGHHGHHHHHHHHHHHHGGFHFGHFGFEFHHGFGGGHHEGGHFFHFHFHHGGGFGHGFHEHHGGGC@FDFGGFFGGCG @M00720: 100 :000000000-A7YE1:1:1101:19136:3176 1:N:0:49 TATTTAACTGGCGGCGATTGCGTACCCGTCGACCAAAATTAGGGTCAACGCTACCTGTAGGAAGTGTCCGCATAAAGTGCACCGCATGGAAATGAAGACGGCCATTAGCTGTACCATACTCAGGCACACAAAAATACTGATAGCAGTCGGCGTGTGAATCATTAGCCTTGCGACCCTCGGCAGCAAGAACCATACGACCAATATCACGAAAATAGTCACGCAAAGCATTGGGATTATCATAAAACGCCTC

==> rev.fastq <== @M00720: 100 :000000000-A7YE1:1:1101:14230:2979 2:N:0:49 GAGAGCCAAGAGATCCGTTGTTGAAAGTTATATTTTGAATTAAGTATATTCATAATATGTTATCAGACGAATGTTGTTAAAGATATAGGTTGATATTTATGAAGGGAAAGAAAAGACTTTCCCAACAAACATTGCACACAAGGTGGATATGGATTTTTAAAGAGTGCCATAAAACACTCGTTTGTGAATGATCCTTCCGCTGGTTCACAATTACCATAGTGTAGATCTCGGTGGTCGCCGTATCATTAAA + A@@AAAF44ACFB5AFAAB2AAADDBD555ABDFHHFAGCDGHGDFHHHHHHGGHHHHHHGHHHHHGHGGGHGAFHHHHHHHHGFGHHGFGGHHHHHGGHHHHHHGHGHHHFFHHGFHHGHHHFHGGHFHFGGGFDGHHGGHHDCFHGHHFFHDFBFHCGFHGHFGGHFFHHGHHHHHGF?FF<?FGGHHHFGGBDHBCEGCDCDG0DDGFGHGGDDGH=FDCCFC00CGA:A?;;CDD@:D.:B;0;BF @M00720: 100 :000000000-A7YE1:1:1101:14946:3013 2:N:0:49 GAGAGCCAAGAGATCCGTTGTTGAAAGTCGTATTTTGAATTGTTTATTCATAATATTTTTCAGACAAAGAGTTAAAAAATAAAGGTTGATGTTTGGTCGATCTCCATGAAGAAGATCGACTGACATTGCACACAAGGTGGATATGGATTTAAAAAGTGCCATAAAACACTTATTATGAATGATCCCTCCGCTGGTTCACAATTACCATAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAA + A???AAC44A?CB5BBB2B2AAABBDD5DAABBFGHFAHHFGDFGEDGHHHHGHGHGHHHHGHEFGFHHHFFFFFGHHGHHHHGHGGGHGFDEGEE?F>@FFGHHDFGFHHGHHG3EGEF?EFFDFBBB?FGGFGHG?FFEGDFDFHHHH3DBGFG@@@FFGCGHEGE0GGDF@GFDFF?FH1FFF1GGCDDCFCD0<DFB0D0=D0DGG=<;;GH0;CCCE.:A;0:9A9;A.CBC0;;BBFADCAAD @M00720: 100 :000000000-A7YE1:1:1101:19136:3176 2:N:0:49 CACGCGCACACGCTCCGCTATTCAGCGTTTGATGATTGCAATGCGACAGGCTCATGCTGATGGATGGTTTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGATTAGAGGCGTTTTATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGGTCGTATGGTTCTTGCTGCCGAGGGTCGCAAGGCTAATGCTTCACACGCCGACTGCTATCAGTATTTTTGTGTGCCTGAGTATGGTA

apmasell commented 10 years ago

Whatever program was used to trim the barcodes and primers has modified the headers. The flow cell identifier has been removed.

richrr commented 10 years ago

I do see the flowcell 000000000-A7YE1. Anyways, shouldn't the -B ignore this?

Given e.g.: HWI-ST822 :85: C05C3ACXX: 1: 1101: 1171: 2104 3: N: 0: TAGACA
Explanation: instrument :run: flowcell: lane: tile: x: y direction: filtered: flags: tag
My data: @M00720 : 100 : 000000000-A7YE1: 1: 1101: 14230: 2979 1: N: 0: 49
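The field breakdown above can be checked mechanically. The following is a minimal Python sketch that splits a header into the eleven named fields, using the field names from the explanation line. It is not PANDAseq's own parser, and the class and function names are hypothetical.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run: str
    flowcell: str
    lane: str
    tile: str
    x: str
    y: str
    direction: str
    filtered: str
    flags: str
    tag: str


def parse_header(line):
    """Split 'instrument:run:flowcell:lane:tile:x:y direction:filtered:flags:tag'."""
    text = line.strip()
    if text.startswith("@"):
        text = text[1:]
    parts = text.split(None, 1)  # location block, then read-info block
    if len(parts) != 2:
        raise ValueError("expected two whitespace-separated blocks")
    location = parts[0].split(":")
    read_info = parts[1].split(":")
    if len(location) != 7 or len(read_info) != 4:
        raise ValueError("unexpected number of colon-separated fields")
    return IlluminaHeader(*location, *read_info)


example = parse_header("HWI-ST822:85:C05C3ACXX:1:1101:1171:2104 3:N:0:TAGACA")
mine = parse_header("@M00720:100:000000000-A7YE1:1:1101:14230:2979 1:N:0:49")
```

Both headers split into the same eleven fields, and the second one has the flowcell field populated and a numeric tag.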

apmasell commented 10 years ago

No, -B only ignores missing barcodes. These have barcodes, though they are numeric. Also, the error is totally unrelated to barcodes. The error is probably due to pasting an em-dash (–) from the man page instead of typing a hyphen (-).
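Dash look-alikes are hard to spot by eye, so a quick check of the arguments can help. This is a small Python sketch, separate from PANDAseq, that reports any argument containing a character that resembles a hyphen but is not the ASCII one. The function name and the set of characters checked are my own choices for illustration.

```python
import unicodedata

# Characters that look like "-" but are not U+002D HYPHEN-MINUS.
DASH_LOOKALIKES = {
    "\u2010",  # hyphen
    "\u2011",  # non-breaking hyphen
    "\u2012",  # figure dash
    "\u2013",  # en dash
    "\u2014",  # em dash
    "\u2015",  # horizontal bar
    "\u2212",  # minus sign
}


def find_dash_lookalikes(args):
    """Return (position, argument, Unicode name) for each suspicious character."""
    problems = []
    for position, arg in enumerate(args):
        for char in arg:
            if char in DASH_LOOKALIKES:
                problems.append((position, arg, unicodedata.name(char)))
    return problems


# "\u2212f" stands in for an option pasted with a look-alike dash.
report = find_dash_lookalikes(["pandaseq", "\u2212f", "fwd.fastq", "-B"])
```

Passing sys.argv[1:] from a wrapper script, or the split text of a pasted command, would show which options the shell will hand over as something other than a real flag.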

richrr commented 10 years ago

Ah! You, Sir, are absolutely right! I had copied the options, and the hyphens came through as em-dashes. Retyping them explicitly fixed it. Thanks for the very quick response. It saved me a lot of anxiety and time. A quick question: what are the defaults for the algorithm and threshold?

Thanks.

apmasell commented 10 years ago

The default algorithm is the one in the paper (simple Bayesian) and the threshold is 0.6. This is in the manual page.