SMALT instead of SSAHA2?

GoogleCodeExporter commented 8 years ago

Hi,

Thanks for providing your great pipeline breseq... it is exactly what we needed 
for our analysis.

I have some questions:

Is it possible to use SMALT (http://www.sanger.ac.uk/resources/software/smalt/) 
for aligning the reads? It can produce native SSAHA2 output and is 
multithreaded and is maybe faster.

Is it also possible to allow choosing the read chemistry? From looking at the 
source code it seems to index always using the "solexa" flag. We played around 
in the source-code and changed it to 454 for our data. We also intend to add 
IonTorrent data of the same strain later to our project later and performance 
might decrease more when using the "solexa" indexing.

Which FASTQ format does breseq expect (DOI:10.1093/nar/gkp1137), i.e. which 
offset? Also we found that it fails when using minimal FASTQ files:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

and expects the SEQ_ID in both sequence and quality header:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+SEQ_ID
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Is the second ID important for functionality? Because if it would work without 
we could reduce file size and it would be more robust accepting both versions.

Thanks again.

Cheers

Markus

Original issue reported on code.google.com by 000.Cala...@googlemail.com on 25 May 2012 at 7:37

GoogleCodeExporter commented 8 years ago

Ah ok. I saw at the beginning of the output it is using SANGER FASTQ.

Original comment by 000.Cala...@googlemail.com on 25 May 2012 at 9:14

GoogleCodeExporter commented 8 years ago

Hi Markus,

Re: SMALT. That's a good question. 

We are working to increase the speed of the alignment steps without reducing 
the sensitivity to indels and finding new junctions, which are the strengths of 
breseq. We've experimented with using SMALT in the past. However, when I last 
checked, it did some things like not align indels the same way depending on 
what strand the read was from, that would have made it necessary to 
post-process the alignments. So, it didn't work as an easy drop-in.

Our strategy is probably going to be to use BWA to align perfect matches with 
high mapping scores first and then align the remaining reads with SSAHA2.

--Jeff

Original comment by jeffrey....@gmail.com on 26 May 2012 at 4:27

GoogleCodeExporter commented 8 years ago

Re: FASTQ

breseq should be able to recognize the format of your input FASTQ correctly, no 
matter how it starts. It then converts each FASTQ in the 01_sequence_conversion 
directory to SANGER format, including renaming the reads (because sometimes 
paired end reads have the same name, which confuses things, or other name 
formatting can cause problems for SSAHA2 alignment).

Re: @SEQ_ID and +SEQ_ID lines

The test data packaged with breseq that we use for consistency tests doesn't 
repeat the name on the + line. See tests/data/*. So, something else must be 
causing your problem. Let us know an we'll look into it.

Original comment by jeffrey....@gmail.com on 26 May 2012 at 4:32

GoogleCodeExporter commented 8 years ago

Forgot to mention that if there are certain parts of the pipeline that you'd 
like us to expose so that you can use your own aligned reads in BAM format, 
then that's certainly possible. We could probably do any part except the 
junction prediction part, which requires a subsequent alignment step with 
SSAHA2.

Original comment by jeffrey....@gmail.com on 26 May 2012 at 5:31

GoogleCodeExporter commented 8 years ago

Hi Jeff,

thanks for your very informative reply.

Re: @SEQ_ID and +SEQ_ID lines

We are looking into that FASTQ header problem... maybe the header wasn't the 
problem, although the first obvious difference between the two files. Maybe it 
wasn't a FASTQ problem but a memory/disk space problem...

Re: SMALT

I see... 

Re: Hybrid mapping

I would like to point to you to STAMPY 
(http://genome.cshlp.org/content/21/6/936), which already has a hybrid mapping 
approach in mind, i.e. you can pass BWA as a premapper. STAMPY itself is better 
at mapping diverged sequences. Maybe it might be useful for you as an 
alternative to SSAHA2.

We are waiting for our additional sequence data and will report back if we run 
into problems.

Cheers,

Markus

Original comment by 000.Cala...@googlemail.com on 29 May 2012 at 10:05

GoogleCodeExporter commented 8 years ago

We have added an ability to use SAM files aligned by other programs to the next 
version 0.19.

Original comment by jeffrey....@gmail.com on 22 Jun 2012 at 1:46

Changed state: Fixed

GoogleCodeExporter commented 8 years ago

Great work! 

Markus

Original comment by 000.Cala...@googlemail.com on 26 Jun 2012 at 10:36

GoogleCodeExporter commented 8 years ago

If you'd like to try it out before we release a new "official" version you can 
download from here:

http://barricklab.org/release/breseq/breseq-0.18b.tar.gz

To use SAM files as input pass the flag (--aligned-sam). It will assume that 
you are inputing SAM files of aligned reads in place of where the FASTQ files 
would normally be provided.

Example:

breseq --aligned-sam -r reference aligned_reads.1.sam aligned_reads.2.sam ...

Original comment by jeffrey....@gmail.com on 26 Jun 2012 at 1:03

shane-c-lawson / breseq

SMALT instead of SSAHA2? #40