mbreese / swalign

Smith-Waterman local aligner
Other
67 stars 22 forks source link

swalign with long sequences #6

Closed Francois75012 closed 5 years ago

Francois75012 commented 9 years ago

two small sequences aligned perfectly well with swalign (ex. AGTCAGCCTGTAAACTGTAATATTTCTGATATGCAGATAATGCTTTTGAATAATCTTCCAATAAGAGGTTGAAGTG and CTTCCAATAGTCAGCCTGTAAACTGTAATATTTCTGAT). When it comes to larger sequences, the alignment is not working any more (ex. TGTACCCAGAGGATTTTGGAAACAGTTCAGGCTTTGGGGTGAACCTGTTAATCTTCGTGAACAACATGATGCTTTA and ATTCTTGCAATTGAAGGCACAGGTAGTGATACAGATGATAATATATATGGGGAAGAGAAGCAGGACATTGAGATTAATGTTGATCTTCGAGATGATGTTTTTGGATATCCTCATCAATTTGAAGACAAACCAGCATTAACTAAGACTGAAGATAGAAAAGAGTACAATATTGGTGTCCTAAGACACCTTCAGGTCGTCTTTGGTCATTTAGCTGCTTCCCGACTGCAATACTATGTACCCAGAGGATTTTGGAAACAGTTCAG) Could you explain why ?

mbreese commented 9 years ago

What settings did you use?

What output did you get? What did you expect?

Marcus

On Jun 8, 2015, at 8:35 AM, Francois75012 notifications@github.com wrote:

two small sequences aligned perfectly well with swalign (ex. AGTCAGCCTGTAAACTGTAATATTTCTGATATGCAGATAATGCTTTTGAATAATCTTCCAATAAGAGGTTGAAGTG and CTTCCAATAGTCAGCCTGTAAACTGTAATATTTCTGAT). When it comes to larger sequences, the alignment is not working any more (ex. TGTACCCAGAGGATTTTGGAAACAGTTCAGGCTTTGGGGTGAACCTGTTAATCTTCGTGAACAACATGATGCTTTA and ATTCTTGCAATTGAAGGCACAGGTAGTGATACAGATGATAATATATATGGGGAAGAGAAGCAGGACATTGAGATTAATGTTGATCTTCGAGATGATGTTTTTGGATATCCTCATCAATTTGAAGACAAACCAGCATTAACTAAGACTGAAGATAGAAAAGAGTACAATATTGGTGTCCTAAGACACCTTCAGGTCGTCTTTGGTCATTTAGCTGCTTCCCGACTGCAATACTATGTACCCAGAGGATTTTGGAAACAGTTCAG) Could you explain why ?

— Reply to this email directly or view it on GitHub https://github.com/mbreese/swalign/issues/6.

Francois75012 commented 9 years ago

Thanks for your fast answer Marcus.

I was puzzled by a so much difference in the output between swalign and emboss water.

In swalign, I used the following settings :

match = 2 mismatch = -1 gap_penalty = -1 gap_extension_penalty = -1 gap_extension_decay = 0.0 prefer_gaps = True verbose = False

I expect the same result as "water" gives :

(result from the EMBOSS webservice with the same sequences :

########################################

Program: water

Rundate: Mon 8 Jun 2015 20:13:13

Commandline: water

-auto

-stdout

-asequence emboss_water-I20150608-201312-0800-66302481-pg.asequence

-bsequence emboss_water-I20150608-201312-0800-66302481-pg.bsequence

-datafile EDNAFULL

-gapopen 10.0

-gapextend 0.5

-aformat3 pair

-snucleotide1

-snucleotide2

Align_format: pair

Report_file: stdout

########################################

=======================================

#

Aligned_sequences: 2

1: 1

2: 2

Matrix: EDNAFULL

Gap_penalty: 10.0

Extend_penalty: 0.5

#

Length: 30

Identity: 30/30 (100.0%)

Similarity: 30/30 (100.0%)

Gaps: 0/30 ( 0.0%)

Score: 150.0

# #

=======================================

1 1 TGTACCCAGAGGATTTTGGAAACAGTTCAG 30 |||||||||||||||||||||||||||||| 2 234 TGTACCCAGAGGATTTTGGAAACAGTTCAG 263

---------------------------------------

---------------------------------------

Result from swalign :

Query: 26 TGATA--CAGATGATAATATATATGGGGAAGAGAAG--CAGGACATTGAGATTAA--TGTTGATCTTCGAG-A-----TGATGTTTT 101 || || ||||.||| | | | |||| | .|| |||| |.|||.|.|.|| ||||.|||||||.| | |||||.||| Ref : 0 TG-TACCCAGAGGAT--T-T-T----GGAA-A-CAGTTCAGG-CTTTGGGGTGAACCTGTTAATCTTCGTGAACAACATGATGCTTT 75

Le 08/06/2015 18:59, Marcus Breese a écrit :

What settings did you use?

What output did you get? What did you expect?

Marcus

On Jun 8, 2015, at 8:35 AM, Francois75012 notifications@github.com wrote:

two small sequences aligned perfectly well with swalign (ex. AGTCAGCCTGTAAACTGTAATATTTCTGATATGCAGATAATGCTTTTGAATAATCTTCCAATAAGAGGTTGAAGTG and CTTCCAATAGTCAGCCTGTAAACTGTAATATTTCTGAT). When it comes to larger sequences, the alignment is not working any more (ex. TGTACCCAGAGGATTTTGGAAACAGTTCAGGCTTTGGGGTGAACCTGTTAATCTTCGTGAACAACATGATGCTTTA and ATTCTTGCAATTGAAGGCACAGGTAGTGATACAGATGATAATATATATGGGGAAGAGAAGCAGGACATTGAGATTAATGTTGATCTTCGAGATGATGTTTTTGGATATCCTCATCAATTTGAAGACAAACCAGCATTAACTAAGACTGAAGATAGAAAAGAGTACAATATTGGTGTCCTAAGACACCTTCAGGTCGTCTTTGGTCATTTAGCTGCTTCCCGACTGCAATACTATGTACCCAGAGGATTTTGGAAACAGTTCAG) Could you explain why ?

— Reply to this email directly or view it on GitHub https://github.com/mbreese/swalign/issues/6.

— Reply to this email directly or view it on GitHub https://github.com/mbreese/swalign/issues/6#issuecomment-110073185.

François PIUMI Unité Biologie du Développement et de la Reproduction INRA - Domaine de Vilvert - Jouy en Josas 78352 Cedex Tél: +(33) 1 34 65 21 09

mbreese commented 9 years ago

The difference is in the weights / penalties applied. The algorithms used between the two programs should be the same… EMBOSS water seems to use 10 and 0.5 as the gap open and gap extension penalties.

So, if you use the same penalties, you end up with the same alignment:

$ swalign -gap 10 -gapext 0.5 TGTACCCAGAGGATTTTGGAAACAGTTCAGGCTTTGGGGTGAACCTGTTAATCTTCGTGAACAACATGATGCTTTA ATTCTTGCAATTGAAGG    CACAGGTAGTGATACAGATGATAATATATATGGGGAAGAGAAGCAGGACATTGAGATTAATGTTGATCTTCGAGATGATGTTTTTGGATATCCTCATCAATTTGAAGACAAACCAGCATTAACT    AAGACTGAAGATAGAAAAGAGTACAATATTGGTGTCCTAAGACACCTTCAGGTCGTCTTTGGTCATTTAGCTGCTTCCCGACTGCAATACTATGTACCCAGAGGATTTTGGAAACAGTTCAG
Query: cmdline (263 nt)
Ref  : cmdline (76 nt)

Query: 234 TGTACCCAGAGGATTTTGGAAACAGTTCAG 263
           ||||||||||||||||||||||||||||||
Ref  :   1 TGTACCCAGAGGATTTTGGAAACAGTTCAG 30

Score: 60
Matches: 30 (100.0%)
Mismatches: 0
CIGAR: 30M

For the library version, this would be setup something like this:

LocalAlignment(NucleotideScoringMatrix(2, -1), gap_penalty=-10, gap_extension_penalty=-0.5).align(ref, query)
Francois75012 commented 9 years ago

Thank you very much Marcus, it now worked perfectly well !

Francois

Le 08/06/2015 23:16, Marcus Breese a écrit :

The difference is in the weights / penalties applied. The algorithms used between the two programs should be the same… EMBOSS water seems to use 10 and 0.5 as the gap open and gap extension penalties.

So, if you use the same penalties, you end up with the same alignment:

$ swalign -gap 10 -gapext 0.5 TGTACCCAGAGGATTTTGGAAACAGTTCAGGCTTTGGGGTGAACCTGTTAATCTTCGTGAACAACATGATGCTTTA ATTCTTGCAATTGAAGGCACAGGTAGTGATACAGATGATAATATATATGGGGAAGAGAAGCAGGACATTGAGATTAATGTTGATCTTCGAGATGATGTTTTTGGATATCCTCATCAATTTGAAGACAAACCAGCATTAACTAAGACTGAAGATAGAAAAGAGTACAATATTGGTGTCCTAAGACACCTTCAGGTCGTCTTTGGTCATTTAGCTGCTTCCCGACTGCAATACTATGTACCCAGAGGATTTTGGAAACAGTTCAG

For the library version, this would be setup something like this:

LocalAlignment(NucleotideScoringMatrix(2, -1), gap_penalty=-10, gap_extension_penalty=-0.5).align(ref, query)

On Jun 8, 2015, at 12:17 PM, Francois75012 notifications@github.com wrote:

Thanks for your fast answer Marcus.

I was puzzled by a so much difference in the output between swalign and emboss water.

In swalign, I used the following settings :

match = 2 mismatch = -1 gap_penalty = -1 gap_extension_penalty = -1 gap_extension_decay = 0.0 prefer_gaps = True verbose = False

I expect the same result as "water" gives :

(result from the EMBOSS webservice with the same sequences :

########################################

Program: water

Rundate: Mon 8 Jun 2015 20:13:13

Commandline: water

-auto

-stdout

-asequence emboss_water-I20150608-201312-0800-66302481-pg.asequence

-bsequence emboss_water-I20150608-201312-0800-66302481-pg.bsequence

-datafile EDNAFULL

-gapopen 10.0

-gapextend 0.5

-aformat3 pair

-snucleotide1

-snucleotide2

Align_format: pair

Report_file: stdout

########################################

=======================================

#

Aligned_sequences: 2

1: 1

2: 2

Matrix: EDNAFULL

Gap_penalty: 10.0

Extend_penalty: 0.5

#

Length: 30

Identity: 30/30 (100.0%)

Similarity: 30/30 (100.0%)

Gaps: 0/30 ( 0.0%)

Score: 150.0

# #

=======================================

1 1 TGTACCCAGAGGATTTTGGAAACAGTTCAG 30 |||||||||||||||||||||||||||||| 2 234 TGTACCCAGAGGATTTTGGAAACAGTTCAG 263

---------------------------------------

---------------------------------------

Result from swalign :

Query: 26

TGATA--CAGATGATAATATATATGGGGAAGAGAAG--CAGGACATTGAGATTAA--TGTTGATCTTCGAG-A-----TGATGTTTT

101 || || ||||.||| | | | |||| | .|| |||| |.|||.|.|.|| ||||.|||||||.| | |||||.||| Ref : 0

TG-TACCCAGAGGAT--T-T-T----GGAA-A-CAGTTCAGG-CTTTGGGGTGAACCTGTTAATCTTCGTGAACAACATGATGCTTT

75

Le 08/06/2015 18:59, Marcus Breese a écrit :

What settings did you use?

What output did you get? What did you expect?

Marcus

On Jun 8, 2015, at 8:35 AM, Francois75012 <notifications@github.com mailto:notifications@github.com> wrote:

two small sequences aligned perfectly well with swalign (ex.

AGTCAGCCTGTAAACTGTAATATTTCTGATATGCAGATAATGCTTTTGAATAATCTTCCAATAAGAGGTTGAAGTG and CTTCCAATAGTCAGCCTGTAAACTGTAATATTTCTGAT). When it comes to larger sequences, the alignment is not working any more (ex.

TGTACCCAGAGGATTTTGGAAACAGTTCAGGCTTTGGGGTGAACCTGTTAATCTTCGTGAACAACATGATGCTTTA and

ATTCTTGCAATTGAAGGCACAGGTAGTGATACAGATGATAATATATATGGGGAAGAGAAGCAGGACATTGAGATTAATGTTGATCTTCGAGATGATGTTTTTGGATATCCTCATCAATTTGAAGACAAACCAGCATTAACTAAGACTGAAGATAGAAAAGAGTACAATATTGGTGTCCTAAGACACCTTCAGGTCGTCTTTGGTCATTTAGCTGCTTCCCGACTGCAATACTATGTACCCAGAGGATTTTGGAAACAGTTCAG) Could you explain why ?

— Reply to this email directly or view it on GitHub <https://github.com/mbreese/swalign/issues/6 https://github.com/mbreese/swalign/issues/6>.

— Reply to this email directly or view it on GitHub

<https://github.com/mbreese/swalign/issues/6#issuecomment-110073185 https://github.com/mbreese/swalign/issues/6#issuecomment-110073185>.

François PIUMI Unité Biologie du Développement et de la Reproduction INRA - Domaine de Vilvert - Jouy en Josas 78352 Cedex Tél: +(33) 1 34 65 21 09

— Reply to this email directly or view it on GitHub https://github.com/mbreese/swalign/issues/6#issuecomment-110111486.

— Reply to this email directly or view it on GitHub https://github.com/mbreese/swalign/issues/6#issuecomment-110142677.

François PIUMI Unité Biologie du Développement et de la Reproduction INRA - Domaine de Vilvert - Jouy en Josas 78352 Cedex Tél: +(33) 1 34 65 21 09

Ellmen commented 5 years ago

Is it fair to say this issue is closed?

mbreese commented 5 years ago

It should be, yes.