jeffdaily / parasail-python

Python bindings for the parasail C library.
Other
90 stars 17 forks source link

Traceback/Local Alignment String Output Help #27

Closed jihoonk1495 closed 5 years ago

jihoonk1495 commented 6 years ago

Hello, I was trying to replicate the local alignment explained on the Wikipedia page for Smith–Waterman algorithm:

custMatrix = parasail.matrix_create("ACGT", 3, -3)

seq1 = "TGTTACGG" seq2 = "GGTTGACTA" locAl = parasail.sw_trace(seq1, seq2, 2, 2, custMatrix) locCigar = locAl.cigar

How can I use locAl or locCigar to return just this alignment as a string:

GTT-AC
||| ||
GTTGAC

Is there an string output function I am missing?

jeffdaily commented 6 years ago

Good question. In the C library, there is a function parasail_traceback_generic_extra that takes a FILE* stream. There's no good way to wrap that in Python. But there is another function parasail_traceback_generic that prints to stdout and can handle different representations. Would that be useful? I'm not sure how that would interact with the output of your program, mixing Python output and C library stdout output.

If you want to see the types of output formats supported by these functions, look at the output of the C library's parasail_aligner application. It offers a few different ways of representing the alignment to the screen.

If I were to attempt a new feature, unfortunately your question is slightly too generic. What should the returned string look like? GTT-AC\n||| ||\nGTTGAC ? Is this a useful representation? Or would you want seq1, the alignment characters, and seq2 output as separate strings? There are so many variations, that's why I'm pressing for more detail.

jihoonk1495 commented 6 years ago

I am using parasail-python to identify potential motifs in a set of sequences; ideally, I wanted to run local alignment on multiple sequences simultaneously, but as far as I am aware, parasail performs only pairwise alignment. As an alternative, I am running multiple pairwise alignments in a set of sequences, which I believe may share a motif. So I was looking to check local alignment string by eye. I only wrote the example output string GTT-AC\n||| ||\nGTTGAC from experience with Biopython (Bio import pairwise2, Bio.pairwise2 import format_alignment), though it also includes, the un-aligned characters, which I do not want. For my purposes, I believe: GTT-AC\n GTTGAC should be sufficient.

As I am not as familiar with C, which might you recommend to get representation, as described above?

jeffdaily commented 6 years ago

You're right that parasail is only pairwise. What you're asking for does not yet exist in the C library or in the python wrappers. I'll mark this as a feature request.

jihoonk1495 commented 6 years ago

Thank you!

jeffdaily commented 5 years ago

@jihoonk1495 Ok, I'll need you to try out the new functionality to make sure it's what you wanted. I created a traceback function that returns a struct of the three strings representing the alignment. In python, it looks like this:

>>> import parasail
>>> r = parasail.sw_trace("asdf", "asdc", 10, 1, parasail.blosum62)
>>> r.traceback.ref
'asd'
>>> r.traceback.comp
'|||'
>>> r.traceback.query
'asd'
>>> r = parasail.sw_trace("asdf", "asxdf", 10, 1, parasail.blosum62)
>>> r.traceback.ref
'sxdf'
>>> r.traceback.comp
':.||'
>>> r.traceback.query
'asdf'
>>> t = r.get_traceback('=', '+', '-')
>>> t.ref
'sxdf'
>>> t.comp
'+-=='
>>> t.query
'asdf'

The default parameters to get_traceback(mch='|', sim=':', neg='.') can be changed if you call the method. Otherwise, the traceback property will return the default traceback. The returned query and ref strings will contain dashes to indicate indels.

Let me know if that's what you wanted. If so, I will tag a new C library release as well as a new python release. If you don't want to test it, but you otherwise approve of the interface, I'll cut the new releases as-is.

To test the new functionality, you would need to build the C library yourself from the develop branch. These python wrappers are already part of the master branch but the corresponding C functionality that these wrappers depend on are only part of the C library's develop branch.

jihoonk1495 commented 5 years ago

@jeffdaily

Before I try out the new functionality, I was wondering "neg" signifies. Does it indicate the particular character to be part of the optimal substring? In the above example between sxdf and asdf, what differentiates s - x (position 2) , from s - a (position 1)?

jeffdaily commented 5 years ago

"neg" signifies a substitution where the substitution matrix indicated a negative score. You are allowed to have different characters for indicating an exact match, a positive-valued substitution, and a negative-valued substitution; as opposed to indels.

jeffdaily commented 5 years ago

Releasing 1.1.13 today with these features.