mbreese / swalign

Smith-Waterman local aligner

Documentation & Examples #1

Closed sholsapp closed 4 years ago

sholsapp commented 11 years ago

This module needs example use cases and documentation.

I can't figure out how to use it.

mbreese commented 11 years ago

Yes, it does... This is a pretty standard Smith-Waterman-style sequence alignment tool. It is useful for aligning DNA or amino acid sequences. For some of my work, it's helpful to have a pure Python version of a tool that does this, particularly one that gives you lots of control over the penalty values.

Check out the bin/swalign script for some clues. You need to create a LocalAlignment object. It needs to know the scoring matrix and the gap penalty values to use.

After this, you can create alignments using the align method. You need to provide the query sequence and the target sequence. The alignment object has some key values, like percent_identity, a score, a CIGAR representation of the alignment, etc. You can also dump the alignment to stdout where it will print out something close to what BLAST returns.

Or, if you want to align sequences from FASTA files, you can use the included bin/swalign script to manage most of this for you. If the package is installed with pip, the swalign script is set up for you.

Hopefully this helps.

Examples...

import swalign
sw = swalign.LocalAlignment(
    swalign.NucleotideScoringMatrix(match, mismatch),
    gap_penalty, gap_extension_penalty, gap_extension_decay)

aln = sw.align(r_seq, q_seq, ref_name, query_name)
aln.dump()

sholsapp commented 11 years ago

Ah, it'd be great to add that to the README file. That is helpful to get me started.

Do you have any examples/tutorials for writing scoring matrices?

mbreese commented 11 years ago

For DNA, you'd set match and mismatch values.

Common values would be match=2, mismatch=-1 or match=1, mismatch=-1. A match adds to the score and a mismatch subtracts from it. The swalign.NucleotideScoringMatrix class takes these values in __init__.
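
To make the effect of the match and mismatch values concrete, here is a minimal sketch of a Smith-Waterman score computation. It is not swalign's implementation: the function names are made up for this example, and it assumes a single linear gap penalty rather than separate gap-open and gap-extension penalties.

```python
def nucleotide_score(a, b, match=2, mismatch=-1):
    """Score one aligned pair of bases: reward a match, penalize a mismatch."""
    return match if a == b else mismatch


def local_alignment_score(ref, query, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local score, assuming a linear gap penalty."""
    prev = [0] * (len(query) + 1)   # previous row of the scoring matrix
    best = 0
    for r in ref:
        curr = [0]                  # first column is always 0
        for j, q in enumerate(query, start=1):
            cell = max(
                0,                                                   # start a new local alignment
                prev[j - 1] + nucleotide_score(r, q, match, mismatch),  # align r with q
                prev[j] + gap,                                       # gap in the query
                curr[j - 1] + gap,                                   # gap in the reference
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(local_alignment_score("ACGT", "ACGT"))           # 8
print(local_alignment_score("ACGT", "ACGT", match=1))  # 4
print(local_alignment_score("AAAA", "CCCC"))           # 0
```

With match=2, four matching bases score 8; with match=1 they score 4. Sequences with nothing in common score 0, because a local alignment never drops below zero.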

For proteins, you'd want to use something like PAM50 or BLOSUM62, which are substitution matrices based on amino acid similarity. They can be downloaded from NCBI, or many other places. The class swalign.ScoringMatrix will read in a file like this.
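
For orientation, substitution matrix files from NCBI are plain text: comment lines start with '#', the first data line lists the column letters, and each following line holds a row letter and its scores. The sketch below parses that layout into a dictionary. It is an illustration only, not how swalign.ScoringMatrix is implemented; the function name is hypothetical and the matrix values are made up, not real BLOSUM or PAM scores.

```python
import io

# Toy matrix in the NCBI text layout. The values are invented for this
# example and are NOT taken from any published substitution matrix.
TOY_MATRIX = """\
# comment lines start with '#'
   A  C  G
A  3 -1 -2
C -1  3 -1
G -2 -1  3
"""


def read_substitution_matrix(handle):
    """Parse an NCBI-style matrix into a {(row, column): score} dictionary."""
    columns = None
    scores = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        fields = line.split()
        if columns is None:
            columns = fields              # first data line is the column header
            continue
        row, values = fields[0], fields[1:]
        for column, value in zip(columns, values):
            scores[(row, column)] = int(value)
    return scores


matrix = read_substitution_matrix(io.StringIO(TOY_MATRIX))
print(matrix[("A", "A")], matrix[("A", "G")], matrix[("G", "C")])  # 3 -2 -1
```

For a downloaded file, the same function would take an open file handle in place of the io.StringIO object.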

sholsapp commented 11 years ago

Ahh. The algorithm assumes I'm using DNA sequences. I'm planning on adjusting the algorithm to work over arbitrary sequences. Do you know of any implementations that allow me to check arbitrary sequences (or tokens)?

I found the docs, but would suggest adding the examples to the docstrings of the classes so that people using ipython or using the library from PyPI can figure out how to use the library.

mbreese commented 11 years ago

What are you aligning? That will dictate the scoring matrix. BLOSUM62 is pretty common for proteins.

http://www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html

On Jun 7, 2013, at 2:52 PM, Stephen Holsapple notifications@github.com wrote:

After some reading, I think I'm looking for a "BLAST" scoring matrix?

sholsapp commented 11 years ago

I want to align tokens of my own "language". In a nutshell, I'm consuming a stream of events, each of which I distill down into a token of my "language". From the stream, I produce a sequence of these tokens. I want to use this algorithm to detect sequence similarity between two different words of my "language". E.g., I want to align user activity (a sequence of activity tokens) to determine intent (a pre-built sequence of tokens).

The big thing that I'm trying to figure out is how to treat the scoring matrix in a generic fashion. I don't want to limit my algorithm to DNA bases or amino acids.

mbreese commented 11 years ago

Are the tokens single characters? If not, I'm not sure my code would work - but it might.

If they are single characters, then it should work. You'd have to construct a scoring matrix with the appropriate format, though. Look at the docs for the ScoringMatrix class.
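
If the tokens are longer than one character, one possible workaround is to map each distinct token to a single character before aligning. The sketch below shows that idea under stated assumptions: the helper names and the event tokens are hypothetical, and the symbol pool is limited to 62 letters and digits so the symbols stay easy to write in a text scoring matrix.

```python
import string

# Pool of single-character symbols (62 in total).
SYMBOLS = string.ascii_uppercase + string.ascii_lowercase + string.digits


def build_alphabet(sequences):
    """Map each distinct token to one character, in first-seen order."""
    alphabet = {}
    for sequence in sequences:
        for token in sequence:
            if token not in alphabet:
                if len(alphabet) >= len(SYMBOLS):
                    raise ValueError("more distinct tokens than available symbols")
                alphabet[token] = SYMBOLS[len(alphabet)]
    return alphabet


def encode(sequence, alphabet):
    """Turn a list of tokens into a string of single-character symbols."""
    return "".join(alphabet[token] for token in sequence)


# Hypothetical event tokens, invented for this example.
activity = ["login", "search", "view_item", "add_to_cart", "checkout"]
intent = ["search", "view_item", "checkout"]

alphabet = build_alphabet([activity, intent])
print(encode(activity, alphabet))  # ABCDE
print(encode(intent, alphabet))    # BCE
```

The encoded strings can then be handed to a character-based aligner. The scoring matrix would need one row and one column for every symbol in the alphabet.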

sholsapp commented 11 years ago

I can assume that for the simple case they will be single chars, but I might want to fix that in the long run.

I'll do that. When building the scoring matrix, are there any docs or research you're aware of that I should consider?

mostafahasanin commented 8 years ago

When I run the code, this error appears: ImportError: No module named 'swalign'. Can you help me?

mbreese commented 8 years ago

Can you tell me how you installed and ran it? What command did you type?

Marcus

mostafahasanin commented 8 years ago

Thank you, mbreese. I have run the code correctly, but I think the code is not efficient: if the sequence is large, it takes a long time to run.

mbreese commented 8 years ago

If the sequences are very large, it will be very inefficient. The library is more useful for quick tests, exploring the alignment algorithms, and small sequences. It’s not really meant for genome-scale work. How big are the sequences?
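
As a back-of-the-envelope illustration of why size matters: a textbook Smith-Waterman fills one scoring-matrix cell for every pair of positions in the two sequences, so the work grows with the product of their lengths. The helper below is a hypothetical name used only for this estimate.

```python
def dp_cells(ref_len, query_len):
    """Number of cells in the (ref_len + 1) x (query_len + 1) scoring matrix."""
    return (ref_len + 1) * (query_len + 1)


# Compare a small pair, a medium pair, and a long reference with a short query.
for ref_len, query_len in [(100, 100), (10_000, 10_000), (1_000_000, 100)]:
    cells = dp_cells(ref_len, query_len)
    print(f"{ref_len:>9,} x {query_len:>6,}: {cells:,} cells")
```

Each cell takes several Python-level operations to fill, which is why a pure Python implementation is comfortable for small inputs but slow for very long ones.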

mostafahasanin commented 8 years ago

I am using DNA sequences, and they are very large.

akanksha2016 commented 7 years ago

When I run the code, this error appears: ImportError: No module named 'swalign'. When I try installing with pip install git+https://github.com/mbreese/swalign/ I get this error: could not create '/usr/local/p3_env/lib/python3.5/site-packages/swalign': Permission denied. Can you please help?

mbreese commented 7 years ago

It looks like you don't have permissions to install anything on that computer. You'll either need root permissions to install, or (recommended) install in a user-local directory.

Look for the --user option in pip; there is no short -u form.
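
When an ImportError like this shows up, it can help to confirm which Python interpreter is actually running and whether it can see the package at all. This is a small diagnostic sketch using only the standard library; the function name is made up for this example.

```python
import importlib.util
import sys


def describe_module(name):
    """Report which interpreter is running and whether `name` is importable."""
    spec = importlib.util.find_spec(name)
    location = spec.origin if spec is not None else "not found"
    return f"python: {sys.executable}\n{name}: {location}"


print(describe_module("swalign"))
```

If the interpreter printed here is not the one pip installed into, the package will not be found even though the installation succeeded.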

imsanka commented 6 years ago

Hi, I am trying to use swalign to align amino acid sequences. When I ran it, the alignment score seemed to be based on a default matrix. Which matrix is that? If I want to add a new matrix, do you have a recommendation? Thank you.

Cheers, IS

t0mj0nes commented 4 years ago

The simple examples given do not work out of the box:

AttributeError: module 'swalign' has no attribute 'LocalAlignment'

t0mj0nes commented 4 years ago

Follow-up: It works if you manually build/install in the installation directory only (Win10).

How can I capture the output, rather than just sending it to stdout?
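
One general Python pattern for this, sketched under an assumption: if the dump method accepts a file-like object to write to (here called out, which should be checked against the installed version's signature), an in-memory buffer can be passed in place of stdout. The dump function below is a stand-in written for this example, not swalign's code, and its output text is invented.

```python
import io
import sys


def dump(out=sys.stdout):
    """Stand-in for a report printer that writes to a file-like object."""
    out.write("Query: ACGT\n")
    out.write("Ref  : ACGT\n")


buffer = io.StringIO()
dump(out=buffer)               # write into memory instead of the terminal
captured = buffer.getvalue()   # the full report as one string
print(captured.splitlines())   # ['Query: ACGT', 'Ref  : ACGT']
```

Note that contextlib.redirect_stdout would not capture the output of a function declared this way, because the default out=sys.stdout is bound once, when the function is defined. Passing the buffer explicitly avoids that problem.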