jeffdaily / parasail

Pairwise Sequence Alignment Library
Other
241 stars 34 forks source link

DNA Substitution Matrices #24

Closed rmhubley closed 6 years ago

rmhubley commented 8 years ago

Currently the code doesn't allow matrix specification for DNA searches. Would it be an easy feature to add?

jeffdaily commented 8 years ago

I'm sure I could add DNA substitution matrices. There is a function in the C API that lets you create a substitution matrix given an alphabet and match/mismatch scores. Are there some common DNA substitution matrices that should be included? I assumed, perhaps incorrectly, that DNA was scored based on a simple match/mismatch score and not anything more complicated like BLOSUM. Please let me know and I will add anything missing.

rmhubley commented 8 years ago

In many species there is big difference between the mutagenicity of base pairings due, in part to the high rate of methylated cytosines to thymine transitions. This is the strongly the case for mammals. It is not sufficient to use transformations of the protein derived matrices either ( BLOSUM etc ) as they are derived from sequences under selective pressure and codon bias. Anyone searching for neutrally evolving sequences would not be best served by them. The RepeatMasker package comes with a set of DNA matrices parameterized on divergence and GC background that you are welcome to but I recommend the software support the "-m" flag in much the same way FASTA, Phrap/Cross_match and RMBlast does.

jeffdaily commented 8 years ago

If you could, please try the 'feature/dna' branch. I recently added the ability to parse matrix files of the form:

 # FREQS A 0.325 C 0.175 G 0.175 T 0.325
     A   R   G   C   Y   T   K   M   S   W   N   X
 A   8   3  -7 -17 -19 -21 -14  -4 -12  -6  -1 -30
 R   0   3   2 -16 -18 -19  -8  -8  -7 -10  -1 -30
 G -10  12  12 -16 -17 -18  -2 -13  -1 -14  -1 -30
 C -18 -17 -16  12  12 -10 -13  -2  -1 -14  -1 -30
 Y -19 -18 -16   2   0   0  -8  -8  -7 -10  -1 -30
 T -21 -19 -17  -7   3   8  -4 -14 -12  -6  -1 -30
 K -15  -9  -2 -11  -8  -4  -3 -13  -7 -10  -1 -30
 M  -4  -8 -11  -2  -9 -15 -13  -3  -7 -10  -1 -30
 S -14  -8  -1  -1  -8 -14  -8  -8  -1 -14  -1 -30
 W  -6  -9 -12 -12  -9  -6  -9  -9 -12  -6  -1 -30
 N  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1 -30
 X -30 -30 -30 -30 -30 -30 -30 -30 -30 -30 -30 -30

Comments are ignored and can appear as an entire line or end of a line. The character order of the alphabet header row must match the alphabet column (i.e., symmetric).

Note that parasail will map any unknown character from the input sequence(s) to the last row or column of the given substitution matrix.

Please let me know if this is what you were hoping for in your feature request. If it works for you, then I will put it into the next release.

Elmohound commented 6 years ago

I am having problems invoking the function parasail_matrix_from_file on windows, but it works just fine under Linux. I trace this to the function parasail_open in io.c, whereby the fread call on this line returns 1.

if (fs.st_size != fread(buf, fs.st_size, 1, fd)) { ....

If I reverse the second and third arguments (where size becomes 1 and count is fs.st_size), then the execution proceeds normally. This is probably a windows-only problem.

https://stackoverflow.com/questions/11717120/unexpected-return-value-from-fread https://stackoverflow.com/questions/15440671/fread-return-value-1-byte-even-though-file-is-getting-read-completely

My question is whether my modification to parasail_open will break parasail_aligner on Mac or Linux or generally cause other problems (like missing values read from the matrix)?

jeffdaily commented 6 years ago

Thank you @Elmohound for the good details. I will look into this.

jeffdaily commented 6 years ago

Based on the documentation for fread() it looks like I transposed the arguments, just like you pointed out in the stackoverflow article. Looks like a bugfix is needed.