mengyao / Complete-Striped-Smith-Waterman-Library


Attribution to Heng Li #1

Closed jimhester closed 11 years ago

jimhester commented 11 years ago

Good work on the library. I think it can be a valuable addition that will make it easy to add fast Smith-Waterman alignments to analysis pipelines. However, the lack of attribution to Heng Li in this library's documentation and the accompanying paper greatly concerns me.

This library seems to be derived from @lh3 (Heng Li)'s Smith-Waterman implementation in bwa and his standalone ksw. See https://github.com/attractivechaos/klib and https://github.com/lh3/bwa. This is apparent both from casual code perusal and from the plagiarism detection program moss: http://moss.stanford.edu/results/265225620

The MIT license included in the ksw release states:

Copyright (c) 2011 by Attractive Chaos <attractor@live.co.uk>

   Permission is hereby granted, free of charge, to any person obtaining
   a copy of this software and associated documentation files (the
   "Software"), to deal in the Software without restriction, including
   without limitation the rights to use, copy, modify, merge, publish,
   distribute, sublicense, and/or sell copies of the Software, and to
   permit persons to whom the Software is furnished to do so, subject to
   the following conditions:

   The above copyright notice and this permission notice shall be
   included in all copies or substantial portions of the Software.

I read this to mean that you are free to do what you want with the software provided you give proper attribution to the source, which you seem to have neglected to do.

I would like to use this library in my work, and I think it can be a valuable tool for the bioinformatics community at large. However, if this issue remains unresolved, I will not be able to do so in good conscience.

mengyao commented 11 years ago

Thank you very much for your interest and for pointing out this issue.

This library (SSW) is not derived from Heng Li's Smith-Waterman (SW) implementation. The two are highly similar in the core part, the SW score matrix calculation. This is because both are implementations of Farrar’s algorithm (please see Fig. 5 of Farrar, M., 2007, Striped Smith-Waterman speeds database searches six times over other SIMD implementations, Bioinformatics). The pseudo-code is given explicitly in that figure. Most implementations of Farrar’s algorithm, such as the SW implementation in Stampy (Lunter, G. and Goodson, M., 2011, Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads, Genome Res), use similar code in their core. The algorithm is difficult to write in any other way; different implementations just add extra lines to record the score matrix information needed for traceback. Whatever they add, the code for the matrix calculation is almost exactly the same.

I did read Heng Li’s klib before. When I saw the following macro:

#define __max_16(ret, xx) do { \
    (xx) = _mm_max_epu8((xx), _mm_srli_si128((xx), 8)); \
    (xx) = _mm_max_epu8((xx), _mm_srli_si128((xx), 4)); \
    (xx) = _mm_max_epu8((xx), _mm_srli_si128((xx), 2)); \
    (xx) = _mm_max_epu8((xx), _mm_srli_si128((xx), 1)); \
    (ret) = _mm_extract_epi16((xx), 0) & 0x00ff; \
} while (0)

I thought it was a great piece of code that could lead to extra efficiency, so I used this macro to replace the corresponding part of my program. As a student, I also like to learn from experts.

I appreciate your pointing out the MIT license issue, and I apologize for my lack of experience. I have added the license to the file ssw.c. I hope this resolves your concerns. Please feel free to let me know if you find any other problems with this library. Thank you again, sincerely, for your encouragement.