noporpoise / seq-align

Fast, portable C implementations of Needleman-Wunsch and Smith-Waterman sequence alignment
94 stars 39 forks source link

Pull request #2

Closed AndreaCampagner closed 9 years ago

AndreaCampagner commented 9 years ago

I implemented and modified the experimental options (--gaps_only_at_ends, etc.) and also corrected some bug related to already implemented options (--nomismatch, etc.). Also added a sample test set (containing a dozen of simple tests, primarily related to the experimental options).

noporpoise commented 9 years ago

Hi @AndreaCampagner

Some of this is really good, though I have some concerns. This pull request (PR) adds several things at once. It is better to have a pull request for each change e.g.: one for adding tests, one for fixing a bug, one for adding a new feature. Merging multiple changes in a PR makes it hard to figure out why certain changes are being made (e.g. is an edit fixing a bug or adding a feature?).

You also have many commits, and lots of them have very short or duplicate commit messages. You can merge commits together into a single commit that does a single thing (e.g. adds a feature). This makes it easier to cherry pick commits that I want to merge. Longer commit messages would also be helpful. For instance, I'm not sure what commit bcaa89cb037c7db0d8a8bec3e2e884ca95e64112 is doing - I'm not sure what the original bug was.

Lastly you seem to have included binary files in some of your commits. This makes the git history quite large. Since cloning a repo downloads all the history, users will download all your previous binary files that they will never see or use. It's best to create new commits without those binary files in the history.

Hope these comments are helpful for future - I don't mean to discourage your work.

Onto some specific comments:

  1. Adding options to scoring. If we are changing the API, I think we can be clearer and more flexible:

    // Currently
    void scoring_init(scoring_t* scoring,
                     int match, int mismatch,
                     int gap_open, int gap_extend,
                     char no_start_gap_penalty, char no_end_gap_penalty,
                     char no_gaps_in_a, char no_gaps_in_b,
                     char gaps_ends_a, char gaps_ends_b,
                     char no_mismatches, char case_sensitive);
    
    // I propose (would give more flexibility):
    void scoring_init(scoring_t* scoring,
                     int match, int mismatch,
                     int gap_open, int gap_extend,
                     bool no_start_gap_penalty_a, bool no_end_gap_penalty_a,
                     bool no_end_gap_penalty_b, bool no_end_gap_penalty_b,
                     bool no_gaps_in_a, bool no_gaps_in_b,
                     bool no_mismatches, bool case_sensitive);
  2. Testing. You've added a GPLv3 library, which is more restrictive than public domain. I've got some code to replace the test library you chose (cutest), so I'll probably swap that out.
  3. Could you be clear what the bug in commit bcaa89cb037c7db0d8a8bec3e2e884ca95e64112 is?

Many thanks, Isaac

noporpoise commented 9 years ago

I'm not sure I understand what gaps_ends_a/gaps_ends_b represents? Can't see any comments describing it's meaning.

AndreaCampagner commented 9 years ago

Hi, first i want to thank you for your advices, since this is actually the first project to which i contribute (I'm doing this for a project at my University) I'm not very well versed in doing these kind of things so I'm very glad for these comments of yours. About your points:

    ----acgt
    aaaaacg-

But, according to what my professor said, no gaps should be allowed in the longer sequence.

Thanks, Andrea

noporpoise commented 9 years ago

Ah, I understand the issue. I've implemented --nogapsin1 (variable no_gaps_in_a) to mean no gaps WITHIN the sequence. I'm allowing gaps either side. This allows the maximum alignment when sliding two sequences over each other. So the following is the correct alignment with --nogaps:

ataagacg- -------acgt

Perhaps we could add an options for --nogapsinlongest to add the behaviour your professor is after, instead giving the alignment:

ataagacg ---acgt---

noporpoise commented 9 years ago

My recent commits have added tests, implemented the Longest Common Substring command (bin/lcs) and fixed some bugs when using options --nomismatches and --nogaps. I think the only outstanding issue from this pull request is a feature request to add a --nogapsinlongest option. Therefore I'm closing this PR and will open a feature request as an issue.

My most recent commit added a new library to seq-align, so to update your version you'll need to run:

git pull
cd libs && make && cd ..
make

Thanks for your input and I hope seq-align has been useful. Please let me know if there are other issues, Isaac

AndreaCampagner commented 9 years ago

Thanks for clarifying (and for the advice on how to update my version). Many thanks, Andrea