kosta777 / parallel-genomeseq

Parallelization of popular genome sequencing algorithms
4 stars 1 forks source link

API change to `LocalAligner` #6

Closed huanglangwen closed 4 years ago

huanglangwen commented 4 years ago

I think we should make some improvements for LocalAligner API. Currently, we need 4 steps to actually get the alignment result and we can't get aligned string through public API. Also, we can't specify score function and gap function.

LocalAligner *la = new SWAligner();
la->setFirstSequence(s1);
la->setSecondSequence(s2);
la->calculateScore();
la->getScore();
la->getPos();

So, it might be better to put alignment process (calculateScore) in initialization phase and provide an API to have access to the aligned string (some wrappers of concensus_a/b):

LocalAligner *la = new SWAligner(s1,s2); //default score function
//or use user defined score function
auto scorefunc = [] (char a, char b) {(a==b)?1:-1};
LocalAligner *la = new SWAligner(s1, s2, scorefunc);
cout << la << endl;
la.score;
la.pos;
la.consensus1;
la.consensus2;

As a result, the signature of LocalAligner would be

class LocalAligner {
  public:
    LocalAligner(std::string s1, std::string s2);
    LocalAligner(std::string s1, std::string s2, std::function scorefunc);
    friend std::ostream & operator<<(std::ostream & os, const LocalAligner & foo);
    const double score;
    const int pos;
    const std::string consensus1;
    const std::string consensus2;
};
spaceben commented 4 years ago

Let me push first and then edit my version since I half-addressed this.

spaceben commented 4 years ago

score would just be the highest entry in the similarity matrix, correct? @huanglangwen What is pos?

huanglangwen commented 4 years ago

score would just be the highest entry in the similarity matrix, correct? @huanglangwen What is pos?

Yes, score is the maximum of the matrix and pos is the position of the first matched position in sequenceB.

spaceben commented 4 years ago
0       0       0       0       0       0       0       0       0
0       0       3       1       0       0       0       3       3
0       0       3       1       0       0       0       3       6
0       3       1       6       4       2       0       1       4
0       3       1       4       9       7       5       3       2
0       1       6       4       7       6       4       8       6
0       0       4       3       5       10      8       6       5
0       0       2       1       3       8      *13*     11      9
0       3       1       5       4       6       11      10      8
0       1       0       3       2       7       9       8       7
Maximum is 13 @ (7, 6)
G(GTTGAC)TA        // sequence A
T(GTT-AC)GG        // sequence B

@huanglangwen So pos would be 7?

huanglangwen commented 4 years ago

@spaceben No, it would be 2 (at the beginning of the matched seq).

spaceben commented 4 years ago

Ah understood.

But a 1-based counter (not the usual 0-based)?

huanglangwen commented 4 years ago

Ah understood. ...

@spaceben Currently, it is 1 based. might due to consistency with the POS in SAM file? @hanyao8

hanyao8 commented 4 years ago

Yes, the POS taken from SAM is 1-based