s9e / ShortestCommonSuperstring

Shortest common superstring generator using a greedy approximation algorithm.
MIT License
0 stars 1 forks source link

doesn't give the shortest common superstring always? #1

Closed increpare closed 4 years ago

increpare commented 4 years ago

I ran it on the wordlist for toki pona ( https://en.wikipedia.org/wiki/Toki_Pona ), and it gave me

lasonanpakalamakesikepekenenamakontokisupalisatasokulupuponamusinpinisulitusemelisewimokulelukinwekasijelosuwiletomolinjakiwenpilinsaselomunpanasinasakutemutewasowelilapelupatawanupipimejantenpokililokoselijopenpokawenmonsitelenlawawalojelukaliliputalasamamanimijesunokamaletelontan

which has 282 characters. However the code on this site produces the string

lukinsakesitelenpilinjantenpokamawenkuletelokopenenasinpinijokutelapelasoweliliputalasalesunokasikepekenlawasonamakonlukalupalisamamokulupumolinmusinanpakalamamanimijemutepipimejakiwenpokililoponaseliselosemelisewilesulisupanasasuwitasotawawalojetokitomonsijelonwekamunpatanuwantu

which contains 280 characters and still contains all words (I checked that independently).

(the code I used is

$scs    = new ShortestCommonSuperstring;
echo $scs->getShortest(["a", "akesi", "ala", "alasa", "ale", "ali", "anpa", "ante", "anu", "awen", "e", "en", "esun", "ijo", "ike", "ilo", "insa", "jaki", "jan", "jelo", "jo", "kala", "kalama", "kama", "kasi", "ken", "kepeken", "kili", "kiwen", "ko", "kon", "kule", "kulupu", "kute", "la", "lape", "laso", "lawa", "len", "lete", "li", "lili", "linja", "lipu", "loje", "lon", "luka", "lukin", "oko", "lupa", "ma", "mama", "mani", "meli", "mi", "mije", "moku", "moli", "monsi", "mu", "mun", "musi", "mute", "nanpa", "nasa", "nasin", "nena", "ni", "nimi", "noka", "o", "olin", "ona", "open", "pakala", "pali", "palisa", "pan", "pana", "pi", "pilin", "pimeja", "pini", "pipi", "poka", "poki", "pona", "pu", "sama", "seli", "selo", "seme", "sewi", "sijelo", "sike", "sin", "namako", "sina", "sinpin", "sitelen", "sona", "soweli", "suli", "suno", "supa", "suwi", "tan", "taso", "tawa", "telo", "tenpo", "toki", "tomo", "tu", "unpa", "uta", "utala", "walo", "wan", "waso", "wawa", "weka", "wile"]);

my copy of the cpp code reads

#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <climits>
using namespace std;

// Function to calculate maximum overlap in two given strings
int findOverlappingPair(string s1, string s2, string& str)
{
    // max will store maximum overlap i.e maximum length of the
    // matching prefix and suffix
    int max = INT_MIN;
    int m = s1.length();
    int n = s2.length();

    // check suffix of s1 matches with prefix of s2
    for (int i = 1; i <= min(m, n); i++)
    {
        // compare last i characters in s1 with first i characters in s2
        if (s1.compare(m - i, i, s2, 0, i) == 0)
        {
            if (max < i)
            {
                //update max and str
                max = i;
                str = s1 + s2.substr(i);
            }
        }
    }

    // check prefix of s1 matches with suffix of s2
    for (int i = 1; i <= min(m, n); i++)
    {
        // compare first i characters in s1 with last i characters in s2
        if (s1.compare(0, i, s2, n - i, i) == 0)
        {
            if (max < i)
            {
                //update max and str
                max = i;
                str = s2 + s1.substr(i);
            }
        }
    }

    return max;
}

// Function to calculate smallest string that contains
// each string in the given set as substring.
string findShortestSuperstring(string arr[], int n) // no const, no-ref
{

    // run n-1 times to consider every pair
    while (n != 1)
    {
        // to store  maximum overlap
        int max = INT_MIN;

        // to store array index of strings involved in maximum overlap
        int p, q;

        // to store resultant string after maximum overlap
        string res_str;

        for (int i = 0; i < n; i++)
        {
            for (int j = i + 1; j < n; j++)
            {
                string str;

                // r will store maximum length of the matching prefix and
                // suffix. str is passed by reference and store the result
                // string after maximum overlap of arr[i] and arr[j], if any
                int r = findOverlappingPair(arr[i], arr[j], str);

                // check for maximum overlap
                if (max < r)
                {
                    max = r;
                    res_str.assign(str);
                    p = i, q = j;
                }
            }
        }

        // ignore last element in next cycle
        n--;

        // if no overlap, append arr[n] to arr[0]
        if (max == INT_MIN)
            arr[0] += arr[n];
        else
        {
            // copy resultant string to index p
            arr[p] = res_str;

            // copy string at last index to index q
            arr[q] = arr[n];
        }
    }

    return arr[0];
}

// Shortest Superstring Problem
int main()
{
    string arr[] =  {"a", "akesi", "ala", "alasa", "ale", "ali", "anpa", "ante", "anu", "awen", "e", "en", "esun", "ijo", "ike", "ilo", "insa", "jaki", "jan", "jelo", "jo", "kala", "kalama", "kama", "kasi", "ken", "kepeken", "kili", "kiwen", "ko", "kon", "kule", "kulupu", "kute", "la", "lape", "laso", "lawa", "len", "lete", "li", "lili", "linja", "lipu", "loje", "lon", "luka", "lukin", "oko", "lupa", "ma", "mama", "mani", "meli", "mi", "mije", "moku", "moli", "monsi", "mu", "mun", "musi", "mute", "nanpa", "nasa", "nasin", "nena", "ni", "nimi", "noka", "o", "olin", "ona", "open", "pakala", "pali", "palisa", "pan", "pana", "pi", "pilin", "pimeja", "pini", "pipi", "poka", "poki", "pona", "pu", "sama", "seli", "selo", "seme", "sewi", "sijelo", "sike", "sin", "namako", "sina", "sinpin", "sitelen", "sona", "soweli", "suli", "suno", "supa", "suwi", "tan", "taso", "tawa", "telo", "tenpo", "toki", "tomo", "tu", "unpa", "uta", "utala", "walo", "wan", "waso", "wawa", "weka", "wile"};
    int len = (sizeof(arr)/sizeof(*arr));
    cout << "The Shortest Superstring is " << findShortestSuperstring(arr,len) <<"\n";

    return 0;
}

)

JoshyPHP commented 4 years ago

There's a short answer and a more general answer. The short answer is that you inverted the results. This library generates the 280 chars string and the C++ code generates the 282 chars one. So hurray, this library does find a better result in this particular case! :tada:

Now, to answer your question of whether this library always generates the optimal result: it generally does, but there's no guarantee that it always does because that would be prohibitively expensive.

increpare commented 4 years ago

Ahah -_- sorry! It indeed gives the better one. Very nice. Sorry for the false alarm!

Now, to answer your question of whether this library always generates the optimal result: it generally does, but there's no guarantee that it always does because that would be prohibitively expensive.

This being the case, I'd recommend putting the some qualifier along the lines of 'approximate' in the project description/readme somewhere.

Anyway, thanks for making the code available :)

JoshyPHP commented 4 years ago

No worries, I'm happy that anybody even bothers to compare this implementation's results to other algorithms. :+1:

As for describing the algorithm as approximate, I'll think about it. The algorithm was developed independently and I haven't done the math to formally prove its effectiveness (or efficiency, for that matter.) At the heart of it, it's very similar to a greedy approximate algorithm with two main differences:

  1. There is an initial pass that removes fully overlapping strings so that ["ABA", "B"] produces "ABA" instead of "ABAB".

  2. When assembling overlapping strings, substrings whose prefix match their suffix (a loop, if they were presented as a graph) are given priority so that ["ABB", "BBA", "BBB"] produces "ABBBA" instead of "ABBABBB".