increpare closed 4 years ago
There's a short answer and a more general answer. The short answer is that you inverted the results. This library generates the 280-character string and the C++ code generates the 282-character one. So hurray, this library does find a better result in this particular case! :tada:
Now, to answer your question of whether this library always generates the optimal result: it generally does, but there's no guarantee that it always does because that would be prohibitively expensive.
Ahah -_- sorry! It indeed gives the better one. Very nice. Sorry for the false alarm!
> Now, to answer your question of whether this library always generates the optimal result: it generally does, but there's no guarantee that it always does because that would be prohibitively expensive.
This being the case, I'd recommend putting some qualifier along the lines of 'approximate' in the project description/readme somewhere.
Anyway, thanks for making the code available :)
No worries, I'm happy that anybody even bothers to compare this implementation's results to other algorithms. :+1:
As for describing the algorithm as approximate, I'll think about it. The algorithm was developed independently and I haven't done the math to formally prove its effectiveness (or efficiency, for that matter). At the heart of it, it's very similar to a greedy approximation algorithm, with two main differences:
- There is an initial pass that removes fully overlapping strings so that `["ABA", "B"]` produces `"ABA"` instead of `"ABAB"`.
- When assembling overlapping strings, substrings whose prefix matches their suffix (a loop, if they were represented as a graph) are given priority so that `["ABB", "BBA", "BBB"]` produces `"ABBBA"` instead of `"ABBABBB"`.
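To make the two differences concrete, here is a minimal Python sketch of a greedy merge with both tweaks bolted on. It is not this library's code: the function names (`drop_contained`, `overlap`, `has_loop`, `greedy_superstring`) and the `prefer_loops` flag are made up for illustration, and treating "priority" as a tie-break between equally long overlaps is only one possible reading of the second point.

```python
def drop_contained(words):
    """Initial pass: drop duplicates and any word contained in another word."""
    unique = list(dict.fromkeys(words))  # de-duplicate, keep input order
    return [w for w in unique if not any(w != o and w in o for o in unique)]


def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def has_loop(word):
    """True if some proper, non-empty prefix of `word` equals its suffix."""
    return any(word[:k] == word[-k:] for k in range(1, len(word)))


def greedy_superstring(words, prefer_loops=False):
    """Repeatedly merge the pair with the largest overlap.

    With prefer_loops=True, ties between equally long overlaps go to a pair
    that involves a self-overlapping word (an assumed reading of the rule).
    """
    pool = drop_contained(words)
    while len(pool) > 1:
        best_key, best_pair = (-1, False), None
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i == j:
                    continue
                loop = prefer_loops and (has_loop(a) or has_loop(b))
                key = (overlap(a, b), loop)
                if key > best_key:
                    best_key, best_pair = key, (i, j)
        i, j = best_pair
        merged = pool[i] + pool[j][best_key[0]:]
        rest = [w for n, w in enumerate(pool) if n not in (i, j)]
        pool = drop_contained(rest + [merged])
    return pool[0] if pool else ""


print(greedy_superstring(["ABA", "B"]))
print(greedy_superstring(["ABB", "BBA", "BBB"]))
print(greedy_superstring(["ABB", "BBA", "BBB"], prefer_loops=True))
```

On the second example, the plain greedy pass in this sketch merges `"ABB"` and `"BBA"` first and ends up with a 7-character result, while the tie-break merges through `"BBB"` and reaches the 5-character `"ABBBA"`.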
I ran it on the wordlist for Toki Pona (https://en.wikipedia.org/wiki/Toki_Pona), and it gave me
lasonanpakalamakesikepekenenamakontokisupalisatasokulupuponamusinpinisulitusemelisewimokulelukinwekasijelosuwiletomolinjakiwenpilinsaselomunpanasinasakutemutewasowelilapelupatawanupipimejantenpokililokoselijopenpokawenmonsitelenlawawalojelukaliliputalasamamanimijesunokamaletelontan
which has 282 characters. However, the code on this site produces the string
lukinsakesitelenpilinjantenpokamawenkuletelokopenenasinpinijokutelapelasoweliliputalasalesunokasikepekenlawasonamakonlukalupalisamamokulupumolinmusinanpakalamamanimijemutepipimejakiwenpokililoponaseliselosemelisewilesulisupanasasuwitasotawawalojetokitomonsijelonwekamunpatanuwantu
which contains 280 characters and still contains all words (I checked that independently).
(the code I used is
my copy of the cpp code reads
)
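For anyone who wants to repeat the independent check, a small sketch along these lines should do: report the length of a candidate string and list any words it fails to contain. The helper name `missing_words` is made up for this example, and the toy word list stands in for the real Toki Pona list, which is not reproduced here.

```python
def missing_words(candidate, words):
    """Return the words that do not occur as a substring of `candidate`."""
    return [w for w in words if w not in candidate]


# Toy data in place of the real word list and the two long strings.
words = ["ABB", "BBA", "BBB"]
for candidate in ("ABBBA", "ABBABBB", "ABBA"):
    print(len(candidate), candidate, missing_words(candidate, words))
```

An empty list means the candidate covers every word, so comparing two results comes down to comparing their lengths.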