ostafen / smart

String Matching Algorithms Research Tool
https://smart-tool.github.io/smart/
GNU General Public License v3.0
4 stars 2 forks source link

Fix algos to allow any size pattern. #55

Closed nishihatapalmer closed 1 year ago

nishihatapalmer commented 1 year ago

Old smart hard coded a pattern length limit of XSIZE, which was set to 4200. New smart can have arbitrarily sized patterns. This PR removes references to XSIZE, and replaces them with m (or a larger value where appropriate).

Methodology All algorithms were tested with XSIZE = 4200, and a pattern length of 4200, to ensure they handle a full size pattern already, and whether they handle a pattern of 4199. Many algorithms seg-fault when m == XSIZE, as they actually need m + 1 elements.

Following analysis of the code to see whether access at position m or beyond occurs, XSIZE is replaced with m, m + 1 or beyond as appropriate.

After the change, the algorithms were retested to ensure they now work with any pattern size. Some algorithms still have other problems which are not resolved by these changes, which are mentioned in the commits and code comments. They will have to be addressed separately.

nishihatapalmer commented 1 year ago

Each algorithm is in a separate commit (except a few where a family is changed at once).

There's quite a lot of algorithms to review, but they mostly all follow a similar pattern. Notes are provided in the commits to explain the reasoning for each.

nishihatapalmer commented 1 year ago

Some algorithm implementations cannot handle a pattern size beyond about 4096, because they allocate a table of size [m][SIGMA]. When m is large (and SIGMA = 256), then it is too big for the heap. Proper fix is to allocate this memory dynamically, but I haven't done that in this body of changes. Instead, I just return -1 for pattern sizes bigger than 4096 as the algorithm implementation currently can't handle them. This avoids crashing smart with a seg-fault.