ostafen / smart

String Matching Algorithms Research Tool
https://smart-tool.github.io/smart/
GNU General Public License v3.0
4 stars 2 forks source link

Guidance for algorithm implementation. #54

Open nishihatapalmer opened 1 year ago

nishihatapalmer commented 1 year ago

We should define what the expected implementation of a smart algorithm looks like. Here's a few things it would be good to standardize and document. There are probably others.

Memory allocation

Memory de-allocation

Setting initial values

No need for null terminators

Must be able to handle arbitrary length patterns

Pattern validation

Error return codes

nishihatapalmer commented 1 year ago

Modification of text buffer

nishihatapalmer commented 1 year ago

Detection of text modification

I think I will put a text buffer modification test into the benchmark code as well as the test code, and mark algorithms that do it.

Knowing which algorithms use this technique would be a useful data point to bear in mind when comparing performance and evaluating suitability for an application.

nishihatapalmer commented 1 year ago

Large array allocations

nishihatapalmer commented 1 year ago

Obtaining good benchmark results Identify known sources of bench-marking variation external to the algorithm - e.g. process variation, cache priming, other OS tasks, performance profile / power management settings in OS or BIOS.

Quantify them where possible and give guidance on reducing the variation.

Define measures of good performance We currently just say the algorithm with the fastest mean time is the fastest, which is defensible but not always very useful. It's the one that smart marks out from the rest.

The variation in performance is also an important measure. If we have one algorithm that is a tiny bit faster than another, but has a much wider variation than the other, it is arguably not the best choice for many purposes.

We also have median statistics now, and maybe confidence intervals at some point.

nishihatapalmer commented 1 year ago

Use clear, structured programming Avoid the use of jump instructions (i.e. goto) in the algorithm code and any other coding optimizations that make the implementation unclear.

Note that certain algorithms use specific control flow structures that are not provided in the C language, for example, the while..else construct. If an algorithm requires goto into order to implement the algorithm as designed, then this is acceptable in the absence of a structured programming construct. Otherwise additional values would have to be set and tested for, and this would bias the performance of the algorithm negatively.

nishihatapalmer commented 1 year ago

Keep variants to a minimum Keep the number of variant implementations of an algorithm to a minimum. It's OK to have multiple implementations of an algorithm where necessary. For example, one for each length of q-gram it processes, as that strongly affects the performance of the hash algorithm.

Tuning For algorithms with a large parameter space, avoid publishing lots of different implementations each with slightly different parameter variants, with no guidance as to which ones should be used in which circumstances. If you know that, for example, a particular set of parameters works well on low alphabet searches, it's fine to provide those implementations separately as being tuned for that purpose.

In general Pick a good general set of default values for algorithm parameters and provide written guidance on tuning them further. The algorithm could also auto-tune itself, which would not create new implementations. Only provide variant implementations where you can clearly explain their purpose.

nishihatapalmer commented 1 year ago

Avoid use of pointer arithmetic It is possible to make search algorithms faster by using pointer arithmetic rather than indexing into the text. For example, instead of having an int search position pos and using y[pos], using an unsigned char *pos = y, incrementing that and accessing the text using *pos directly leads to faster execution time generally.

However, almost all algorithms in smart use indexing, with the exception of certain algorithms like EPSM that use SIMD intrinsics.

In order to achieve a fair comparison of the algorithms, it is recommended that pointer arithmetic is not used in the algorithms unless there is an unavoidable reason to do so.