Open nishihatapalmer opened 1 year ago
Modification of text buffer
#define NUM_PATTERNS_AT_END_OF_TEXT 2
.Detection of text modification
I think I will put a text buffer modification test into the benchmark code as well as the test code, and mark algorithms that do it.
Knowing which algorithms use this technique would be a useful data point to bear in mind when comparing performance and evaluating suitability for an application.
Large array allocations
A[m][SIGMA]
.Obtaining good benchmark results Identify known sources of bench-marking variation external to the algorithm - e.g. process variation, cache priming, other OS tasks, performance profile / power management settings in OS or BIOS.
Quantify them where possible and give guidance on reducing the variation.
Define measures of good performance We currently just say the algorithm with the fastest mean time is the fastest, which is defensible but not always very useful. It's the one that smart marks out from the rest.
The variation in performance is also an important measure. If we have one algorithm that is a tiny bit faster than another, but has a much wider variation than the other, it is arguably not the best choice for many purposes.
We also have median statistics now, and maybe confidence intervals at some point.
Use clear, structured programming Avoid the use of jump instructions (i.e. goto) in the algorithm code and any other coding optimizations that make the implementation unclear.
Note that certain algorithms use specific control flow structures that are not provided in the C language, for example, the while..else construct. If an algorithm requires goto into order to implement the algorithm as designed, then this is acceptable in the absence of a structured programming construct. Otherwise additional values would have to be set and tested for, and this would bias the performance of the algorithm negatively.
Keep variants to a minimum Keep the number of variant implementations of an algorithm to a minimum. It's OK to have multiple implementations of an algorithm where necessary. For example, one for each length of q-gram it processes, as that strongly affects the performance of the hash algorithm.
Tuning For algorithms with a large parameter space, avoid publishing lots of different implementations each with slightly different parameter variants, with no guidance as to which ones should be used in which circumstances. If you know that, for example, a particular set of parameters works well on low alphabet searches, it's fine to provide those implementations separately as being tuned for that purpose.
In general Pick a good general set of default values for algorithm parameters and provide written guidance on tuning them further. The algorithm could also auto-tune itself, which would not create new implementations. Only provide variant implementations where you can clearly explain their purpose.
Avoid use of pointer arithmetic
It is possible to make search algorithms faster by using pointer arithmetic rather than indexing into the text. For example, instead of having an int search position pos
and using y[pos]
, using an unsigned char *pos = y
, incrementing that and accessing the text using *pos
directly leads to faster execution time generally.
However, almost all algorithms in smart use indexing, with the exception of certain algorithms like EPSM that use SIMD intrinsics.
In order to achieve a fair comparison of the algorithms, it is recommended that pointer arithmetic is not used in the algorithms unless there is an unavoidable reason to do so.
We should define what the expected implementation of a smart algorithm looks like. Here's a few things it would be good to standardize and document. There are probably others.
Memory allocation
Memory de-allocation
Setting initial values
No need for null terminators
Must be able to handle arbitrary length patterns
Pattern validation
#define
that provides a completely standard validation function?Error return codes
#define CANNOT_SEARCH -1 // already exists, but use the explicit name for clarity.
#define ERROR_SEARCHING -2 // use this instead of exit(1) when an unrecoverable error is detected.