What type of audio do you consider more complex than music? It basically works for all kinds of audio; music is just the most obvious application.
Yes, it's basically an FFT with peak detection and then matching the peaks up... All 4 algorithms work differently, and for a detailed explanation I suggest reading the papers (titles are listed in the readme).
Music is usually composed of instruments with distinctive timbres but, most importantly, a specific rhythm pattern which can easily be decomposed into peak distances over a spectrogram. The same doesn't apply to something like distinguishing a blender from a beater in a noisy kitchen.
Looking for peaks with FFT is not the same as STFT or CWT. I know that some of the algorithms in your list might work if implemented correctly, I'm just wondering if you tested something like this before. I'm choosing between three libraries and it'd be nice to have some validation before diving in.
As long as the audio is changing it should be possible to find peaks, and the algorithms are customizable through many parameters that can be adjusted to various use cases. I have never tried it with such sounds myself though. I recommend building AudioAlign, throwing two test files in there, and seeing what happens. Generally, fingerprinting is designed to find recordings of the same source, not similar sources, i.e. two different recordings of the same blender at the same time should match, while two recordings of the same blender at different times might not (since the blender isn't deterministic and doesn't always produce the exact same sound).
When talking about FFT in fingerprinting, it's basically always STFT, because we want to analyze how the audio changes over time and use these clues to generate patterns to compare. Just taking a simple FFT of a whole input file wouldn't really work, as we would only get the average spectrum of the input, with no information about when each frequency occurs.
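To illustrate the difference, here is a minimal Python sketch. It is not code from AudioAlign or from any of the four algorithms; the function names, the frame size of 64 samples, and the two synthetic test signals are all made up for the example, and a naive DFT stands in for a real FFT to keep it dependency-free. Two signals contain the same two tones in opposite order: their whole-file magnitude spectra are indistinguishable, while the per-frame (STFT) peaks show the order.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude spectrum (bins 0..N/2) via a naive DFT; fine for short inputs."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        mags.append(abs(acc))
    return mags


def stft_peak_bins(samples, frame_size, hop):
    """Index of the strongest frequency bin in each Hann-windowed frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_size)
              for i in range(frame_size)]
    peaks = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_size)]
        mags = dft_magnitudes(frame)
        peaks.append(max(range(len(mags)), key=mags.__getitem__))
    return peaks


def tone(bin_index, frame_size, length):
    """Sine wave whose frequency falls exactly on `bin_index` of a frame."""
    return [math.sin(2 * math.pi * bin_index * i / frame_size)
            for i in range(length)]


FRAME = 64  # hypothetical frame size, chosen only to keep the example small

# Same two tones, opposite order in time.
low_then_high = tone(4, FRAME, 256) + tone(12, FRAME, 256)
high_then_low = tone(12, FRAME, 256) + tone(4, FRAME, 256)

# Whole-file spectrum: the ordering information is lost.
whole_a = dft_magnitudes(low_then_high)
whole_b = dft_magnitudes(high_then_low)
same_whole_spectrum = all(abs(a - b) < 1e-6 for a, b in zip(whole_a, whole_b))

# STFT (non-overlapping frames here): the peak per frame reveals the order.
peaks_a = stft_peak_bins(low_then_high, FRAME, FRAME)
peaks_b = stft_peak_bins(high_then_low, FRAME, FRAME)

print("whole-file spectra identical:", same_whole_spectrum)
print("STFT peaks, low then high:", peaks_a)
print("STFT peaks, high then low:", peaks_b)
```

A real fingerprinter would use overlapping frames, pick several peaks per frame, and hash relations between them, but the time-resolved peak sequence is the part a whole-file FFT cannot provide.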