lz4 / lz4-java

LZ4 compression for Java
Apache License 2.0

Bad decompression performance on repeating data with short period (< 32) #72

Open mtopolnik opened 9 years ago

mtopolnik commented 9 years ago

I've studied the implementation of wildIncrementalCopy and found that it takes a suboptimal approach to copying ranges that are close to each other (source and destination offsets differ by 32 bytes or less). This manifests as poor decompression performance when the input data contains a short byte sequence that repeats many times.
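
To make the constraint concrete (a minimal sketch of the general problem, not of the existing lz4-java code): an incremental copy reproduces a match that overlaps its own destination, i.e. it copies within one buffer with dest = src + distance. Copying in 8-byte chunks is only valid when the distance is at least 8, because each chunk read must stay behind the bytes already written; shorter distances force something closer to a byte-at-a-time loop, and that is exactly what short-period repeating input produces.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Overlapping ("incremental") copy inside one buffer, dest = src + distance.
// The byte-by-byte loop is always correct; the 8-byte-chunk loop is only
// correct when distance >= 8, because every read must stay behind the
// bytes that have already been written.
static void incrementalCopyByteByByte(byte[] buf, int srcOff, int destOff, int len) {
    for (int i = 0; i < len; i++) {
        buf[destOff + i] = buf[srcOff + i];
    }
}

static void incrementalCopyChunked(byte[] buf, int srcOff, int destOff, int len) {
    // Requires destOff - srcOff >= 8 and up to 7 bytes of slack after destOff + len.
    final ByteBuffer bb = ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN);
    for (int i = 0; i < len; i += 8) {
        bb.putLong(destOff + i, bb.getLong(srcOff + i));
    }
}

A 5-byte repeating pattern, for instance, decodes as one long match with distance 5, so the chunked loop above cannot be used as-is; the implementation below instead builds the pattern into a full 8-byte word and stamps that out.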

I have written an implementation of wildIncrementalCopy (specialized for little-endian architectures, but easily adaptable to big-endian) that takes a different approach to copying narrow ranges. I submit it below in case there is interest in making use of it.

I have also slightly improved the use of wild incremental copying so that the fallback to safe incremental copying happens only over the range where it is truly necessary:

if (matchCopyEnd > destEnd - COPY_LENGTH) {
    if (matchCopyEnd > destEnd) {
        throw new LZ4Exception("Malformed input at " + sOff);
    }
    // Wild-copy only up to destEnd - COPY_LENGTH so the wild copy's overshoot stays within the output.
    int wildCopyEnd = destEnd - COPY_LENGTH;
    int wildCopyLen = wildCopyEnd - dOff;
    if (wildCopyLen >= 8) {
        wildIncrementalCopyLE(dest, matchOff, dOff, wildCopyEnd);
    } else {
        // Fewer than 8 bytes worth wild-copying: let the safe copy handle the whole match.
        wildCopyEnd = dOff;
        wildCopyLen = 0;
    }
    // Finish the remaining tail with the bounds-respecting safe copy.
    safeIncrementalCopy(dest, matchOff + wildCopyLen, wildCopyEnd, matchCopyEnd - wildCopyEnd);
} else {
    wildIncrementalCopyLE(dest, matchOff, dOff, matchCopyEnd);
}

With these changes, my JMH measurements of decompressing short-period repeating data show close to a 4x speedup over the current Unsafe impl and a 2x speedup over the native impl. Funnily enough, the Safe impl fares very well on this test and is significantly faster than the native one, although not as fast as the implementation I'm providing.
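
For reproduction, the benchmark setup is roughly of the following shape (an illustrative sketch rather than the exact JMH code I ran; the class and method names are made up, and it only uses the public LZ4Factory API):

import java.util.concurrent.TimeUnit;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class ShortPeriodDecompressionBenchmark {

    private static final int ORIGINAL_LEN = 1 << 20;

    private byte[] compressed;
    private byte[] dest;
    private LZ4FastDecompressor unsafeDecompressor;

    @Setup
    public void setup() {
        // Input made of a repeating 5-byte pattern, so every match has distance 5 (< 8).
        final byte[] pattern = {10, 20, 30, 40, 50};
        final byte[] src = new byte[ORIGINAL_LEN];
        for (int i = 0; i < src.length; i++) {
            src[i] = pattern[i % pattern.length];
        }
        compressed = LZ4Factory.unsafeInstance().fastCompressor().compress(src);
        dest = new byte[ORIGINAL_LEN];
        unsafeDecompressor = LZ4Factory.unsafeInstance().fastDecompressor();
    }

    @Benchmark
    public byte[] decompressUnsafe() {
        unsafeDecompressor.decompress(compressed, 0, dest, 0, ORIGINAL_LEN);
        return dest;
    }
}

Swapping in LZ4Factory.safeInstance() or LZ4Factory.nativeInstance() gives the other two data points mentioned above. The wildIncrementalCopyLE implementation and its helpers follow.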

static void wildIncrementalCopyLE(byte[] buf, int srcStart, int destStart, int destEnd) {
    final int distance = destStart - srcStart;
    if (distance > 32) {
        // Source and destination are far enough apart for plain 8-byte chunking.
        wildArraycopy(buf, srcStart, buf, destStart, destEnd - destStart);
    } else if (distance == 8) {
        // The repeating unit is exactly one long: read it once and stamp it out.
        final long val = readLong(buf, srcStart);
        for (int off = destStart; off < destEnd; off += 8) {
            writeLong(buf, off, val);
        }
    } else if ((srcStart & 7) == (destStart & 7)) {
        // Equal alignment means the distance is a multiple of 8, so 8-byte chunking
        // stays behind the bytes already written.
        wildArraycopy(buf, srcStart, buf, destStart, destEnd - destStart);
    } else if (distance > 8) {
        wildIncrementalMisalignedCopyLE(buf, srcStart, destStart, destEnd);
    } else {
        // distance < 8: replicate the short pattern inside a single long.
        wildIncrementalNarrowCopyLE(buf, srcStart, destStart, destEnd);
    }
}

static void wildIncrementalMisalignedCopyLE(byte[] buf, int srcStart, int destStart, int destEnd) {
    assert destStart - srcStart > 8 && (srcStart & 7) != (destStart & 7);
    // Bring the destination up to an 8-byte boundary first: write one merged long that
    // adds the first bytesToAlignDest match bytes while preserving what precedes destStart.
    final int bytesToAlignDest = -destStart & 7;
    if (bytesToAlignDest != 0) {
        final long mask = -1L >>> 8 * bytesToAlignDest;
        final int destStartAligned = destStart & ~7;
        final long srcVal = readLong(buf, srcStart - (destStart & 7)) & ~mask;
        final long destVal = readLong(buf, destStartAligned) & mask;
        writeLong(buf, destStartAligned, srcVal | destVal);
        destStart = destStartAligned + 8;
        srcStart += bytesToAlignDest;
    }
    // Main loop: each output long is assembled from two source longs read at 8-byte-aligned
    // offsets, shifted to compensate for the source misalignment.
    final int alignRightShiftBytes = srcStart & 7;
    final int alignLeftShiftBytes = (8 - alignRightShiftBytes) & 7;
    final int alignRightShift = 8 * alignRightShiftBytes;
    final int alignLeftShift = 8 * alignLeftShiftBytes;
    int alignedReadOff = srcStart & ~7;
    long val1 = readLong(buf, alignedReadOff) >>> alignRightShift;
    for (int writeOff = destStart; writeOff < destEnd; writeOff += 8) {
        alignedReadOff += 8;
        final long val2 = readLong(buf, alignedReadOff);
        writeLong(buf, writeOff, val1 | val2 << alignLeftShift);
        val1 = val2 >>> alignRightShift;
    }
}

static void wildIncrementalNarrowCopyLE(byte[] buf, int srcStart, int destStart, int destEnd) {
    final int distance = destStart - srcStart;
    assert distance < 8;
    // Read the pattern once, then replicate it across a full 8-byte long below.
    long val = readLong(buf, srcStart);
    final int shiftDistance = 8 * distance;
    long maskedVal = val & ~(-1L << shiftDistance);
    if (distance > 3) {
        val <<= shiftDistance;
        val |= maskedVal;
    } else {
        switch (distance) {
            case 3: {
                val = maskedVal;
                maskedVal <<= shiftDistance;
                val |= maskedVal;
                maskedVal <<= shiftDistance;
                val |= maskedVal;
                break;
            }
            case 2: {
                val = maskedVal | maskedVal << shiftDistance;
                val |= val << 2 * shiftDistance;
                break;
            }
            case 1: {
                val = maskedVal | maskedVal << shiftDistance;
                val |= val << 2 * shiftDistance;
                val |= val << 4 * shiftDistance;
            }
        }
    }
    // Distances 1, 2 and 4 tile an 8-byte long exactly; for 3, 5, 6 and 7 the pattern
    // advances by 8 % distance bytes per long written, so val is shifted between writes.
    switch (distance) {
        case 4:
        case 2:
        case 1:
            for (int off = destStart; off < destEnd; off += 8) {
                writeLong(buf, off, val);
            }
            break;
        default:
            // slipDistance = 8 * (8 % distance), but avoid the mod operation:
            int slipDistance = distance == 3? 8 * (8 % 3) : 64 - shiftDistance;
            final int wrapDistance = 64 - slipDistance;
            for (int off = destStart; off < destEnd; off += 8) {
                writeLong(buf, off, val);
                val >>>= slipDistance;
                val |= val << wrapDistance;
            }
    }
}

// Copies in 8-byte chunks and may overshoot len by up to 7 bytes.
static void wildArraycopy(byte[] src, int srcOff, byte[] dest, int destOff, int len) {
    for (int i = 0; i < len; i += 8) {
        writeLong(dest, destOff + i, readLong(src, srcOff + i));
    }
}
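
The listing above assumes 8-byte little-endian readLong/writeLong helpers (in the project these would map onto the existing unsafe accessors) and at least 8 bytes of slack past destEnd, since the wild copies intentionally overshoot. For trying it out standalone, a stand-in for the helpers plus a quick cross-check against a plain byte-by-byte reference copy could look like this (a sketch, not part of the submission; checkAgainstReference is a made-up name, placed in the same class as the methods above):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

static long readLong(byte[] buf, int off) {
    return ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN).getLong(off);
}

static void writeLong(byte[] buf, int off, long val) {
    ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN).putLong(off, val);
}

// Compares wildIncrementalCopyLE against a byte-by-byte reference copy.
// Only the [0, destEnd) range is compared; the 8 spare bytes at the end
// are scratch space for the wild copy's overshoot.
static void checkAgainstReference(int distance, int copyLen) {
    final int srcStart = 16;                         // leave room before the match
    final int destStart = srcStart + distance;
    final int destEnd = destStart + copyLen;
    final byte[] wild = new byte[destEnd + 8];
    for (int i = 0; i < destStart; i++) {
        wild[i] = (byte) (i * 31 + 7);               // arbitrary already-decoded prefix
    }
    final byte[] reference = wild.clone();
    wildIncrementalCopyLE(wild, srcStart, destStart, destEnd);
    for (int i = 0; i < copyLen; i++) {
        reference[destStart + i] = reference[srcStart + i];
    }
    if (!Arrays.equals(Arrays.copyOf(wild, destEnd), Arrays.copyOf(reference, destEnd))) {
        throw new AssertionError("mismatch at distance=" + distance + ", copyLen=" + copyLen);
    }
}

Looping checkAgainstReference over distances 1 through 40 and a spread of copy lengths exercises every branch of the dispatcher above (varying srcStart, which is fixed at 16 here, covers the remaining alignment combinations inside the helpers).
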
odaira commented 7 years ago

Can I close this issue, as the discussion at lz4/lz4#126 ended two years ago?

mtopolnik commented 7 years ago

The code I submitted here is thoroughly tested for both correctness and performance and it can be swapped into the project at will.

The changes I propose bring more perf to the Java Unsafe implementation. This fact is independent of the work on the native LZ4 project since the Unsafe implementation doesn't rely on it. We (Hazelcast) have no active interest in this at the moment since we decided not to use any compression (modern SSDs perform far too well for compression to provide any performance benefit).

So I'm fine with any decision you make.