Closed egg82 closed 1 year ago
`trainFromBufferDirect` requires the samples and the output dict to be direct byte buffers. Can you check that you are not providing non-direct byte buffers?
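A quick way to perform that check, as a minimal sketch using only `java.nio` (the `Zstd` calls themselves are omitted): a buffer created with `ByteBuffer.allocateDirect` reports `isDirect() == true`, while a heap buffer from `ByteBuffer.allocate` does not.

```java
import java.nio.ByteBuffer;

public class DirectBufferCheck {
    public static void main(String[] args) {
        // Heap buffer: backed by a byte[] on the Java heap; isDirect() is false.
        ByteBuffer heap = ByteBuffer.allocate(1024);

        // Direct buffer: allocated outside the Java heap. This is the kind of
        // buffer that the *Direct* zstd-jni methods expect for both the
        // samples and the output dictionary.
        ByteBuffer direct = ByteBuffer.allocateDirect(1024);

        System.out.println("heap.isDirect()   = " + heap.isDirect());
        System.out.println("direct.isDirect() = " + direct.isDirect());
    }
}
```

Logging `isDirect()` on each buffer just before the training call is an easy way to confirm which kind is actually being passed in.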
Somehow I forgot to add a critical line in my original post, the actual samples buffer. Thanks for reminding me! I've also edited the original post with the new line, below.
```java
private final ByteBuf samples = Unpooled.buffer();
```
Looks like we use `directBuffer` instead of `buffer` elsewhere in our codebase, where we use `Zstd.compressDirectByteBufferFastDict`, `Zstd.compressDirectByteBuffer`, `Zstd.decompressDirectByteBufferFastDict`, and `Zstd.decompressDirectByteBuffer`. That might help.
I'll fix that and get back to you with the result, but I wanted to mention something while I do that:
Looks like using `Unpooled.directBuffer` and `ByteBufUtil.ALLOC.directBuffer` fixed it! Still curious about the segfault while using `Zstd.trainFromBuffer`, however.
I am not sure why this happens. Here is a quick check, and it works as expected:
```scala
scala> import scala.io._; import java.nio._; import com.github.luben.zstd.Zstd
...

scala> def source = Source.fromFile("src/test/resources/xml")(Codec.ISO8859).map{_.toByte}
def source: Iterator[Byte]

scala> val src = source.sliding(1024, 1024).take(1024).map(_.toArray)
val src: Iterator[Array[Byte]] = <iterator>

scala> var arr = new Array[Array[Byte]](1024)
...

scala> var i = 0
var i: Int = 0

scala> for (sample <- src) { arr(i) = sample; i+=1 }

scala> var dict = new Array[Byte](32*1024)
...

scala> Zstd.trainFromBuffer(arr, dict, false)
val res8: Long = 13359

scala> Zstd.trainFromBuffer(arr, dict, true)
val res9: Long = 32768

scala> dict
val res13: Array[Byte] = Array(55, -92, 48, -20, 93, -43, 24, 89, 56, 16, -72, -110, 40, 29, -4, -1, -1, -1, -1, -1, -1, -1, 95, 32, 65, 26, -39, 123, 117, -109, -101, 68, 106, 119, 35, -62, -26, 107, -116, 23, -3, 1, -48, 71, -56, 31, -59, -56, 104, 19, 49, -43, 16, 2, ...
```
Huh, very odd. Might just be my configuration somehow? Corretto on Win 10 64-bit? Not sure.
Yes, I also use Corretto 11, but on Linux.
I think there was an issue upstream when the number of samples was less than 10. I put some guard-rails in place so that this does not happen anymore: https://github.com/luben/zstd-jni/commit/253adafe345917b1221b2b285bb92def1e38d2af
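Until a release with that guard-rail lands, callers can apply the same check on their side before training. A minimal sketch; the helper name `safeToTrain` and the constant are illustrative (the threshold of 10 comes from the comment above), not part of the zstd-jni API:

```java
// Hypothetical caller-side guard mirroring the upstream fix: skip dictionary
// training when too few samples are available, instead of risking a crash in
// native code. Neither the class nor the threshold constant is part of zstd-jni.
public class TrainGuard {
    static final int MIN_SAMPLES = 10; // threshold mentioned in the issue

    static boolean safeToTrain(byte[][] samples) {
        return samples != null && samples.length >= MIN_SAMPLES;
    }

    public static void main(String[] args) {
        System.out.println(safeToTrain(new byte[5][16]));    // too few samples
        System.out.println(safeToTrain(new byte[1024][16])); // enough samples
    }
}
```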
When using `Zstd.trainFromBuffer` or `Zstd.trainFromBufferDirect`, the JVM segfaults with the following message:

The full report is available here. The contents exceed GitHub's length limit.

The code I'm (currently) using is below, although the same problem happens with `trainFromBuffer`, which I took as an excuse to move to direct buffers. Unfortunately that method seems to be having issues as well.

Relevant parts of ByteBufUtil.java:
Relevant parts of ZstdDictionaryImpl.java: