Open jfs opened 8 years ago
I performed a few tests using the supplied test harness, and have some additional information to provide:

- `taskset` was used to bind the JVM process to a single core
- `taskset` was used to bind the process to 2/4 available processor cores

I have a problem similar to this one, but my program has only one worker thread, so you can think of it as a single-threaded program. Sometimes the `Put` operation fails, and leveldb's log shows:
2017/05/26-15:00:44.947831 7f62a505d7c0 Delete type=3 #1
2017/05/26-15:00:44.948561 7f62937ff700 Level-0 table #5: started
2017/05/26-15:00:44.948622 7f62937ff700 Level-0 table #5: 0 bytes OK
2017/05/26-15:00:44.990261 7f62937ff700 Delete type=0 #3
2017/05/26-15:00:44.990352 7f62937ff700 Manual compaction at level-0 from (begin) .. (end); will stop at (end)
2017/05/27-07:58:16.065847 7f62937ff700 Level-0 table #7: started
2017/05/27-07:58:16.116611 7f62937ff700 Level-0 table #7: 3081599 bytes OK
2017/05/27-07:58:16.122961 7f62937ff700 Delete type=0 #4
2017/05/27-14:02:05.646892 7f62937ff700 Level-0 table #9: started
2017/05/27-14:02:05.691227 7f62937ff700 Level-0 table #9: 3188392 bytes OK
2017/05/27-14:02:05.697950 7f62937ff700 Delete type=0 #6
2017/05/27-14:29:49.691848 7f62937ff700 Compacting 1@1 + 1@2 files
2017/05/27-14:29:49.707433 7f62937ff700 Generated table #10: 377 keys, 6152 bytes
2017/05/27-14:29:49.707472 7f62937ff700 Compacted 1@1 + 1@2 files => 6152 bytes
2017/05/27-14:29:49.712960 7f62937ff700 compacted to: files[ 0 0 1 0 0 0 0 ]
2017/05/27-14:29:49.713244 7f62937ff700 Delete type=2 #9
2017/05/27-14:29:49.713990 7f62937ff700 Delete type=2 #7
2017/05/27-19:03:13.651491 7f62937ff700 Level-0 table #12: started
2017/05/27-19:03:13.696402 7f62937ff700 Level-0 table #12: 3240067 bytes Corruption: not an sstable (bad magic number)
CentOS 7.2 64-bit, multi-core VM, GCC 4.4.7. The LevelDB version was 1.18; I have now upgraded to 1.20 (not tested yet).
How can I solve this problem? Thanks a lot @cmumford
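For anyone debugging this: the "bad magic number" message means leveldb found something other than the expected footer magic in the last 8 bytes of a table file. A small stand-alone checker can confirm whether a given file's footer is intact; this is just a sketch, with the magic constant copied from leveldb's `table/format.h` (the function name is ours, not part of the API):

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// leveldb table files end with a footer whose final 8 bytes are a fixed
// little-endian magic number (kTableMagicNumber in table/format.h).
// "not an sstable (bad magic number)" means this check failed when the
// file was opened. Returns true if the file ends with the expected magic.
bool HasTableMagic(const std::string& path) {
  static const uint64_t kTableMagicNumber = 0xdb4775248b80fb57ull;
  std::ifstream f(path, std::ios::binary | std::ios::ate);
  if (!f || f.tellg() < 8) return false;
  f.seekg(-8, std::ios::end);
  unsigned char buf[8];
  f.read(reinterpret_cast<char*>(buf), 8);
  // Decode the 8 bytes as a little-endian 64-bit integer.
  uint64_t magic = 0;
  for (int i = 7; i >= 0; --i) magic = (magic << 8) | buf[i];
  return magic == kTableMagicNumber;
}
```

Running this over the `*.ldb` / `*.sst` files of a database that fails to open can at least tell you which file the corruption landed in.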
Hi all, is there any way to fix this issue? I am facing it on OpenWRT. I can re-open the database if I remove the *.log file.
One of the LevelDB co-creators, ghemawat, appears to be aware of this problem. I don't know if he knows how many people are being affected by it. Many of us in the STORJ project seem to be affected by it. Recommend reaching out to him. (see issue below)
Does somebody have a stand-alone C++ test that demonstrates the problem?
On Tue, Aug 14, 2018 at 10:21 AM Hypocritus notifications@github.com wrote:
One of the LevelDB co-creators, ghemawat, is aware of this problem. I don't know if he knows how many people are being affected by it. Many of us in the STORJ project seem to be affected by it. Recommend reaching out to him. (see issue below)
#474 https://github.com/google/leveldb/issues/474
We have found a scenario where we can create `not an sstable (bad magic number)` errors consistently on Linux while wrapping the library in the JNA API. We believe we are using the LevelDB API correctly, though if you see an error in this test code, we'd love to know that's what's causing it.

We have experienced errors on every Linux we have tried. It seems to happen faster on virtualized Linux, but we've also run into errors while running Linux directly on metal. The magic sauce for getting a faster time to failure is setting the `-Xloggc` option on the JVM to a location that does not exist. We believe this creates enough latency in the JVM to exacerbate a problem that is always present.

The following will complete (though, as we said, given enough time we believe it will still see corruption errors):
The following will fail with corruption without a change in the code:
With this issue report we are providing a Gradle project and scripts for a test harness that is able to recreate this error in seconds by running on Docker (by far the shortest time to failure we have found). In other configurations the time to failure is measured in hours (running on Azure or AWS) or days (with `-Xloggc` pointing to a location that exists), but we are confident in saying that it's not a matter of if, but when, the error will occur.

We would be more than happy to test any patches or answer your questions. We've spent weeks narrowing this problem down to its simplest example, as it is critical to us that it is fixed.
Versions used for the test
Test error log
As you can see in the log excerpt below, the time to failure for this test harness was less than a second (sometimes it takes several seconds).
Running the test
leveldbtest/bin/run_docker.sh
What the test does
This test simply creates and removes random byte blobs from the database. The blobs are randomly sized, and written in random length batches. When the total bytes in the database exceeds 10M the blobs are deleted and the process continues.
The main method is in `leveldb.test.LevelDbBlockStoreHarness`. At the top of the file are a number of knobs you can turn to modify the execution of the test harness during your testing.

Notes
Test Harness
sstable_corruption_test.zip