qubole / rubix

Cache File System optimized for columnar formats and object stores
Apache License 2.0
183 stars 74 forks source link

Schedule validators on same thread to avoid conflicts #243

Closed jordanw-bq closed 5 years ago

jordanw-bq commented 5 years ago

Issue When the HeartbeatService starts, the caching & file validators are both started asynchronously, which means they will both be running at the same time. CachingValidator caches a file to verify that behaviour, which means this file will be available for FileValidator to validate. However, depending on the timing of the validator threads, when CachingValidator removes the cached file & metadata once it completes, the FileValidator will no longer be able to find this file and throw an exception (as seen below)

This behavior has been seen in a number of PR commits, but not consistently. As well, this issue does not occur when testing locally.

Fix Running these validators on the same thread should ensure that there are no conflicts for either validator when they are run.

Stack trace:

2018-11-20 22:55:05 [DEBUG] bookkeeper.BookKeeper:  blockLocation is: LOCAL for path file:///tmp/rubixCachingTestFile offset 0 length 2600
2018-11-20 22:55:05 [INFO ] bookkeeper.BookKeeper: Initializing FileSystem org.apache.hadoop.fs.LocalFileSystem@379c4f3b for Path file:/tmp/rubixCachingTestFile
2018-11-20 22:55:05 [DEBUG] core.ReadRequestChain: Request to add ReadRequest: [0, 1048576, 0, 1048576, 0]
2018-11-20 22:55:05 [DEBUG] core.ReadRequestChain: Added ReadRequest: [0, 1048576, 0, 1048576, 0]
2018-11-20 22:55:05 [DEBUG] core.RemoteReadRequestChain: Executing ReadRequest: [0, 1048576, 0, 1048576, 0]
2018-11-20 22:55:05 [DEBUG] core.RemoteReadRequestChain: PrefixLength: 0 SuffixLength: 0
2018-11-20 22:55:05 [DEBUG] core.RemoteReadRequestChain: Trying to Read 1048576 bytes into destination buffer
2018-11-20 22:55:05 [DEBUG] core.RemoteReadRequestChain: Read 2600 bytes into destination buffer
2018-11-20 22:55:05 [INFO ] core.RemoteReadRequestChain: Trying to copy [0 - 2600] bytes into cache from destination buffer offset 0 into localFile /tmp/TestBookKeeperServer0/fcache//tmp/rubixCachingTestFile
2018-11-20 22:55:05 [DEBUG] core.RemoteReadRequestChain: Copied 2600 requested bytes into cache
2018-11-20 22:55:05 [INFO ] core.RemoteReadRequestChain: Read 2600 bytes from remote localFile, added 2600 to destination buffer
2018-11-20 22:55:05 [DEBUG] core.RemoteReadRequestChain: Setting cached from : 0 block to : 1
2018-11-20 22:55:05 [DEBUG] bookkeeper.BookKeeper: Updating cache for file:///tmp/rubixCachingTestFile StarBlock : 0 EndBlock : 1
2018-11-20 22:55:05 [WARN ] bookkeeper.FileMetadata: Evicting file:///tmp/rubixCachingTestFile due to EXPLICIT
2018-11-20 22:55:05 [ERROR] validation.FileValidator: Encountered issue while verifying files
java.nio.file.NoSuchFileException: /tmp/TestBookKeeperServer0/fcache/tmp/rubixCachingTestFile_mdfile
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
    at sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
    at sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:97)
    at java.nio.file.Files.readAttributes(Files.java:1686)
    at java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:105)
    at java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:199)
    at java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:199)
    at java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:69)
    at java.nio.file.Files.walkFileTree(Files.java:2602)
    at java.nio.file.Files.walkFileTree(Files.java:2635)
    at com.qubole.rubix.bookkeeper.validation.FileValidator.validateCache(FileValidator.java:80)
    at com.qubole.rubix.bookkeeper.validation.FileValidator.runOneIteration(FileValidator.java:46)
    at com.google.common.util.concurrent.AbstractScheduledService$1$1.run(AbstractScheduledService.java:174)
    at com.google.common.util.concurrent.Callables$3.run(Callables.java:95)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:473)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
    at java.lang.Thread.run(Thread.java:748)