Well, please show the log outputs for the hanging case using the following command:
GLOG_v=5 GLOG_logtostderr=1 /path/to/quickstep_cli_shell
I linked it because it's a lot of text.
The file shows the output from a debug build of quickstep after issuing the SELECT COUNT(*) that causes the hang. (The production build doesn't produce log messages.)
My reading of this output is that the optimizer successfully hands the query off to the foreman even for the 4th COUNT query, the one that caused the hang, since it creates an optimized physical plan: I0308 14:25:31.845458 12279 PhysicalGenerator.cpp:88] Optimized physical plan:
@pateljm mentioned that he'd like to hold a bug hunting session where @saketj , @jianqiao and whoever else might be interested would go through and try to find the cause of this.
@cramja The logs you provided only show the output from the parse tree to the optimized physical plan, for the same query four times. (And yes, these outputs are created using DVLOG, so they won't show up in release mode.) However, we need more info about this hanging bug after the physical plan has been generated.
What you need to do is add logging statements in /cli/QuickstepCli.cpp and /query_execution, including Foreman and Worker.
See https://github.com/Pivotal-DataFabric/quickstep/blob/distributed-prototype/query_execution/Foreman.cpp#L260 for example.
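For illustration only (the function name and message below are hypothetical, not actual Foreman/Worker code; only the DVLOG usage mirrors the linked example), a logging statement of the kind being suggested looks like this. It prints when a debug build is run with a high enough GLOG_v (plus GLOG_logtostderr=1) and is compiled out in release builds.

// Hypothetical sketch of where/how to add a verbose log line.
#include "glog/logging.h"

void onWorkOrderCompletion(const int worker_thread_index) {
  // Visible with GLOG_v >= 3 in a debug build; a no-op in release builds.
  DVLOG(3) << "Received completion message from worker thread "
           << worker_thread_index;
}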
Btw, which branch and commit are you using? Have you checked out any previous commits to see whether this hang exists there as well?
@jianqiao @pateljm and I sat down today to look for the bug. We determined it is a concurrency bug in the StorageManager. Specifically, how locks are acquired around and in the method:
void StorageManager::makeRoomForBlock(
    const size_t slots,
    block_id locked_block_id) { ... }
If you are trying to recreate the bug, use a large number of workers, a large dataset (sf10 SSB), and a large buffer pool (the default is fine). Having many threads increases the likelihood of the interleaving that causes the deadlock.
Also, @jianqiao suggested a fix using lock acquisition ordering, which he is now working on.
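For reference, the general shape of a lock-acquisition-ordering fix looks like the sketch below. This is only an illustration of the technique, not the actual patch; the mutex names are made up. Every path that needs both locks takes them in one fixed order, or acquires them together with std::lock.

// Sketch of the lock-ordering idea, not the real StorageManager change.
#include <mutex>

std::mutex block_map_mutex;   // illustrative: guards block lookup state
std::mutex eviction_mutex;    // illustrative: guards "make room" bookkeeping

void makeRoomForBlockOrdered() {
  // Every path that needs both mutexes acquires them in this same order
  // (block_map_mutex before eviction_mutex), so no circular wait can form.
  std::lock_guard<std::mutex> g1(block_map_mutex);
  std::lock_guard<std::mutex> g2(eviction_mutex);
  // ... choose a victim block and reclaim its slots ...
}

void makeRoomForBlockDeadlockFree() {
  // Equivalent alternative: std::lock grabs both mutexes using its
  // deadlock-avoidance algorithm, then the guards adopt ownership.
  std::lock(block_map_mutex, eviction_mutex);
  std::lock_guard<std::mutex> g1(block_map_mutex, std::adopt_lock);
  std::lock_guard<std::mutex> g2(eviction_mutex, std::adopt_lock);
  // ...
}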
For my own understanding, I whipped up a sample program that emulates the solution to the problem. To run it, try using the name of the gist as the command to compile it.
The main difference is that in the StorageManager code, there's a very long chain of function calls between lock acquisitions:
+first Lock--------------------------------------+second lock
loadBlock -> loadBlockOrBlob -> allocateSlots -> makeRoomForBlock
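In the same spirit, here is a toy emulation of that shape (this is not the gist and not Quickstep code; all names are illustrative and only mirror the call chain). The first mutex is taken near the top of the chain and the second only several calls deeper, so a second code path that takes the two in the opposite order can deadlock against it without the inversion being obvious in review.

// Toy emulation of the call-chain shape; all names are illustrative.
#include <mutex>
#include <thread>

std::mutex first_lock;    // taken near the top, around loadBlock()
std::mutex second_lock;   // taken deep inside makeRoomForBlock()

void makeRoomForBlockToy() {
  std::lock_guard<std::mutex> g(second_lock);   // second acquisition
  // ... pick a victim block and reclaim its slots ...
}
void allocateSlotsToy()   { makeRoomForBlockToy(); }
void loadBlockOrBlobToy() { allocateSlotsToy(); }

void loadBlockToy() {
  std::lock_guard<std::mutex> g(first_lock);    // first acquisition
  loadBlockOrBlobToy();  // the second lock is taken three calls later
}

// A hypothetical second path that takes the same locks in the opposite order.
void evictToy() {
  std::lock_guard<std::mutex> g2(second_lock);
  std::lock_guard<std::mutex> g1(first_lock);
  // ... write the victim block back to persistent storage ...
}

int main() {
  std::thread t1(loadBlockToy);
  std::thread t2(evictToy);   // with unlucky interleaving, neither finishes
  t1.join();
  t2.join();
}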
Looking at the specific errors we are running into on the bugfix branch:
If you run ctest -R executiongenerator -j80 --output-on-failure, you'll likely get a CorruptPersistentStorage error (it's interleaving dependent). Try it with one thread and it will not happen.
I think it's likely that the ExecutionGenerator tests are creating and deleting the same tmp file (qsblk_0001_0000000000000000001.qsb) to hold their blocks, and when we do a read within the StorageManager code (num_slots, @jianqiao), it gets bungled results.
It's curious because there should be a lock on that file. Will look into it again tomorrow.
I added a check to the ExecutionGeneratorTestRunner to see if the .qsb file existed before proceeding with the test. However, this was unsuccessful because you could still get an interleaving where two threads tested for the file's non-existence at the same time, and both passed. They would then proceed to create the same block.
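That failure is the usual check-then-act race; the standard way around it is to make creation itself the check, e.g. with O_CREAT | O_EXCL. Below is a minimal POSIX sketch of both variants (not the actual test-runner change; the helper names are made up).

// Why "check, then create" races, and an atomic alternative.
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Racy: two threads can both observe "file absent" and both proceed.
bool claimBlockFileRacy(const char *path) {
  struct stat unused;
  if (stat(path, &unused) == 0) {
    return false;                                       // already exists
  }
  const int fd = open(path, O_CREAT | O_WRONLY, 0644);  // both can get here
  if (fd < 0) return false;
  close(fd);
  return true;
}

// Atomic: O_EXCL makes open() fail for whichever thread loses the race,
// so exactly one caller "wins" the file.
bool claimBlockFileAtomic(const char *path) {
  const int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
  if (fd < 0) return false;   // EEXIST: another thread created it first
  close(fd);
  return true;
}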
@jianqiao Making a PR, will begin commenting there instead of on this issue.
Resolved by #85. Thanks @cramja and @jianqiao!
I was running SSB queries this past weekend and noticed that some of them would hang indefinitely.
To reproduce the issue:
./quickstep_cli_shell --num_workers=80 --printing_enabled=false
There is a lineorder table from SSB already loaded into a shared folder on the Quickstep box if you would like to use it. Now run the SELECT COUNT(*) query a bunch of times. It should hang on the 4th or 5th repetition.
Notes:
COUNT queries submitted to quickstep from a single file, like $QS < count.sql, will not reproduce the issue.