twitter-archive / kestrel

simple, distributed message queue system (inactive)
http://twitter.github.io/kestrel

Journal files not erased as they roll over #124

Open kperi opened 11 years ago

kperi commented 11 years ago

Quoting @andrewclegg and @robey from this post: https://github.com/robey/kestrel/issues/16

"We have queues where there are hundreds of gigabytes of old journal files for queues that currently have items=0. Obviously this takes up a lot of disk space, but it also means Kestrel takes ages to start up, as it replays all those old operations.

How can we force Kestrel to compact them?"

robey commented 11 years ago

Sorry for the delay -- I was out of town for a while.

There are two scripts built with kestrel that should help here:

$ ./dist/kestrel/scripts/qdump.sh --help

That will dump out a (sort of) human-readable list of the operations in a journal file, without the associated data. It might be useful for figuring out what's messed up, although if the journal is "hundreds of gigabytes", it might be too time-consuming to matter.

$ ./dist/kestrel/scripts/qpack.sh --help

This takes a set of journal files and packs them into their minimal operations (usually one ADD for each item that's currently in the queue). You should take the bad server out of rotation, run this script on its journal files, and then start it back up again with only the new journal file. (You can delete the old ones after it's finished packing the new one.) It might be helpful to "qdump" the new file before starting, in case there's something wonky.
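If it helps to picture what "packing" does: conceptually it just replays the journal's operations in order and keeps an ADD for whatever is still in the queue at the end. A rough sketch of the idea in plain Scala (illustrative only, not the actual Journal/JournalPacker code):

import scala.collection.mutable

// Illustrative sketch only -- not kestrel's real packer.
// Replay every operation; whatever is left in the in-memory queue is the
// minimal set of ADDs the packed journal needs to contain.
object PackSketch {
  sealed trait Op
  final case class Add(data: Array[Byte]) extends Op
  case object Remove extends Op

  def pack(ops: Seq[Op]): Seq[Add] = {
    val live = mutable.Queue[Add]()
    ops.foreach {
      case a: Add => live.enqueue(a)
      case Remove => live.dequeue()   // throws if a REM has no matching ADD
    }
    live.toSeq
  }
}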

kperi commented 11 years ago

Hi, thanks for the reply.

I've tried qdump on some of the files and I am getting something like:

qdump.sh my_queue.1373865238490

00000000 ADD 2601
00000a3e ADD 2641
000014a4 REM
000014a5 REM
000014a6 ADD 2597
00001ee0 ADD 2616
0000292d ADD 2437
000032c7 REM
000032c8 ADD 2572
00003ce9 ADD 2624
...
0199cf REM
000199d0 REM
000199d1 ADD 2649
0001a43f ADD 2649
0001aead ADD 2631
0001b909 REM
0001b90a REM
0001b90b REM
0001b90c REM

Exception in thread "main" java.util.NoSuchElementException: queue empty
        at scala.collection.mutable.Queue.dequeue(Queue.scala:65)
        at net.lag.kestrel.tools.QueueDumper.dumpItem(QDumper.scala:102)
        at net.lag.kestrel.tools.QueueDumper$$anonfun$apply$2.apply(QDumper.scala:47)
        at net.lag.kestrel.tools.QueueDumper$$anonfun$apply$2.apply(QDumper.scala:45)
        at scala.collection.Iterator$class.foreach(Iterator.scala:772)
        at scala.collection.Iterator$$anon$22.foreach(Iterator.scala:451)
        at net.lag.kestrel.tools.QueueDumper.apply(QDumper.scala:45)
        at net.lag.kestrel.tools.QDumper$$anonfun$main$1.apply(QDumper.scala:201)
        at net.lag.kestrel.tools.QDumper$$anonfun$main$1.apply(QDumper.scala:199)
        at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
        at scala.collection.immutable.List.foreach(List.scala:76)
        at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:30)
        at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:44)
        at net.lag.kestrel.tools.QDumper$.main(QDumper.scala:199)
        at net.lag.kestrel.tools.QDumper.main(QDumper.scala)
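If I'm reading the trace right, the dumper dies dequeueing from an empty in-memory queue while replaying the ops -- my guess (purely speculative) is that it tracks the items it has seen ADDed in this file, so a REM whose matching ADD lives in an earlier journal file leaves it with nothing to dequeue. The exception itself is trivial to reproduce in plain Scala (this is not QDumper's code):

import scala.collection.mutable.Queue

// Minimal reproduction of just the exception above (illustrative only):
// dequeueing from an empty mutable.Queue throws NoSuchElementException.
object EmptyDequeue {
  def main(args: Array[String]): Unit = {
    val q = Queue[Int]()
    q.dequeue()   // java.util.NoSuchElementException: queue empty
  }
}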

Other files in the same queue don't seem to have a problem with qdump, i.e. I get a full list of ops and qdump exits without exceptions.


Additionally, the overall queue + journal size dropped from 154G to 500M shortly after we started pushing new data to the queue.

Thanks, Kostas

andrewclegg commented 11 years ago

We've just run into the same situation that @kperi described above.

I'm trying to rescue the situation without losing any data, but when I try to run qpack.sh on the out-of-control journal files, I get errors like these:

Packing journals...
Packing: 2.2M   2.2M  Exception in thread "main" net.lag.kestrel.BrokenItemException: java.io.IOException: Unexpected EOF
        at net.lag.kestrel.Journal.readJournalEntry(Journal.scala:414)
        at net.lag.kestrel.Journal.next$1(Journal.scala:422)
        at net.lag.kestrel.Journal$$anonfun$next$1$1.apply(Journal.scala:427)
        at net.lag.kestrel.Journal$$anonfun$next$1$1.apply(Journal.scala:427)
        at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1060)
        at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1052)
        at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:952)
        at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:952)
        at scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:941)
        at scala.collection.immutable.StreamIterator.hasNext(Stream.scala:946)
        at scala.collection.TraversableOnce$FlattenOps$$anon$1.hasNext(TraversableOnce.scala:391)
        at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:457)
        at scala.collection.Iterator$class.foreach(Iterator.scala:772)
        at scala.collection.Iterator$$anon$22.foreach(Iterator.scala:451)
        at net.lag.kestrel.JournalPacker.apply(JournalPacker.scala:73)
        at net.lag.kestrel.tools.QPacker$.main(QPacker.scala:65)
        at net.lag.kestrel.tools.QPacker.main(QPacker.scala)
Caused by: java.io.IOException: Unexpected EOF
        at net.lag.kestrel.Journal.readBlock(Journal.scala:443)
        at net.lag.kestrel.Journal.readJournalEntry(Journal.scala:382)
        ... 16 more

This suggests one possible reason for the journals not getting packed automatically: the journal packer thread is failing on corrupt journal files because of this exception.

I couldn't find any direct evidence of this in the log files, but then, the disks were full due to this issue, so maybe the logger couldn't write the appropriate stacktrace. A bit speculative, but it's all I've got so far...
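For what it's worth, the "Unexpected EOF" itself is what you'd expect from a journal that got truncated mid-entry while the disk was filling up. A rough sketch of that failure mode, assuming a length-prefixed entry format (this is not kestrel's actual Journal.readBlock):

import java.io.{DataInputStream, EOFException, FileInputStream, IOException}

// Rough sketch of the failure mode, assuming length-prefixed entries
// (NOT kestrel's real journal reader): if the file was cut off mid-entry,
// the length header promises more bytes than the file still contains,
// and the read comes up short.
object TruncatedRead {
  def readBlock(in: DataInputStream): Array[Byte] = {
    val len = in.readInt()          // length header of the next entry
    val buf = new Array[Byte](len)
    try in.readFully(buf)           // EOF here => the entry was cut off
    catch {
      case _: EOFException => throw new IOException("Unexpected EOF")
    }
    buf
  }

  def main(args: Array[String]): Unit = {
    val in = new DataInputStream(new FileInputStream(args(0)))
    try println(readBlock(in).length + " bytes read")
    finally in.close()
  }
}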

andrewclegg commented 11 years ago

Note: there were a lot of timestamped journal files with size 0, but I removed all of these before trying to run qpack.sh.