iconara / cql-rb

Cassandra CQL 3 binary protocol driver for Ruby

Memory leak? #102

Closed ds1982 closed 10 years ago

ds1982 commented 10 years ago

I use your "cql-rb" like this:

db = Cql::Client.connect(hosts: ['10.0.1.2'])
db.use('database')

filename="test.sql"

countrows=0
selectiontime=0

File.foreach(filename) {|line|

 selectstart = Time.now.to_f
 stmt = db.prepare(line)

 begin
   stmt.execute()
 rescue Exception => e
   puts line
   puts e
 end
 selectend = Time.now.to_f
 selectiontime += (selectend - selectstart)
 countrows+=1
}
printf("Insert-Time;%d;%.6f\n", countrows, selectiontime)

The SQL statements in the input file look like this:

INSERT INTO "messwert" ("senid","nam","day","timsta","id","wer") VALUES ('8', 'Temperatur', '1394060400', '1394090407107', '12f08947-7af1-43a1-a9ff-be3cd3c6f6a0', '908');
INSERT INTO "messwert" ("senid","nam","day","timsta","id","wer") VALUES ('8', 'Temperatur', '1394060400', '1394090410019', '643ee3a9-a939-471a-8629-721e20a63f0f', '909');

The file has about 160MB of those statements. I am running this for performance tests.

Now I am seeing very slow performance compared to PHP or Java, and the longer the script runs the more memory it uses. Is there some kind of memory leak in "cql-rb" or am I doing something wrong?

Any hints?

iconara commented 10 years ago

This is a duplicate of #93

You're using prepared statements wrong. Preparing every statement like this means, besides the memory leak (whether it's a leak is a matter of definition, see #93), that each request needs two round trips instead of one. In other words, it's probably twice as slow as it needs to be.

The right way to do this is to replace the db.prepare(cql).execute with just db.execute(cql).
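A minimal sketch of that change, assuming db is an already connected client as in the question. The helper name run_statements is made up for illustration and is not part of cql-rb; it only relies on db responding to execute(cql).

```ruby
# Executes each CQL line directly, with one round trip per statement.
# `db` is assumed to be a connected client that responds to #execute(cql);
# `lines` is any enumerable of strings, e.g. File.foreach(filename).
# `run_statements` is a hypothetical helper name, not part of cql-rb.
def run_statements(db, lines)
  executed = 0
  lines.each do |line|
    cql = line.strip
    next if cql.empty?

    begin
      db.execute(cql)          # no db.prepare: the statement is sent as-is
      executed += 1
    rescue StandardError => e  # deliberately not `rescue Exception`
      warn cql
      warn e.message
    end
  end
  executed                     # number of statements that succeeded
end
```

With the setup from the question this would be called as run_statements(db, File.foreach(filename)).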

If you want it to go even faster (that change alone could double your performance) you should build batches. If you're using cql-rb v1.2 you can smash strings together and send CQL batches, do something like this:

buffer = "BEGIN BATCH\n"
statements = 0
IO.foreach(path) do |line|
  buffer << line
  statements += 1
  if statements % 100 == 0
    buffer << 'APPLY BATCH'
    db.execute(buffer)
    buffer = "BEGIN BATCH\n"
  end
end
if buffer.size > 12 # the length of "BEGIN BATCH\n"
  buffer << 'APPLY BATCH'
  db.execute(buffer)
end
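The same string batching can also be written with each_slice, which removes the manual counter and the leftover check. This is only a sketch: each_cql_batch is a hypothetical helper name, and it assumes every input line holds one complete statement ending in a semicolon, as in the file shown in the question.

```ruby
# Yields one CQL batch string per `batch_size` input lines.
# `each_cql_batch` is a hypothetical helper, not part of cql-rb.
# `lines` is any enumerable of strings, e.g. IO.foreach(path).
def each_cql_batch(lines, batch_size = 100)
  lines.each_slice(batch_size) do |slice|
    statements = slice.map(&:strip).reject(&:empty?)
    next if statements.empty?  # skip slices that contained only blank lines

    yield "BEGIN BATCH\n#{statements.join("\n")}\nAPPLY BATCH"
  end
end
```

It would be used as each_cql_batch(IO.foreach(path)) { |cql| db.execute(cql) }, which sends the final partial batch without a separate size check.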

If you're using cql-rb 2.0.0pre2 (try it, it's stable, it's a pre-release but it will most likely be the final 2.0) and Cassandra 2.0 you can use the new batch feature:

statements = 0
batch = db.batch
IO.foreach(path) do |line|
  batch.add(line)
  statements += 1
  if statements % 100 == 0
    batch.execute
    batch = db.batch
  end
end
batch.execute

You should also not do rescue Exception. That's a very bad idea in Ruby. Exception in Ruby is like Throwable in Java: rescue Exception also catches interrupts and out-of-memory errors. Just do rescue => e, or, since the only thing you're ever going to get inside of that block is Cql::CqlError, catch only that.
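The difference can be checked in plain Ruby without a database. In this small demo DemoDriverError is a made-up stand-in for a driver error class and caught_by_bare_rescue? is a hypothetical helper. The rescue Exception clause is there only so that the demo itself keeps running.

```ruby
# Stand-in for a driver error; hypothetical, defined here only for the demo.
class DemoDriverError < StandardError; end

# Returns true if a bare `rescue => e` catches an exception of the given class.
def caught_by_bare_rescue?(error_class)
  raise error_class, 'simulated'
rescue => e                # same as `rescue StandardError => e`
  e.is_a?(StandardError)
rescue Exception           # only so the demo itself survives the raise
  false
end

[DemoDriverError, ArgumentError, Interrupt, NoMemoryError].each do |klass|
  puts "#{klass}: #{caught_by_bare_rescue?(klass)}"
end
```

This prints true for DemoDriverError and ArgumentError and false for Interrupt and NoMemoryError, because the latter two do not inherit from StandardError.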

ds1982 commented 10 years ago

Alright. Thanks for your fast answer and the good tips, sorry for double-post.

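a few (off-topic) questions:

Have you compared Ruby with other languages? I am experiencing Java faster than PHP faster than Ruby with "selects". While inserting, PHP is faster than Java and Ruby is slowest (using my old code with correct "prepare"). I will test with batching soon.

In your Readme you state that using JRuby brings better performance. In my case I see no difference between Ruby and JRuby with measurements like the ones above. Am I doing anything else wrong that would prevent better JRuby performance?

Moving the prepare statement outside the loop did not really bring better performance in my select tests (a little bit, but not significant). Could this be correct? (With insert tests like the ones above, performance was better and the memory leak issue was fixed.)

Code for my select Measurements: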

stmt = db.prepare('SELECT "wer", "timsta" from "table" where "senid" = ? and "name" = ? and "day" = ? and "timstad" >= ? and "timsta" <= ? limit 86400;')

while lastdayts >= firstdayts
  selectstart = Time.now.to_f

  begin
    rows=stmt.execute(senid, parvor_pn, lastdayts.to_i.to_s, from_orig, to_orig)
  rescue => msg  
    puts msg  
    exit
  end

  selectend = Time.now.to_f
  selectiontime += (selectend - selectstart)
  lastdayts = (lastdayts.to_date-1).to_time
end

iconara commented 10 years ago

I haven’t done any benchmarking against drivers for other languages. It’s way too hard to make meaningful: how do you make sure that you’re not just measuring the performance of Cassandra?

PHP might be faster than Ruby for batch importing while Java is better at selects with big results. I don’t know, and changing some small variable in the setup could give completely different results. You have to measure yourself with your workload, but you can’t assume that what holds for one use case holds for all.

JRuby is fast in general, but if your program is spending 99% of the time waiting for Cassandra and the network it doesn’t matter which Ruby you choose. With JRuby you can multithread your program and parallelize the CPU-bound parts, but if there are no CPU-bound parts it only adds overhead.

If you want even more performance than the batching you need to go async. There is an async API behind the regular API, and with that you can pipeline your requests. Pipelining means that you send a request and then immediately send the next and the next, handling the responses out-of-band. What happens when you use a non-async API is that your program blocks while the network and Cassandra work. This is just wasted time; the program should be doing useful work!

You’ll have to look at the code and the tests to see how to use the async API. It’s experimental – in the sense that it’s not guaranteed to be backwards compatible from version to version, but since it’s the core of the driver it’s not experimental in terms of stability, correctness or quality.
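The async API itself is not shown in this thread, so the following only illustrates the general idea of overlapping the waiting time. It uses plain Ruby threads and a simulated round trip (fake_request, a made-up stand-in) rather than the driver's own mechanism, and all names in it are hypothetical.

```ruby
# Stand-in for one network round trip; NOT part of cql-rb.
def fake_request(statement, latency = 0.01)
  sleep(latency)               # simulates waiting for the network and Cassandra
  "ok: #{statement}"
end

# Blocking style: each request waits for the previous response.
def run_blocking(statements)
  statements.map { |s| fake_request(s) }
end

# Overlapped style: up to `in_flight` requests wait at the same time.
# Results come back in the same order as the input statements.
def run_overlapped(statements, in_flight = 10)
  statements.each_slice(in_flight).flat_map do |slice|
    threads = slice.map { |s| Thread.new { fake_request(s) } }
    threads.map(&:value)       # Thread#value joins and returns the block result
  end
end
```

Both methods return the same responses. The overlapped version spends roughly one simulated latency per slice instead of one per statement, which is the effect pipelining aims for.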

On 12 May 2014, at 19:56, ds1982 wrote:

Have you compared Ruby with other languages? I am experiencing Java faster than PHP faster than Ruby with "selects". While inserting, PHP is faster than Java and Ruby is slowest (using my old code with correct "prepare"). I will test with batching soon.

In your Readme you state that using JRuby brings better performance. In my case I see no difference between Ruby and JRuby with measurements like the ones above. Am I doing anything else wrong that would prevent better JRuby performance?

Moving the prepare statement outside the loop did not really bring better performance in my select tests (a little bit, but not significant). Could this be correct? (With insert tests like the ones above, performance was better and the memory leak issue was fixed.)

ds1982 commented 10 years ago

how do you make sure that you’re not just measuring the performance of Cassandra?

I think in this case all measurements should be equal.

You have to measure yourself with your workload, but you can’t assume that what holds for one use case holds for all.

Alright. This is what I wanted to hear. In my case I just want to give some recommendations on which driver to use for our use case and setup.

If you want even more performance than the batching you need to go async.

I will test this, too

I think we can close this issue. Thanks for your good support!

iconara commented 10 years ago

My pleasure, if you need any more help just open another issue. It doesn’t have to be about something not working, I’m happy to help.
