graphite-project / carbon

Carbon is one of the components of Graphite, and is responsible for receiving metrics over the network and writing them down to disk using a storage backend.
http://graphite.readthedocs.org/
Apache License 2.0
1.51k stars 490 forks source link

[DISCUSS] release threads between writer runs #817

Closed DanCech closed 4 years ago

DanCech commented 6 years ago

I have received a report of the writer thread stopping, which results in the metric cache growing and nothing being written to disk.

I'm not sure what the underlying cause is, though it's possible that accessing reactor.running from the background thread is part of the issue. I also noted that when the thread calls time.sleep() between runs it is blocking for no good reason.

This PR modified the writeForever and writeTagsForever functions to run as regular twisted async functions, using reactor.callLater() to schedule their next run, and threads.deferToThread to call writeCachedDataPoints and writeTags in threads from the twisted threadpool. This avoids the need to access reactor from those threads, and leaves scheduling in the main thread.

The only behavior change this should introduce is that I set the wait time after draining the queue to 1s for writeTagsForever to match writeForever, though I do wonder whether it would make more sense to shorten that.

I have not yet done any load testing on this, and would appreciate any feedback.

codecov-io commented 6 years ago

Codecov Report

Merging #817 into master will increase coverage by 0.02%. The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #817      +/-   ##
==========================================
+ Coverage   49.62%   49.65%   +0.02%     
==========================================
  Files          37       37              
  Lines        3434     3432       -2     
  Branches      494      492       -2     
==========================================
  Hits         1704     1704              
+ Misses       1621     1619       -2     
  Partials      109      109              
Impacted Files Coverage Δ
lib/carbon/writer.py 0.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update dfda3d4...a0e55db. Read the comment docs.

deniszh commented 6 years ago

Well, it should work. Not sure how we can test that properly, though. I'm working on cluster implementation on docker compose, maybe I can add it there, and test it using haggar...

piotr1212 commented 6 years ago

Seems like a good thing to do. I was already wondering about the 1 second sleep before, seems like a bit long.

DanCech commented 6 years ago

@deniszh we should figure out how to test this for 1.1.5

deniszh commented 6 years ago

Yes, still working on that. Docker image is configurable now, so, I'm gonna create docker compose setup with haggar and test how it works.

deniszh commented 5 years ago

Ok, benchmarking is hard. :) I did a small test of this patch vs 1.1.4 using this setup - https://github.com/graphite-project/graphite-test-docker (MBP 2017 / 4CPU / 4G in Docker, 10 clients x 1000 metrics every 10 seconds, run for 1 hour). I did second test for a couple of hours of so, so it has some graphs higher, so, probably 1 hour is too small for stabilize parameters. 1.1.4:

screenshot 2018-11-04 at 10 55 33 screenshot 2018-11-04 at 10 55 45 screenshot 2018-11-04 at 10 55 52

This patch:

screenshot 2018-11-04 at 15 05 37 screenshot 2018-11-04 at 15 05 48 screenshot 2018-11-04 at 15 05 57

I.e. on the first sight it works fine. You can use the same repo to run own tests. /cc @DanCech

deniszh commented 5 years ago

OK, what we're gonna do with that? go / no-go? @DanCech @piotr1212 @iksaif ?

piotr1212 commented 5 years ago

Your graphs look different between runs but your conclusion is that it looks good? cpu, load, number of metrics in cache, memory used are all higher.

I don't expect this to have any influence on performance but don't have time to test this right now.

deniszh commented 5 years ago

The second graph is after 3-4 hours and first - after the first hour. As I said benchmarking is hard and I don't even try to compare old and new code performance-wise, only provide (quite weak) proof that this PR is not horribly blown in your face.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

codecov-commenter commented 4 years ago

Codecov Report

Merging #817 into master will increase coverage by 0.02%. The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #817      +/-   ##
==========================================
+ Coverage   49.62%   49.65%   +0.02%     
==========================================
  Files          37       37              
  Lines        3434     3432       -2     
  Branches      494      492       -2     
==========================================
  Hits         1704     1704              
+ Misses       1621     1619       -2     
  Partials      109      109              
Impacted Files Coverage Δ
lib/carbon/writer.py 0.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update dfda3d4...a0e55db. Read the comment docs.

DanCech commented 4 years ago

Heh, I'd forgotten about this one. Seems like it would still be a good idea

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.