go-graphite / carbon-clickhouse

Graphite metrics receiver with ClickHouse as storage
MIT License

panic: runtime error #60

Closed Secathor closed 3 years ago

Secathor commented 4 years ago

Hello,

Just today we encountered an issue which causes carbon-clickhouse to exit. After it stops working, we have to manually remove the cache where carbon-clickhouse stores its temporary files. After that procedure, it may work for a few minutes or even 1-2 hours, but the problem appears again and again.

What may cause this strange behaviour? Version 0.11.0; we also tested the version from master, same problem :(


```
May 13 09:21:44 hostname carbon-clickhouse[38105]: panic: runtime error: index out of range [3] with length 0
May 13 09:21:44 hostname carbon-clickhouse[38105]: goroutine 12 [running]:
May 13 09:21:44 hostname carbon-clickhouse[38105]: encoding/binary.littleEndian.PutUint32(...)
May 13 09:21:44 hostname carbon-clickhouse[38105]:     /usr/lib/go-1.13/src/encoding/binary/binary.go:68
May 13 09:21:44 hostname carbon-clickhouse[38105]: github.com/lomik/carbon-clickhouse/helper/RowBinary.(*WriteBuffer).WriteUint32(...)
May 13 09:21:44 hostname carbon-clickhouse[38105]:     /root/carbon-clickhouse/_vendor/src/github.com/lomik/carbon-clickhouse/helper/RowBinary/write_buffer.go:143
May 13 09:21:44 hostname carbon-clickhouse[38105]: github.com/lomik/carbon-clickhouse/uploader.(*Index).parseFile(0xc0000100b0, 0xc016c0dce0, 0x5d, 0xb5ac20, 0xc0000ce640, 0x0, 0x0, 0x0)
May 13 09:21:44 hostname carbon-clickhouse[38105]:     /root/carbon-clickhouse/_vendor/src/github.com/lomik/carbon-clickhouse/uploader/index.go:116 +0xe4f
May 13 09:21:44 hostname carbon-clickhouse[38105]: github.com/lomik/carbon-clickhouse/uploader.(*cached).upload(0xc00040fec0, 0xb69f60, 0xc000022c80, 0xc00056f380, 0xc016c0dce0, 0x5d, 0xc, 0x0)
May 13 09:21:44 hostname carbon-clickhouse[38105]:     /root/carbon-clickhouse/_vendor/src/github.com/lomik/carbon-clickhouse/uploader/cached.go:80 +0x2c2
May 13 09:21:44 hostname carbon-clickhouse[38105]: github.com/lomik/carbon-clickhouse/uploader.(*Base).uploadWorker(0xc00002c300, 0xb69f60, 0xc000022c80)
May 13 09:21:44 hostname carbon-clickhouse[38105]:     /root/carbon-clickhouse/_vendor/src/github.com/lomik/carbon-clickhouse/uploader/base.go:155 +0x32e
May 13 09:21:44 hostname carbon-clickhouse[38105]: github.com/lomik/carbon-clickhouse/uploader.(*Base).Start.func1.1(0xb69f60, 0xc000022c80)
May 13 09:21:44 hostname carbon-clickhouse[38105]:     /root/carbon-clickhouse/_vendor/src/github.com/lomik/carbon-clickhouse/uploader/base.go:137 +0x3e
May 13 09:21:44 hostname carbon-clickhouse[38105]: github.com/lomik/carbon-clickhouse/helper/stop.(*Struct).StartFunc.func1.1(0xc0002cc470, 0xb69f60, 0xc000022c80, 0xc00002c300)
May 13 09:21:44 hostname carbon-clickhouse[38105]:     /root/carbon-clickhouse/_vendor/src/github.com/lomik/carbon-clickhouse/helper/stop/stop.go:49 +0x3a
May 13 09:21:44 hostname carbon-clickhouse[38105]: created by github.com/lomik/carbon-clickhouse/helper/stop.(*Struct).StartFunc.func1
May 13 09:21:44 hostname carbon-clickhouse[38105]:     /root/carbon-clickhouse/_vendor/src/github.com/lomik/carbon-clickhouse/helper/stop/stop.go:48 +0x94
May 13 09:21:44 hostname systemd[1]: carbon-clickhouse.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
May 13 09:21:44 hostname systemd[1]: carbon-clickhouse.service: Failed with result 'exit-code'
```
Secathor commented 4 years ago

Hi again,

I would like to add additional info: the problem started on two different clusters at the same time. It looks like users (or their scripts, etc.) may be doing something nasty.

Unfortunately I wasn't able to catch anything suspicious yet.

Last but not least, while carbon-clickhouse cannot start, clickhouse.err.log shows that it keeps trying to upload something into ClickHouse, but it just can't:


```
2020.05.13 19:25:55.430312 [ 333 ] {7f2763bb-1d62-4e0c-b794-1bd7e7d7bc51} <Error> executeQuery: Code: 33, e.displayText() = DB::Exception: Cannot read all data. Bytes read: 139. Bytes expected: 162. (version 19.17.4.11 (official build)) (from [::ffff:127.0.0.1]:48256) (in query: INSERT INTO graphite.reverse_distributed (Path, Value, Time, Date, Timestamp) FORMAT RowBinary ), Stack trace:

0. 0x56159052ef60 StackTrace::StackTrace() /usr/bin/clickhouse
1. 0x56159052ed35 DB::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) /usr/bin/clickhouse
2. 0x561590240c82 ? /usr/bin/clickhouse
3. 0x561593d64b46 DB::BinaryRowInputFormat::readRow(std::vector<COW<DB::IColumn>::mutable_ptr<DB::IColumn>, std::allocator<COW<DB::IColumn>::mutable_ptr<DB::IColumn> > >&, DB::RowReadExtension&) /usr/bin/clickhouse
4. 0x561594178e21 DB::IRowInputFormat::generate() /usr/bin/clickhouse
5. 0x561593d3866e DB::ISource::work() /usr/bin/clickhouse
6. 0x561593d0fa25 DB::InputStreamFromInputFormat::readImpl() /usr/bin/clickhouse
7. 0x56159378501a DB::IBlockInputStream::read() /usr/bin/clickhouse
8. 0x561593a19c48 DB::InputStreamFromASTInsertQuery::readImpl() /usr/bin/clickhouse
9. 0x56159378501a DB::IBlockInputStream::read() /usr/bin/clickhouse
10. 0x5615937a350b DB::copyData(DB::IBlockInputStream&, DB::IBlockOutputStream&, std::atomic<bool>*) /usr/bin/clickhouse
11. 0x561593a18d13 DB::executeQuery(DB::ReadBuffer&, DB::WriteBuffer&, bool, DB::Context&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>) /usr/bin/clickhouse
12. 0x5615905c3dee DB::HTTPHandler::processQuery(Poco::Net::HTTPServerRequest&, HTMLForm&, Poco::Net::HTTPServerResponse&, DB::HTTPHandler::Output&) /usr/bin/clickhouse
13. 0x5615905c6169 DB::HTTPHandler::handleRequest(Poco::Net::HTTPServerRequest&, Poco::Net::HTTPServerResponse&) /usr/bin/clickhouse
14. 0x5615944b80e9 Poco::Net::HTTPServerConnection::run() /usr/bin/clickhouse
15. 0x5615944b4660 Poco::Net::TCPServerConnection::start() /usr/bin/clickhouse
16. 0x5615944b4d7d Poco::Net::TCPServerDispatcher::run() /usr/bin/clickhouse
17. 0x561595b8aef1 Poco::PooledThread::run() /usr/bin/clickhouse
18. 0x561595b88c9c Poco::ThreadImpl::runnableEntry(void*) /usr/bin/clickhouse
19. 0x5615962fee00 ? /usr/bin/clickhouse
20. 0x7f34c6e7f6db start_thread /lib/x86_64-linux-gnu/libpthread-2.27.so
21. 0x7f34c679c88f clone /lib/x86_64-linux-gnu/libc-2.27.so
```
Secathor commented 4 years ago

The most important thing is that both setups (both clusters) had been working well, really well, for the past 6 months.

Secathor commented 4 years ago

I want to post an update here: we managed to find what caused this error in our environment. Our architecture is StatsD -> carbon-c-relay -> carbon-clickhouse. Briefly, a user was sending very long metrics to StatsD; somehow these metrics arrived in carbon-clickhouse without a value and timestamp, and in the end the metric names were merged one after another.

Like this:

Normal behaviour:

metric.name.one 666 timestamp metric.name.two 777 timestamp

We received:

metric.name.one.metric.name.two (these metric names were 6KB+ long, without a value or timestamp, when we checked the carbon-clickhouse caches).

We worked around it by reducing the maximum allowed metric size (the -m parameter in carbon-c-relay) to a more realistic value.

bzed commented 4 years ago

We are seeing the same issue here; it started to happen after upgrading telegraf to the latest version (which sends data in the graphite format here).

franek1709 commented 3 years ago

I encountered a similar issue last week. It looks like metrics with long names (especially names consisting of many levels) can cause panics when the uploaders write these metrics. It was very easy to reproduce this issue with carbon-clickhouse running in Docker.

I looked at the code and now I know the cause: the uploader uses WriteBuffer objects for writing metrics. The content to write is stored in a byte array called Body (the maximum size of this array is 524288 bytes). The metric that causes the problem in my case has a very long name (about 9KB, with 258 levels). At first glance 512KB should be enough to store the data for a 9KB metric. Unfortunately, for each metric fetched from the binary file, multiple entries are written into the WriteBuffer object.

To make things clear, here is an example for the metric our.quite.short.metric.for.carbon.clickhouse. It is written into the WriteBuffer 9 times:

  1. our.quite.short.metric.for.carbon.clickhouse
  2. clickhouse.carbon.for.metric.short.quite.our
  3. our.quite.short.metric.for.carbon.clickhouse
  4. our.quite.short.metric.for.carbon.
  5. our.quite.short.metric.for.
  6. our.quite.short.metric.
  7. our.quite.short.
  8. our.
  9. clickhouse.carbon.for.metric.short.quite.our 

For the purposes of this example I intentionally omitted the additional data written to the WriteBuffer along with the metric name; they are not relevant to this issue.

As the example shows, one quite short metric was written to the WriteBuffer 9 times. My long metric needs about 260 WriteBuffer writes. As the name of my metric is very long (9KB), the allocated buffer is not enough to hold all this data. The panic is thrown before the first write operation that would exceed the allocated buffer.

In our case it was simply an error on the client side, and we have already asked the client to fix it. Nonetheless, I think even such cases should not crash carbon-clickhouse. To make it worse, the only fix I currently know of is deleting the files containing the metrics that caused the crash, and deleting such files loses the other, correct metrics. In some cases even file deletion will not fix the problem for long: if such metrics are sent periodically by statsd, the problem recurs after a few minutes.

After finding the root cause, I decided to prepare a fix. I created a simple mechanism for dropping metrics with names longer than a threshold specified per metric listener (similar to the way carbon-clickhouse drops old or future metrics). I will create a pull request proposing this change.

Felixoid commented 3 years ago

Please check if it's fixed in the current master.

franek1709 commented 3 years ago

I just finished my tests and I confirm the fix works as planned. Can you create a new release with my commit?

Felixoid commented 3 years ago

I'll try to release it in the following week.