openark / orchestrator

MySQL replication topology management and HA
Apache License 2.0

Bug report: InstanceWriteBufferSize can't be more than about 1000 #1203

Open MOON-CLJ opened 4 years ago

MOON-CLJ commented 4 years ago


The most relevant configuration settings:

InstanceWriteBufferSize=10000
BufferInstanceWrites=true
InstanceFlushIntervalMilliseconds=10000

The database_service table's records should be updated every InstanceFlushIntervalMilliseconds.

However, the records haven't been updated for a long time.

2020-07-03 22:31:46 ERROR goroutine 22 flushInstanceWriteBuffer: 2020-07-03 22:31:46 ERROR goroutine 22 flushInstanceWriteBuffer last_seen: Error 1390: Prepared statement contains too many placeholders
2020-07-03 22:31:56 ERROR goroutine 22 Error 1390: Prepared statement contains too many placeholders
2020-07-03 22:31:56 ERROR goroutine 22 flushInstanceWriteBuffer last_seen: Error 1390: Prepared statement contains too many placeholders
2020-07-03 22:31:56 ERROR goroutine 22 flushInstanceWriteBuffer: 2020-07-03 22:31:56 ERROR goroutine 22 flushInstanceWriteBuffer last_seen: Error 1390: Prepared statement contains too many placeholders
2020-07-03 22:32:07 ERROR goroutine 22 Error 1390: Prepared statement contains too many placeholders
2020-07-03 22:32:07 ERROR goroutine 22 flushInstanceWriteBuffer last_seen: Error 1390: Prepared statement contains too many placeholders
2020-07-03 22:32:07 ERROR goroutine 22 flushInstanceWriteBuffer: 2020-07-03 22:32:07 ERROR goroutine 22 flushInstanceWriteBuffer last_seen: Error 1390: Prepared statement contains too many placeholders

After a lot of debugging, we found the above errors in our logs.

So the problem is that InstanceWriteBufferSize=10000 is so big that mkInsertOdku runs into Error 1390. After a quick search (https://stackoverflow.com/a/24447922), there is a limit of 65535 prepared statement placeholders.

We suggest checking that InstanceWriteBufferSize does not exceed about 1000 when loading the configuration, because one instance has about 64 columns/placeholders.
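As a rough sketch of the suggested check (not orchestrator's actual configuration code; the struct, the helper name, and the placeholders-per-row figure are illustrative assumptions), the cap could be applied right after the configuration is loaded:

package config

import "log"

// Configuration holds only the field relevant to this sketch.
type Configuration struct {
	InstanceWriteBufferSize int
}

const (
	maxPlaceholders         = 65535 // MySQL's per-statement placeholder limit
	placeholdersPerInstance = 70    // approximate columns written per instance row
)

// capInstanceWriteBufferSize would run right after the config file is parsed.
func (c *Configuration) capInstanceWriteBufferSize() {
	maxBufferSize := maxPlaceholders / placeholdersPerInstance // 936
	if c.InstanceWriteBufferSize > maxBufferSize {
		log.Printf("InstanceWriteBufferSize=%d exceeds what one statement can hold; capping to %d",
			c.InstanceWriteBufferSize, maxBufferSize)
		c.InstanceWriteBufferSize = maxBufferSize
	}
}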

@shlomi-noach cc @yangeagle @sjmudd

sjmudd commented 4 years ago

Yes, you are right. This looks related to https://github.com/mysql/mysql-server/blob/8.0/sql/sql_prepare.cc#L1422 which implies that the maximum number of placeholders is 65535.

Each insert on duplicate key update statement uses 70 placeholders per instance, meaning that 10,000 x 70 goes well over the limit.
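To make the arithmetic concrete, here is a small standalone sketch (simplified, not the actual mkInsertOdku code; the table and column names are made up) of how a batched INSERT ... ON DUPLICATE KEY UPDATE accumulates placeholders as rows x columns:

package main

import (
	"fmt"
	"strings"
)

// buildBatchInsert returns a multi-row INSERT ... ON DUPLICATE KEY UPDATE
// statement and the number of ? placeholders it contains (rows x columns).
func buildBatchInsert(table string, columns []string, rows int) (string, int) {
	rowTuple := "(" + strings.TrimSuffix(strings.Repeat("?,", len(columns)), ",") + ")"
	tuples := make([]string, rows)
	for i := range tuples {
		tuples[i] = rowTuple
	}
	query := fmt.Sprintf("INSERT INTO %s (%s) VALUES %s ON DUPLICATE KEY UPDATE %s=VALUES(%s)",
		table, strings.Join(columns, ","), strings.Join(tuples, ","), columns[0], columns[0])
	return query, rows * len(columns)
}

func main() {
	columns := make([]string, 70) // ~70 columns written per instance
	for i := range columns {
		columns[i] = fmt.Sprintf("col%d", i)
	}
	_, n := buildBatchInsert("database_instance", columns, 10000)
	fmt.Println(n) // 700000: far beyond the 65,535 limit, hence Error 1390
	_, n = buildBatchInsert("database_instance", columns, 936)
	fmt.Println(n) // 65520: the largest batch that still fits
}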

To give you some context, in my environment I'm using:

Clearly you need to adjust these parameters based on the number of servers you are polling and how frequently you poll them.

I'll agree that the code could be better here and could potentially limit InstanceWriteBufferSize to, say, 936, which seems to be the implied limit at the moment (65,535 / 70 ≈ 936).

The other option would be to avoid using prepared statements, which converts the two round trips for an insert (prepare, then execute) into one. I thought the code I was using was doing that (and we'd pushed it upstream) but I seem to be mistaken, so I need to check. This reduces the number of queries by a factor of 2 (and the latency), and it would also remove this limit, as no parameters would be sent to the server, so it has other potential benefits. @luisyonaldo something for us to check.

For now my suggestion would be to adapt the parameters you are using to values which are perhaps more appropriate. The main one being to reduce your flush time as many other queries are made to the orchestrator database and they'll be querying data which is up to 10 seconds stale. That's likely to cause all sorts of issues.

sjmudd commented 4 years ago

Oh. I see this from a commit 2 years ago:

commit 5162ac9bb84bac18c9bfb55501c0633e260627e4
Author: Shlomi Noach <shlomi-noach@github.com>
Date:   Mon Jan 8 11:44:17 2018 +0200

    Using interpolateParams everywhere; not preparing statements anywhere anymore

which implies that no prepared statements should be used at all.

This does not match the error statement you were seeing so either something got missed or you are using code which is quite old. What version of orchestrator are you using? That's something you did not report and seems to be relevant here.

sjmudd commented 4 years ago

Also check: https://github.com/openark/orchestrator/blob/master/go/db/db.go#L55, which shows that local client-side interpolation of parameters should be enabled. That implies MySQL should never see the ? placeholders and so should not complain about too many placeholders being used.
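For reference, this is roughly what that flag does on the client side (a minimal standalone sketch, not orchestrator's code; the DSN credentials and the table name are placeholders):

package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// interpolateParams=true tells go-sql-driver/mysql to substitute the ?
	// arguments client-side and send one plain query, so the server never
	// receives COM_STMT_PREPARE and cannot raise Error 1390.
	dsn := "orc_user:orc_pass@tcp(127.0.0.1:3306)/orchestrator?interpolateParams=true"
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Sent as a single interpolated statement, not prepare + execute.
	if _, err := db.Exec("UPDATE some_table SET last_seen = NOW() WHERE hostname = ?", "db1.example.com"); err != nil {
		log.Fatal(err)
	}
}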

I've not looked at this code for a long time, nor have I been following Shlomi's updates upstream. So took a while to verify.

MOON-CLJ commented 4 years ago

@sjmudd we rebased on the latest release, v3.14, and there shouldn't be any changes related to interpolateParams; we'll check again!


MOON-CLJ commented 4 years ago

For now my suggestion would be to adapt the parameters you are using to values which are perhaps more appropriate.

Thanks, we're going to do that.

I'd be tempted to reduce InstanceWriteBufferSize. Do you have 10,000 hosts or more?

Yes, we have more than 10,000 hosts.

Also InstanceFlushIntervalMilliseconds looks much too high. Why are you flushing only every 10 seconds?

Right now we only have failure detection on; when we enable topology recovery, we will adjust these params.

sjmudd commented 4 years ago

OK. So odd. Clearly something "seems to be lying"; I'm not sure what, but this is clearly coming from MySQL, given the 1390 error message.

MOON-CLJ commented 4 years ago

We can enable the MySQL general log on Monday and observe it, but not over the weekend because that wouldn't be compliant. :)

sjmudd commented 4 years ago

That's a nice setup of yours! I've seen few people share that they monitor more than 10,000 hosts under orchestrator.

I still think that you need to write the instance information more frequently. The SQL code that checks for failures uses state that is in the database. It runs every few seconds. If you've not flushed the data to the database it will be checking against old, potentially invalid data, so may not recognise failure scenarios properly. That would not be good.

I assume you're also using GTID on the hosts being monitored and also using a MySQL backend. If you use MySQL, then consider trying 8.0.20. I've seen binlog space reduce significantly with 8.0.20 as there's a new parameter to compress binlogs. When used with orchestrator, for me this reduced the 24h binlog size from about 420 GB to 36 GB with minimal RBR, a huge decrease in size.

There are multiple ways to set up the backend orchestrator database, so your setup may be different, but if you are using 8.0, then trying out 8.0.20 with binlog compression is really worthwhile. Note: most of what I've seen triggering the huge binlog space is the GTID information of the monitored hosts. In my environment the masters of the clusters may change quite frequently, for reasons including upgrades, maintenance windows on servers in each of the clusters being monitored, etc., so gtid_executed may contain a large uuid list. This is stored by orchestrator for each host and it adds up. The compressed binlogs help a huge amount, and also reduce i/o on the backend mysql server used by orchestrator.

sjmudd commented 4 years ago

Thanks. If you can confirm whether you're seeing placeholders on the orchestrator backend, that would be useful. I'd guess you are, but at least it would allow us to narrow down where this is coming from, as it's not expected (by me). Perhaps @shlomi-noach has ideas, as this falls outside the direct scope of the original code that might have been assumed to trigger this.

MOON-CLJ commented 4 years ago

I still think that you need to write the instance information more frequently. The SQL code that checks for failures uses state that is in the database. It runs every few seconds. If you've not flushed the data to the database it will be checking against old, potentially invalid data, so may not recognise failure scenarios properly. That would not be good.

Thanks for your advice; we'll later set both InstanceFlushIntervalMilliseconds and InstancePollSeconds to 3s.
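In the same notation as the original report, that would be (intended values, not yet applied; InstanceFlushIntervalMilliseconds is expressed in milliseconds):

InstancePollSeconds=3
InstanceFlushIntervalMilliseconds=3000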

That's a nice setup of yours! I've seen few people share that they monitor more than 10,000 hosts under orchestrator.

Thanks to @shlomi-noach, you, and the other contributors for the great work on orchestrator.

I assume you're also using GTID on the hosts being monitored and also using a MySQL backend. If you use MySQL, then consider trying 8.0.20.

Thanks. Currently we use 5.7.26 and haven't started using 8.0 yet; your suggestion is one more reason for us to move to 8.0.

MOON-CLJ commented 4 years ago

We can enable the MySQL general log on Monday and observe it, but not over the weekend because that wouldn't be compliant. :)

Right now, here are the results of this query from one backend:

SHOW GLOBAL STATUS LIKE 'com_%prepare%';
+--------------------+--------+
| Variable_name      | Value  |
+--------------------+--------+
| Com_prepare_sql    | 0      |
| Com_stmt_prepare   | 240661 |
| Com_xa_prepare     | 0      |
| Com_stmt_reprepare | 0      |
+--------------------+--------+
4 rows in set (0.00 sec)

Only orchestrator and our cron job script (written in Python) query the backend. As you probably know, Python's MySQL driver doesn't generally use prepared statements. Anyway, we'll confirm with the MySQL general log on Monday.

sjmudd commented 4 years ago

My backend, which has been up 9 days, says:

+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| Com_prepare_sql    | 0     |
| Com_stmt_prepare   | 38    |
| Com_xa_prepare     | 0     |
| Com_stmt_reprepare | 0     |
+--------------------+-------+
4 rows in set (0.01 sec)

The difference looks suspicious and I suspect this comes from ad hoc requests outside of orchestrator.

FWIW the upgrade to 8.0 was completely painless for me.

If you are using slaves then upgrade a slave to 8.0.20 and switch over to it. That reduces downtime. Should that fail, you can easily switch back to the original backend and any lost writes are not really a big deal.

Then enabling binlog_transaction_compression = ON (the default is OFF) can be done online, so that's a trivial change.
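For example (a sketch of one way to apply it online from a small Go program, assuming a go-sql-driver connection to the backend; the credentials and hostname are placeholders, and the same statement can of course be run from the mysql client instead):

package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "admin:secret@tcp(orchestrator-backend.example.com:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// SET PERSIST (8.0+) applies the change immediately and keeps it across restarts;
	// SET GLOBAL would apply it online for the current run only.
	if _, err := db.Exec("SET PERSIST binlog_transaction_compression = ON"); err != nil {
		log.Fatal(err)
	}
}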

I assume you're aware of the setting to temporarily disable global recoveries (disable-global-recoveries), which makes orchestrator maintenance much easier. It works from the web or CLI interfaces. That's proved very useful for us.

And looking at Com_insert and Com_update I see these sum up to tens of millions.

sjmudd commented 4 years ago

You mentioned setting InstancePollSeconds to 3 seconds. That leads me to a goal of mine: moving the logic that recognises failures into memory on the active orchestratorapp instance, so that, with the continual updates of state changes from the instances being monitored, it would be able to recognise a failure event instantly and act immediately. This would be much faster than polling every X seconds, so it would be much better, but it requires a complete rewrite of the failure-detection logic to run entirely in memory and be event-driven.

Also I’d like to ask: I’ve seen latency issues when the monitored instances are in different AZs. The latency of polling the MySQL instances in a local AZ vs a remote AZ can be quite noticeable and under large failure scenarios this can be problematic.

It sounds like you still don’t have this running live in production, but have you noticed this too?

Another thing I see as being useful, if you have more than one orchestratorapp, is to have them all share the polling between them, ideally each polling mainly local MySQL instances, as this should improve performance considerably. Do you run with more than one orchestratorapp running at once? I assume that your MySQL servers are spread over multiple AZs. Is that correct?

MOON-CLJ commented 4 years ago

This would be much faster than polling every X seconds, so it would be much better, but it requires a complete rewrite of the failure-detection logic to run entirely in memory and be event-driven.

"poll" => "epoll" should be nice。

It sounds like you still don’t have this running live in production, but have you noticed this too?

Yes, we have multiple data centers, but due to this issue we have lots of "Problem: not recently checked". Only once this issue is resolved can we observe the "latency of polling the MySQL instances in a local AZ vs a remote AZ" further.

Do you run with more than one orchestratorapp running at once?

Do you mean multiple orchestrator "clusters"? Yes, that's planned because of cross-region needs, but not currently.

shlomi-noach commented 4 years ago

Hi, as @sjmudd commented, interpolateParams=true is hard coded for all connections, and orchestrator should not create any prepared statements. I know for a fact this works (worked) well at my last large setup. I wonder if performance_schema can validate that on your side (I'm not sure what table would show the data).

sjmudd commented 4 years ago

"poll" => "epoll" should be nice。

No, you misunderstand me. This is not about system calls; this is about when orchestrator checks whether things are broken. This happens periodically, but ideally each instance check result should allow orchestrator to determine whether something has broken.

... we have lots of "Problem: not recently checked". ...

This is because your flush interval is much too high. Reduce this to something sensible and the data will be stored in the backend database earlier and you won't see these errors any more.

If you are not doing it already, please look at the following:

In many cases you may expect things to be absolutely uniform, but I've seen in my environment that there can be quite a range in the metrics being monitored. If you're not monitoring these things you won't even be aware whether orchestrator is behaving correctly or not.

@MOON-CLJ I'm not sure this is the best place to discuss some of these details. If you want to contact me directly I can probably provide a few more details to help you get setup.

sjmudd commented 4 years ago

I'll share a few of the graphs from my system. All metric values have been removed for obvious reasons but it should give you an idea of the sort of things to be looking at.

[Ten screenshots of monitoring graphs from 2020-07-05, metric values removed]

If you monitor these values over time, especially if you make changes to the orchestrator configuration, then you can see how it affects things. You should also be able to see when the backend gets saturated or orchestrator is having more trouble than normal. Scanning the orchestrator logs is not really feasible and the front-end is not intended to show you this amount of detail as these things become more important as the monitored fleet grows.

I hope this helps and gives you some more insight into how to see what orchestrator is doing and what to monitor.

sjmudd commented 4 years ago

Do you mean multiple orchestrator "clusters"? Yes, that's planned because of cross-region needs, but not currently.

No, I meant do you run more than one orchestratorapp against the same orchestrator backend? I do this as it ensures I can take down an app without worrying, and also because, should I lose an AZ/DC, other apps are running from different AZ/DCs and can take over immediately, so I won't be affected. I don't have orchestratorapp containerised, though I probably should, but either way having at least one running in each AZ/DC seems like a good idea.

Cross region is more tricky and latency issues will probably bite you more there too. It's good to get orchestrator working first and understand how it behaves before adding some of these extra complexities into the picture.

sjmudd commented 4 years ago

Perhaps this is helpful: https://github.com/openark/orchestrator/pull/1204 ?

MOON-CLJ commented 4 years ago

No, you misunderstand me. This is not about system calls; this is about when orchestrator checks whether things are broken.

I didn't misunderstand you. I put the word in quotes.

Reduce this to something sensible and the data will be stored in the backend database earlier and you won't see these errors any more.

Yes, we will make the changes on a workday.

I hope this helps and gives you some more insight into how to see what orchestrator is doing and what to monitor.

Thank you, it's very kind of you to share these details. We do collect these metrics, and we will look at them when we make the changes. We may share some interesting variations in these graphs.

No, I meant do you run more than one orchestratorapp against the same orchestrator backend?

Okay, we are not. We do use orchestrator/raft mode, with one orchestratorapp running against one MySQL backend in the same data center.

We've been evaluating and using orchestrator for over a year now and only recently started increasing the number of clusters being monitored.

MOON-CLJ commented 4 years ago

Perhaps this is helpful: #1204 ?

Yes, it's helpful. We will gradually adjust these configurations.

MOON-CLJ commented 4 years ago

Here's an update.

Now this is a bit strange: we have the general log enabled, but no prepared statements were found in it. However, we adjusted InstanceWriteBufferSize to 900 and InstanceFlushIntervalMilliseconds to 100; orchestrator's log no longer has "Error 1390" in it, "Problem: not recently checked" is gone, and Com_stmt_prepare no longer grows every 10s.

@sjmudd @shlomi-noach

sjmudd commented 4 years ago

I'm pleased the change in parameters has helped even if it hasn't clarified the cause of the original problem you reported. Does this need us to follow up? I think the intended behaviour is clear. Why you saw what you did is not clear, but I don't have a lot of time to follow up on that now.