outbrain / orchestrator-agent

MySQL replication topology manager - agent (daemon)
Apache License 2.0

NaN undefined disk usage orchestrator agent #12

Closed hagay3 closed 8 years ago

hagay3 commented 8 years ago

Hi. As I use orchestrator to seed nodes from time to time, I've seen an issue with the "disk usage" parameter. It seems that after formatting the partition, it takes orchestrator 15 minutes or so to initialize "disk usage". And with undefined disk usage, seeding a node fails.

Is there a way to trigger initialization of this parameter?

shlomi-noach commented 8 years ago

Can you please do the following experiment: during the 15 minutes after formatting the partition, what happens when you call /api/du (i.e. http://your.agent.host:3001/api/du)? Likewise, what do you get for /api/mysql-du?

Browsing the code, I see no reason why it would take 15 minutes to get the value.

Another thing I'm unsure of from your question:

"after formatting the partition"

Who is doing the formatting? Which formatting is that? Do you mean orchestrator-agent's cleanup of the directory? (unlikely), or some manual step of yours while provisioning a new box?

Please clarify.

shlomi-noach commented 8 years ago

@hagay3 word?

hagay3 commented 8 years ago

I can't reach the API this way; I get an "Invalid token" error: { Code: "ERROR", Message: "Invalid token", Details: null }

shlomi-noach commented 8 years ago

select hostname, token from orchestrator.host_agent

Add &token=... to URL

hagay3 commented 8 years ago

I get "404 page not found". The URL ends like this: hostname:3002/api/mysql-du&token=cb3842f7dd6321d518e5d133134bff6871c90ce75facef279ffab9d3308f6190 (the agent listens on port 3002).

shlomi-noach commented 8 years ago

hostname:3002/api/mysql-du?token=cb3842f7dd6321d518e5d133134bff6871c90ce75facef279ffab9d3308f6190

notice the ?
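
For reference, a minimal Python sketch of building such a URL so that the token always ends up in the query string (after `?`) rather than being appended to the path with `&`. The helper name `build_agent_url` and the placeholder token are illustrative assumptions, not part of orchestrator-agent:

```python
from urllib.parse import urlencode, urlunsplit


def build_agent_url(host: str, port: int, endpoint: str, token: str) -> str:
    """Build an agent API URL with the token passed as a query parameter.

    Hypothetical helper for illustration only.
    """
    query = urlencode({"token": token})  # produces "token=<value>"
    # urlunsplit inserts the "?" between the path and the query string.
    return urlunsplit(("http", f"{host}:{port}", f"/api/{endpoint}", query, ""))


print(build_agent_url("hostname", 3002, "mysql-du", "TOKEN"))
```

Letting the URL library join the path and the query avoids the `&token=` form, which the server treats as part of the path and answers with a 404.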

hagay3 commented 8 years ago

I checked it and it returns

{
Code: "ERROR",
Message: "Invalid token",
Details: null
}

On the other slaves the API works fine. Also, if I wait the 15 minutes it's OK too, and then the space is no longer undefined.
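
A rough Python sketch of telling an error payload like the one above apart from a usable value when scripting this check. The error shape (`Code`, `Message`, `Details`) is taken from the response shown above; the assumption that a successful call returns a bare JSON number is mine and is not confirmed anywhere in this thread:

```python
import json


def classify_agent_response(body: str):
    """Return ("error", message) for an error payload, else ("ok", parsed_value).

    Hypothetical helper; the success format is an assumption.
    """
    payload = json.loads(body)
    if isinstance(payload, dict) and payload.get("Code") == "ERROR":
        return ("error", payload.get("Message"))
    return ("ok", payload)


error_body = '{"Code": "ERROR", "Message": "Invalid token", "Details": null}'
print(classify_agent_response(error_body))
print(classify_agent_response("123456789"))  # assumed success shape: a plain number
```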

shlomi-noach commented 8 years ago

OK, possibly the agent crashes right after the "formatting". At this point, I'd like you to please respond to an earlier question:

Who is doing the formatting? Which formatting is that? Do you mean orchestrator-agent's cleanup of the directory? (unlikely), or some manual step of yours while provisioning a new box? Please clarify.

Also please look at the logs and see whether the agent crashes and restarts.

hagay3 commented 8 years ago

I'm referring to the following cases, in which the error shows up as "undefined disk usage":

Is there a way to trigger the check of the data dir usage?

hagay3 commented 8 years ago

I tried to let orchestrator remove the data, and it failed with an error: Erasing MySQL data on **hostname** Get http://**hostname**:3002/api/delete-mysql-datadir?token=eebe6bdedaf36a8ceb6c1f8d053d7d37411323f0f5dbdd660dd59131799ccd69: net/http: timeout awaiting response headers

shlomi-noach commented 8 years ago

@hagay3 if this is a completely different question please open a new issue.

hagay3 commented 8 years ago

Some servers are stuck on the "NaN undefined" error even after restarting orchestrator-agent and rebooting the server. It seems like it's the same issue. orchestrator-agent log:

2016/04/14 08:01:57 http: multiple response.WriteHeader calls
[martini] Completed 500 Internal Server Error in 6.858183ms
[martini] Started GET /api/mysql-error-log-tail for ***106.77:58857
[martini] Completed 200 OK in 3.682144ms

shlomi-noach commented 8 years ago

@hagay3, I'm sorry, this is getting more and more confusing. Let's try and simplify this. Please answer the following:

  1. Does this only happen on a specific box, and all other boxes behave well?
  2. Please verify and confirm you only have one daemon of orchestrator-agent running on that box. Do sudo ps aux | grep orchestrator-agent
  3. Why are you rm -rf ${DATADIR}? You should probably rm -rf ${DATADIR}/* instead
  4. If you're reporting such an HTTP output such as Erasing MySQL data on **hostname** Get http://**hostname**:3002/api/delete-mysql-datadir?token=eebe6bdedaf36a8ceb6c1f8d053d7d37411323f0f5dbdd660dd59131799ccd69: net/http: timeout awaiting response headers, please make sure to accompany it by actual orchestrator-agent logs of the same timeframe. I really cannot do much with such lacking info.
  5. Some server stuck on "NaN undefined" error -- is that the same server? Another server?
  6. Please avoid sending the martini error-log messages. These are just HTTP pieces of data. Instead, I'm looking for the orchestrator-agent specific entries.
  7. Please confirm your /etc/init.d/orchestrator-agent config file starts orchestrator-agent with --debug. If not, make sure it does.
  8. It would make sense that if you're asking orchestrator-agent to remove data you would get HTTP timeout. HTTP will not wait the 27 minutes it would take to erase a large directory.
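
Regarding item 6, a minimal Python sketch of dropping the `[martini]` HTTP lines from a log excerpt before posting it. It assumes one log entry per line and simply keeps every non-empty line that does not start with `[martini]`; the function name `non_martini_entries` is hypothetical:

```python
def non_martini_entries(log_text: str) -> list:
    """Keep non-empty log lines that are not prefixed with "[martini]"."""
    return [
        line
        for line in log_text.splitlines()
        if line.strip() and not line.lstrip().startswith("[martini]")
    ]


sample = """2016/04/14 08:01:57 http: multiple response.WriteHeader calls
[martini] Completed 500 Internal Server Error in 6.858183ms
[martini] Started GET /api/mysql-error-log-tail for ***106.77:58857
[martini] Completed 200 OK in 3.682144ms"""

for entry in non_martini_entries(sample):
    print(entry)
```

This only filters by prefix; it does not decide which of the remaining lines actually come from orchestrator-agent itself.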

shlomi-noach commented 8 years ago

Status on this, or can I close?

hagay3 commented 8 years ago

You can close the issue; the error stopped showing up (I didn't have the chance to debug it).