hep-gc / dynafed_storagestats

Apache License 2.0

check for offline endpoints #24

Closed MarcusEbert closed 5 years ago

MarcusEbert commented 6 years ago

If an endpoint is currently offline, the program hangs (forever? I only waited a few minutes).
It needs a check for offline endpoints and should only try to reach endpoints that are online.

ffgalindo commented 6 years ago

For what type of endpoint did you see this happen?

MarcusEbert commented 6 years ago

That was for minio endpoints where the machine wasn't running anymore or where the minio process wasn't running anymore (in that case, the ones on cc-east and cc-west). It probably happens whenever a machine is down or nothing is listening on the correct port anymore?

ffgalindo commented 6 years ago

The issue was that it was using the libraries' (boto3 and requests) default connection timeouts. And for some reason for boto3, which is used for the generic S3 method, it's infinite. I've explicitly set timeouts of 5 seconds for the S3 and DAV methods. The fix is on the "dev" branch. Would it be useful to allow users to set this timeout via an option passed to the script?
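
For illustration, here is a minimal sketch of what a bounded connection attempt looks like, using only the standard library rather than the script's actual boto3/requests code. The function name `is_reachable` and its parameters are hypothetical; the 5 second default mirrors the value mentioned above. Both libraries accept explicit timeouts in a similar spirit (a `timeout` argument in requests, `connect_timeout`/`read_timeout` in botocore's `Config`).

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds.

    Hypothetical helper, not part of dynafed_storagestats.
    """
    try:
        # create_connection() gives up after `timeout` seconds instead of
        # blocking on a machine that is down or a port nobody listens on.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, DNS failures and timeouts.
        return False
```

With a check like this, an endpoint whose machine or minio process is down fails fast instead of stalling the whole run.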

MarcusEbert commented 6 years ago

Couldn't we use the information dynafed has already about the endpoints? Maybe something like:

What do you think?

EDIT: I mean the max latency defined for using an endpoint.

ffgalindo commented 6 years ago

I think polling memcached to use the online/offline status is a good idea. I'll think about how to work it into the logic.

For the timeout, rather than the latency, using what is set up as the conn_timeout option would make more sense. And if that is not set, default it to 5 seconds or so.



MarcusEbert commented 6 years ago

Not sure what is better to use, max latency or conn_timeout. I think conn_timeout can be long because it's the time dynafed lets a client wait, for example after an initial request for a file while dynafed first has to find out which endpoints have it. My understanding is that this is not what dynafed uses to decide whether an endpoint is offline, but rather how long a client may have to wait until it gets the link.

In any case, if an endpoint has a high latency, dynafed doesn't use it (and max latency is usually lower than conn_timeout). Also, this wouldn't be a final decision; it would only be valid until the next run of the storage plugin, by which time the latency may be lower again and the endpoint gets used.

What I mean is: with latency > max latency the endpoint is not used anyway, so why poll it?

I don't know right now what happens when the latency is higher than max latency, maybe the endpoint gets switched to offline then anyway?

ffgalindo commented 6 years ago

I see, you mean the actual latency measured for the endpoint by UGR. I thought you meant the max_latency setting. OK, I can see the appeal of that. I wonder if it would work well for endpoints with really low latency, <100 ms. I suppose it should, but maybe it should have a lower bound of at least 1 second or so.

The conn_timeout, at least according to the Dynafed manual, is the "TCP connection timeout (in seconds) to use when establishing a connection to this endpoint". So I figured that if the script uses this, it would be a consistent setting across Dynafed and this script.

I'll finish implementing the status check first and then decide what to use for the timeout.

MarcusEbert commented 6 years ago

No, I mean the max_latency setting. The reason for that is that there are 2 possibilities:
1) measured latency < max_latency then endpoint is used
2) measured latency > max_latency then endpoint is not used even if it is online

So from that it seems max_latency is a good choice for the timeout, since anything slower than max_latency will not be used anyway.
However, if the real measured latency is just a few ms, or in general lower than max_latency, then it is probably not a good value to use as the timeout, as you mentioned. That's why I would propose the max_latency setting, since it is what decides whether an endpoint is used or not.
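
To make the two cases concrete, here is a small sketch of the rule being discussed, combined with the lower bound of about 1 second suggested earlier in the thread. The function names are hypothetical, and it assumes max_latency is expressed in milliseconds; check the Dynafed configuration before relying on that unit.

```python
def endpoint_is_usable(measured_latency_ms: float, max_latency_ms: float) -> bool:
    """Case 1 vs. case 2: an endpoint is used only if its measured latency is below max_latency.

    Treating equality as "not used" is an arbitrary choice for this sketch.
    """
    return measured_latency_ms < max_latency_ms


def probe_timeout_seconds(max_latency_ms: float, floor_seconds: float = 1.0) -> float:
    """Derive a probe timeout from the max_latency setting (assumed to be in ms).

    The floor keeps very small max_latency values from producing an
    unreasonably short timeout.
    """
    return max(max_latency_ms / 1000.0, floor_seconds)
```

For example, a max_latency of 10000 would give a 10 second timeout, while a value of 100 would be raised to the 1 second floor.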

ffgalindo commented 6 years ago

OK, for now it's just using a fixed 5 second timeout until I set it to use an option from the config; that should be coming next. For now, it grabs Dynafed's endpoint connection stats from memcached (if they exist), compares that list to the configured endpoints, and flags each one to be checked or not depending on whether it is online or offline. I've also set up multithreading to speed up the process, so that it does not need to wait for each endpoint to reply and be processed before processing the rest. It still waits until all of them have been checked before outputting data to memcached or stdout. I've also added a -v/--verbose option so that it prints the log information on stderr according to the loglevel. This happens regardless of the --stdout flag; if that flag is set, the stats are printed at the end of the run.
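
A rough sketch of that flow, not the script's actual code: the cached status is passed in as a plain dict standing in for whatever is read from memcached, and `fetch` is a hypothetical per-endpoint callable. Endpoints with no cached status are checked anyway, which is an assumption about the "if it exists" case.

```python
from concurrent.futures import ThreadPoolExecutor


def select_endpoints(configured, status_by_id):
    """Split configured endpoint IDs into (to_check, skipped) using cached status."""
    to_check, skipped = [], []
    for endpoint in configured:
        # Assumption: an endpoint with no cached status is treated as online.
        if status_by_id.get(endpoint, "online") == "offline":
            skipped.append(endpoint)
        else:
            to_check.append(endpoint)
    return to_check, skipped


def _safe_fetch(fetch, endpoint):
    """Run fetch(endpoint); a failing endpoint yields None instead of aborting the run."""
    try:
        return fetch(endpoint)
    except Exception:
        return None


def collect_stats(configured, status_by_id, fetch, max_workers=8):
    """Query all online endpoints concurrently and return {endpoint: stats or None}."""
    to_check, skipped = select_endpoints(configured, status_by_id)
    results = {endpoint: None for endpoint in skipped}
    if to_check:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = {ep: pool.submit(_safe_fetch, fetch, ep) for ep in to_check}
        # Leaving the `with` block waits for every worker, so output only
        # happens once all endpoints have been checked.
        for endpoint, future in futures.items():
            results[endpoint] = future.result()
    return results
```

Threads suit this case because each check spends most of its time waiting on the network, so one slow endpoint no longer delays the others.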

MarcusEbert commented 6 years ago

Sounds good! I'll test it when I'm back.

ffgalindo commented 5 years ago

The conn_timeout setting, which is usually found in /etc/ugr/ugr.conf as a global setting but can appear anywhere in the configuration files or individually for each endpoint, is now used to set the timeout for queries/requests.
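
As a sketch of how such a lookup could work, the function below takes already-parsed options as plain dicts; it does not parse ugr.conf itself. The function name is hypothetical, and the precedence (per-endpoint value, then global value, then the 5 second default discussed earlier) is an assumption rather than confirmed behaviour.

```python
DEFAULT_TIMEOUT_SECONDS = 5.0


def resolve_conn_timeout(endpoint_options, global_options,
                         default=DEFAULT_TIMEOUT_SECONDS):
    """Pick a timeout in seconds from parsed options.

    Assumed precedence: endpoint conn_timeout, then global conn_timeout,
    then `default`. Missing, non-numeric, or non-positive values are skipped.
    """
    for options in (endpoint_options, global_options):
        raw = options.get("conn_timeout")
        if raw is None:
            continue
        try:
            value = float(raw)
        except (TypeError, ValueError):
            continue
        if value > 0:
            return value
    return default
```

Falling back to a fixed default keeps the script from reverting to a library's own timeout when the option is absent.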