syndicate-storage / syndicate

Internet-scale software-defined storage system
Apache License 2.0

Request retry #81

Closed (iychoi closed this 10 years ago)

iychoi commented 10 years ago

When the network is unstable, requests to other Gateways are sometimes lost, and subsequent operations hang.

Timeout and retry logic is needed.

jcnelson commented 10 years ago

By default, Syndicate waits 10 seconds to connect and 300 seconds on transfer before timing out (see md_default_conf() in libsyndicate/libsyndicate.cpp). This might be too long for you. You can override them in a config file, using METADATA_CONNECT_TIMEOUT="xxx" and TRANSFER_TIMEOUT="xxx", respectively.

By default, the Gateway methods are meant to fail fast (no retry), so the application can decide how to handle the error. However, if you want to add network-level retries in your branch, all the relevant methods are in UG/network.cpp (particularly fs_entry_download_manifest(), fs_entry_download_block(), fs_entry_post_write(), and fs_entry_send_write_or_coordinate()).

iychoi commented 10 years ago

Okay, reducing the TRANSFER_TIMEOUT value will help. But I'm still not sure where to put the retries. UG-IPC or H-Syndicate (the Hadoop-Syndicate connector) could potentially handle this. Let me figure out how NFS deals with it.

iychoi commented 10 years ago

It seems NFS has built-in retries for requests, since it uses UDP as its transport protocol. Also, providing retries at the Syndicate UG level (both UG-FUSE and UG-IPC) would give better compatibility with other user programs. What do you think?

jcnelson commented 10 years ago

I think we can add support for retrying up to a certain number of times (probably no more than 5 times), if they can be done in rapid succession (i.e. no more than a few seconds between attempts). I'd want the UG to handle transient message loss transparently, but longer-lived failures (such as a chronic partition or offline gateway) shouldn't hang the system.
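The bounded-retry idea described here could be sketched roughly as follows. This is a minimal illustration, not code from Syndicate: `retry_bounded`, its parameters, and the choice of `-ETIMEDOUT` and `-EAGAIN` as the "transient" error codes are all assumptions made for the example.

```cpp
#include <cerrno>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not part of Syndicate): retry a fail-fast operation
// a bounded number of times, pausing briefly between attempts.
// `op` returns 0 on success or a negative errno-style code on failure.
int retry_bounded(const std::function<int()>& op,
                  int max_attempts,
                  std::chrono::milliseconds delay) {
    int rc = -EINVAL;  // returned unchanged if max_attempts <= 0
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        rc = op();
        if (rc == 0) {
            return 0;  // success
        }
        // Assumption: only timeouts and EAGAIN count as transient loss.
        // Anything else is returned immediately so the caller fails fast.
        if (rc != -ETIMEDOUT && rc != -EAGAIN) {
            return rc;
        }
        if (attempt < max_attempts) {
            std::this_thread::sleep_for(delay);  // short pause before retrying
        }
    }
    return rc;  // attempts exhausted: report the last error instead of hanging
}
```

A caller might wrap a download as `retry_bounded(download_fn, 5, std::chrono::seconds(1))`, where `download_fn` is a placeholder for whatever fail-fast request is being retried. Because both the attempt count and the delay are capped, a chronic partition or offline gateway still surfaces as an error after a few seconds.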

iychoi commented 10 years ago

Got it, I'll look into this after my M.S. thesis :-)

jcnelson commented 10 years ago

UGs now retry manifest requests between one another, as of db4f18e44032d8379ba6b4116150a42a72366f67.

jcnelson commented 10 years ago

Metadata reads from the MS will now be retried at most max_read_metadata_retry times, and manifest reads from gateways will now be retried at most max_read_retry times (both fields are in struct md_syndicate_conf). You can set their default values in md_default_conf() in libsyndicate/libsyndicate.cpp.
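To illustrate how two separate limits like these might be applied, here is a rough sketch. Only the field names `max_read_metadata_retry` and `max_read_retry` come from the comment above; `retry_conf_sketch` and `read_with_retry` are hypothetical stand-ins, and treating the limit as "retries after the first attempt" is an assumption, since the thread does not say how the real code counts.

```cpp
#include <functional>

// Stand-in holding only the two retry fields named above. The real
// struct md_syndicate_conf has other members and is not reproduced here.
struct retry_conf_sketch {
    int max_read_metadata_retry;  // limit for metadata reads from the MS
    int max_read_retry;           // limit for manifest reads from gateways
};

// Hypothetical helper: run `read_op` once, then retry it up to `max_retry`
// more times while it keeps failing. `read_op` returns 0 on success.
int read_with_retry(const std::function<int()>& read_op, int max_retry) {
    int rc = read_op();
    for (int i = 0; i < max_retry && rc != 0; ++i) {
        rc = read_op();  // another attempt after a failure
    }
    return rc;  // 0 on success, otherwise the last error code
}
```

With this shape, a metadata read would be called as `read_with_retry(op, conf.max_read_metadata_retry)` and a manifest read as `read_with_retry(op, conf.max_read_retry)`, so each kind of request can be tuned independently.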

iychoi commented 10 years ago

Alright thanks :+1: