Closed by earthgecko 7 years ago
Unfortunately, flock() is only an advisory lock and rsync ignores it. This means that if Graphite updates a particular Whisper file while rsync is in the middle of backing it up, the backup copy might be inconsistent.
Really interesting observation. And one that I hadn't considered.
If rsync is ignoring flock() on specific files, then it seems reasonable that a file can be transmitted in an inconsistent state. However, since whisper wraps the update transaction in a lock, and the sync job rsyncs the source files to /tmp on the target host in a batch before merging the values into the target database, this risk only affects us if whisper is inside a create or update transaction at the moment rsync reads and sends the file.
I'm honestly not sure how to solve this and continue to use rsync, if rsync isn't willing to respect a shared lock. Do you have any ideas here?
Removing the rsync dependency and having carbon-sync use the carbonlink protocol directly to pull data is an option, but one that will require significant work and tuning.
Well, if there were a pip-installable libflockit (https://github.com/smerritt/flockit) it could be added to requirements :) Or perhaps something similar could be done with Python's fcntl.flock()
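The fcntl.flock() idea could look something like this minimal Python sketch: take a shared lock on the .wsp file before copying it, so a writer holding an exclusive flock (as whisper does with WHISPER_LOCK_WRITES) blocks the read mid-update. The function name and the plain shutil copy are illustrative only, not carbonate's actual code.

```python
# Sketch only: hold a shared flock() on a whisper file while copying it,
# so an exclusive-locking writer cannot update it mid-copy.
import fcntl
import shutil

def copy_with_shared_lock(wsp_path, dest_path):
    """Copy a whisper file while holding a shared advisory lock on it."""
    with open(wsp_path, "rb") as fh:
        # Blocks while a writer holds LOCK_EX on the same file.
        fcntl.flock(fh.fileno(), fcntl.LOCK_SH)
        try:
            shutil.copyfile(wsp_path, dest_path)
        finally:
            fcntl.flock(fh.fileno(), fcntl.LOCK_UN)
```

Note this only helps against writers that also use flock(); like libflockit, it is still an advisory scheme, not a mandatory lock.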
On either end at the moment we do: flockit.so rsync the .wsp file to a tmp location -> rsync the .wsp to the remote tmp location -> flockit.so rsync the remote tmp .wsp to the physical .wsp file
env LD_PRELOAD=/usr/lib/libflockit.so FLOCKIT_FILE_PREFIX=$WHISPER_PATH rsync $OPTION_THINGS <src> <dest>
would do it all.
But seeing as the file on the dest is being replaced and healed in situ locally, there should be no need to use libflockit.so on the dest unless it is a file that carbon is updating via a relay, etc. (i.e. it already exists).
Are you suggesting we apply a shared lock around the batch of files to sync in https://github.com/graphite-project/whisper/blob/master/whisper.py#L540-L542? I worry about that causing problems by blocking the local carbons, but with a sufficiently small batch size it might be fine.
Yeah, we flock rsync per file. Figured it was the only way to ensure that we are not creating a Schrödinger's-cat type scenario, short of somehow validating every timeseries data set.
I'd be OK with that, but I wonder if it should be the default. I'm still worried that, since we transfer files in batches of 1000 at a time to avoid the overhead of setting up a connection for every file, we'll cause the carbon processes on the source node to block for an unreasonable amount of time. I'm unsure how to handle this.
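The small-batch idea discussed above could be sketched roughly like this: hold shared flocks on only a handful of files at a time while the sync step runs, so carbon's writers are blocked briefly per batch rather than for a 1000-file transfer. The helper names, batch size, and the sync_fn callback are assumptions for illustration, not carbonate's real API.

```python
# Sketch: lock and sync files in small batches so carbon writers are
# only blocked for the duration of one small batch at a time.
import fcntl
from contextlib import ExitStack

def sync_in_batches(paths, sync_fn, batch_size=50):
    for start in range(0, len(paths), batch_size):
        batch = paths[start:start + batch_size]
        with ExitStack() as stack:
            for p in batch:
                fh = stack.enter_context(open(p, "rb"))
                # Shared lock: readers coexist, writers with LOCK_EX wait.
                fcntl.flock(fh.fileno(), fcntl.LOCK_SH)
            sync_fn(batch)  # e.g. invoke rsync on just this batch
        # All locks in the batch are released here, letting carbon
        # writers proceed before the next batch is locked.
```

Tuning batch_size trades connection-setup overhead against how long any one writer can be blocked.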
Want to send over a PR with your ideas? We can riff on that and get this solved.
Forked, no quick promises though, have libcloud, fog and skyline pull requests that need to be done that are lagging. This social coding malarkey, who thought that was a good idea ;)
hi,
thanks for carbonate! I was thinking about locking too. Related to this, whisper.LOCK seems to be set by carbon when WHISPER_LOCK_WRITES is set in graphite's config (see https://github.com/graphite-project/whisper/issues/97). However, if I'm reading correctly, carbonate doesn't explicitly lock, so I think this is also possible:
@filippog - That seems accurate. I believe we want to duplicate the WHISPER_LOCK_WRITES directive in the carbonate config file, modify the import so we have access to LOCK, and set it accordingly so update_many() calls flock before making changes. Does this sound right? Want to take a stab at a PR or should I?
sounds good @jssjr ! I've split this into #47 so we can followup there, I'll take a stab at a PR shortly!
Great effort, whisper can be a pain.
I have always been led to believe that "Graphite will lock the Whisper file using flock() during updates so that no other Graphite process reads it mid-update and finds inconsistent data.
Unfortunately, flock() is only an advisory lock and rsync ignores it. This means that if Graphite updates a particular Whisper file while rsync is in the middle of backing it up, the backup copy might be inconsistent." from the boys at SwiftStack - @smerritt
Having looked through the code, I do not see any reference to flock in any way. Is this just trivial because of the context of healing to the local whisper file? If so, I think a pull request adding an ssh key option is in order from me.