Open vijayp opened 8 years ago
Hi @vijayp -- could you run keybase log send on each of the devices so we can get a better sense about what's going on? It could be that one of the devices is in a "conflicted" state, but we should be able to automatically recover from that, so I'll need to see the logs to know what's really going on.
Also, can you please give us some context about what you were doing to get into this state? Were you trying to do lots of writes on both devices at the same time?
Hi @vijayp -- you are running version 1.0.0-20160204170048+c846d32 of KBFS on your device mb11, which is over two months old. I believe the bug I describe below is fixed in later versions. Can you please try keybase update run, and see if that fixes things? Note that it might take a while to fix itself, given how far behind it is.
It seems that this device was offline/suspended for about an hour and missed thousands of writes from a different device. As soon as the laptop/process woke back up, it began to write again. Because it hadn't yet seen the other writes, this device went into a conflicted mode, where it wrote its data to the server on a different "branch", intending to resolve the conflicts in the background asynchronously. (When a device writes to a branch, its changes aren't visible to any other devices.)
Conflict resolution can't run to completion until the writes settle down. In this case that happened after another few thousand writes on the conflict branch. At some point, one of the FS operations was interrupted (possibly with a ctrl-c?), leading to a situation where the device didn't know its latest write had succeeded on the conflict branch, so now the device was in conflict with itself. That's usually fine, but future writes will fail until conflict resolution succeeds.
In this case, conflict resolution hit a snag (putting the log message here for search reasons):
2016-04-12T10:09:12.285018 ▶ [DEBU kbfs(CR 4393fdf7) conflict_resolver.go:2997] 1e79b6 Finished conflict resolution: No chain found for BlockPointer{ID: 018d8c2eccffde3fd7e224c68c58e2914579fb08ee51420175c3b38b237a3b8fb3, KeyGen: 1, DataVer: 1, Creator: 8827b4ad1e95ee0ead52a3fc5d24de19} [tags:CRID=ASOw5FV6awTZPvAuGcDSTg]
All future attempts at conflict resolution fail with the same error. I believe this error has been fixed since the version you are running. I'm not 100% positive that it is fixable with an update, but I think that's the first thing to try. After you update, please give it some time (an hour? depends on your network connection) and see if it successfully resolves the conflict. If not, check back with us. At that point, probably the best thing to do will be for us to delete your conflict branch for mb11 (losing whatever data you wrote on that device while you were conflicted).
Sorry for the inconvenience!
alright, i've updated keybase and i'm leaving it plugged in with caffeinate blocking to prevent it from going to sleep. let's see how it goes. Happy to wipe the branch and re-upload if that would be helpful.
Thanks @vijayp. I can tell from our server logs that it's probably not fixed by the update. It's possible the bug is rooted in the state that has already been written, rather than the code. (I can't find the exact issue where we fixed this problem, but I definitely remember seeing that same error and fixing it in the last two months.)
Can you check the KBFS log for recent lines complaining about "No chain found", just to make sure the same thing is still happening? If so, I'd recommend just wiping the branch if you're cool with it. If you want us to do that, it's best to get your public permission. Could you please run the following command on one of your devices and post the output here, if you want that? Make sure to substitute the current date and time where indicated.
keybase sign -m "<DATE_AND_TIME>: Please clear the conflict branches for folder /keybase/private/vijayp."
Hm, there's no line saying no chain found, but the missing files are also not there. Should we still wipe this conflict branch? I guess I also don't fully understand why there would be conflict branches.
If each file is encrypted with a different salt, there could be potential conflicts between specific files, which only a client could resolve. Is that what's happening? The FS doesn't need to adhere to any ordering, so I guess I don't see why we need to delete all the files, and we can't just eliminate the one conflict that's somehow stuck?
Anyway, if you believe this is the fastest way to fix this, I'm happy to sign that message
vijayps-MacBook:vijayp vijayp$ grep -i 'No chain found' ~/Library/Logs/keybase.kbfs.log
vijayps-MacBook:vijayp vijayp$
@vijayp: if you're willing to put the latest log dump in /keybase/private/strib,vijayp, I can take a look. Note that newer versions of KBFS actually roll over logs, so please include everything that matches ~/Library/Logs/keybase.kbfs.* (That may explain why your grep didn't catch the error, though maybe it's legitimately some other bug.)
I'll write up a separate comment about why conflict resolution happens.
@strib ah, that's it. It was in a rotated file! So I guess this confirms that this is the bug I saw. I'll email you a signed message to wipe the conflict trees then.
vijayps-MacBook:strib,vijayp vijayp$ grep -i 'No chain found' ~/Library/Logs/keybase.kbfs*
/Users/vijayp/Library/Logs/keybase.kbfs.log:2016-04-13T15:19:25.023461 ▶ [DEBU kbfs(CR 4393fdf7) conflict_resolver.go:3102] 353ab0 Finished conflict resolution: No chain found for BlockPointer{ID: 01f23be466842142875a81f31bbe40d2a794fc5d3b513d450fbd8666001a975719, KeyGen: 1, DataVer: 1, Creator: 8827b4ad1e95ee0ead52a3fc5d24de19} [tags:CRID=gmdwWHLR9pVpnsCIba-jEw]
/Users/vijayp/Library/Logs/keybase.kbfs.log:2016-04-13T15:23:58.679561 ▶ [DEBU kbfs(CR 4393fdf7) conflict_resolver.go:3102] 36c3ec Finished conflict resolution: No chain found for BlockPointer{ID: 01cd02d72bddaa5177ce1f900fce468228121337fe6a227c44d50f0793241360a8, KeyGen: 1, DataVer: 1, Creator: 8827b4ad1e95ee0ead52a3fc5d24de19} [tags:CRID=junWbeWXXQBZTaCLeplKTg]
Re: conflict resolution: our servers maintain strict consistency on the "main" view of a folder. If a device pushes an update that doesn't directly follow the latest known update, our server rejects it. As you say, this is because the server doesn't know anything about the contents of the folder, and cannot determine for itself whether the update is mergeable or not. More than just different salts, the server doesn't even know anything about file names. All it sees is a top-level metadata entry for the folder (including a strictly monotonic revision number) and a set of content-addressed blocks.
So that's why it's up to the device to do the conflict resolution, and then eventually push a new update to the main branch of the folder including the complete resolution. It does mean that during periods of heavy, multi-device writing, the view of the folder between different writing devices will diverge temporarily. And in the case of horrible, shameful bugs like this, the divergence is permanent until we intervene.
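(Editor's illustration, not KBFS code.) The "directly follows" rule described above can be sketched as a toy server that accepts a main-branch update only when its revision number is exactly one past the latest it knows about. The class and method names (`MainBranchServer`, `put`) are hypothetical, and the real server and metadata format are certainly more involved.

```python
class MainBranchServer:
    """Toy stand-in for the server's view of one folder's main branch.

    Hypothetical sketch: the payload is opaque to the server, which
    only compares revision numbers.
    """

    def __init__(self):
        self.latest_revision = 0
        self.history = []  # opaque (encrypted) metadata, never inspected

    def put(self, revision, opaque_metadata):
        # Accept only the update that directly follows the latest one.
        if revision != self.latest_revision + 1:
            return False
        self.latest_revision = revision
        self.history.append(opaque_metadata)
        return True


server = MainBranchServer()

# Two devices both last saw revision 0 and both try to write revision 1.
accepted_a = server.put(1, b"update from device A")
accepted_b = server.put(1, b"update from device B")  # stale, rejected
```

In this sketch the rejected device would have to keep its writes somewhere else (the "conflict branch" in the thread) until it has merged them against the newer main revision.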
Ah ok, great!
Is there a doc that describes this protocol? I'm curious about how it's actually implemented. Or we could maybe discuss in person sometime! I'm still not 100% clear on the implications of this in terms of which conflict wins. Are the conflicts always processed in the same order regardless of client? I guess if the outcome is nondeterministic (either because of divergent client implementation or randomness in the process) I wonder whether there really is a benefit to enforcing a consistent main view.
This doc has a "Conflict Resolution" section that attempts to go through the main cases, though I'm sure a few corner cases fall through the cracks since it's not a technical, systematic discussion. Happy to discuss in person or on hangout (I'm in SF) any time.
Conflict resolution on a given device, against a given revision of the main branch, is always deterministic. If the device finishes conflict resolution and tries to update the main branch with the resolution, and fails because some other update beat it there first, it just starts all over again (hopefully being able to re-use some of the work it did last time). Each device does conflict resolution between its own local conflict branch and the main branch, so there are never multiple devices trying to resolve the exact same conflicts. Does that make sense?
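(Editor's illustration, not KBFS code.) The retry behaviour described above can be sketched as a loop: merge the local changes against the current main revision, try to push, and start over if another update got there first. Everything here is an assumption made for the example: `ToyServer`, `merge`, `resolve_and_push`, the `before_push` hook used to simulate a competing device, and the use of a set union as the merge.

```python
class ToyServer:
    """Hypothetical main-branch server holding a revision and a file set."""

    def __init__(self):
        self.revision = 0
        self.state = frozenset()

    def snapshot(self):
        return self.revision, self.state

    def try_put(self, revision, state):
        # Reject anything that does not directly follow the latest revision.
        if revision != self.revision + 1:
            return False
        self.revision, self.state = revision, frozenset(state)
        return True


def merge(main_state, local_changes):
    """Deterministic for a given pair of inputs: a plain set union."""
    return main_state | local_changes


def resolve_and_push(server, local_changes, before_push=None, max_attempts=10):
    """Retry merging against the latest main revision until a push lands."""
    for attempt in range(1, max_attempts + 1):
        base_revision, base_state = server.snapshot()
        merged = merge(base_state, frozenset(local_changes))
        if before_push is not None:
            before_push(attempt)  # test hook: lets another writer race us
        if server.try_put(base_revision + 1, merged):
            return attempt
    raise RuntimeError("gave up after too many attempts")


server = ToyServer()


def competing_writer(attempt):
    # Another device wins the race, but only during the first attempt.
    if attempt == 1:
        server.try_put(server.revision + 1, server.state | {"b.txt"})


attempts = resolve_and_push(server, {"a.txt"}, before_push=competing_writer)
```

Here the first push fails because the competing write took revision 1, and the second attempt succeeds against the newer main state, so both files end up in the main view.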
oh, that's interesting. I'll read the doc. I guess this means I have to keep all my devices on and potentially connected to the internet until conflicts are resolved. That's an interesting decision given that conflicted branches are stored in kb's servers. I suppose it would be more complicated if any device could resolve any other device's conflict.
Yeah, it's true. And we could change that in the future; there's no real architectural reason other devices can't do the resolution, it's just simpler not to at this point. But in general, barring bugs and epic device write battles, conflict resolution should be very fast, and most users wouldn't even notice that it's happened.
Also, since the data is stored on our servers, you don't have to keep the device connected until the resolution happens, you just have to re-connect it eventually.
Ok @vijayp that conflict branch for your device has been wiped. Try restarting KBFS on that device, and you should see the main view of the folder again.
I am in a similar situation. My macOS Keybase files are in conflict and automatic resolution is not happening.
▶ ERROR MD revision 31 isn't next in line for our current revision 31
(or maybe I don't understand how this is supposed to work). I have two computers, "new-host-8" (don't ask) and "vijayps-macbook"
new-host-8:vijayp vijayp$ find /keybase/private/vijayp | wc -l
3075
vijayps-MacBook:vijayp vijayp$ find /keybase/private/vijayp | wc -l
1761
Diffing the files:
new-host-8:vijayp vijayp$ find /keybase/private/vijayp >/tmp/a
new-host-8:vijayp vijayp$ ssh vijayps-MacBook find /keybase/private/vijayp | sort > /tmp/b
Password:
new-host-8:vijayp vijayp$ find /keybase/private/vijayp | sort >/tmp/a
new-host-8:vijayp vijayp$ diff /tmp/{a,b}
shows legitimate file differences.
I killed and restarted all daemons on my macbook, but those files are still not showing up. Any other suggestions? They seem to be showing up on my desktop, but now I'm wondering whether they're actually in cache or there for real. Maybe I should reinstall everything and see?