Interesting. What do you mean by "freeze", i.e. what freezes? Complete daemon handling or high system load?
Yep. Nothing responds, in particular `fuse` mounts. If I kill the daemon, it does a proper shutdown, and after a restart everything is back.
It seems to be triggered by `repin`. I have a file (`/trand2`) which is not present in `ipfs`, i.e. I removed that pin from `ipfs` and there is no replica of it anywhere. The freeze starts at:
```
17.01.2021/13:21:57 ⚐ catfs/repin.go:250: repin started (min=1 max=10 quota=5GB)
17.01.2021/13:21:58 ⚠ catfs/repin.go:110: The <file /trand2:2W9rNbfuMjsWRTjeDoqU7hRNSy9ormEpcLLvsjj1gk1ASeqk8b:9> should be cached, but it is not. Recaching
```
As soon as I see `Recaching` in the log, everything freezes. About 4 minutes later (I guess it is a timeout), we see `failed to authenticate connection: Auth package is oversized:...`, but everything is still frozen.

This is weird, since recaching should be the equivalent of `ipfs pin add <hash>`; it should not be blocking.
Confirming: the freeze starts after `repin` kicks in. I reduced the repin timeout to 10s and saw `brig` freeze about 10s after start. Note that the freeze condition requires a hash that is missing from `ipfs`, as described above.
After some digging through the `ipfs` docs, I found that `ipfs pin add` is synchronous. There is a potential for async capabilities, but it is far down the `ipfs` backlog.

Consequently, we should be very careful when we request a pin which is not present anywhere in `ipfs` globally. `ipfs pin add` apparently has no timeout and will never return in that case, and thus the repinner will never finish either.
`pin add` is fundamental for `brig`; this is how it gets files into the backend cache. But it is a dangerous operation if the pin is missing, and on top of that the repinner must not block. Any ideas how to achieve this? Ideally, our repinner should create a queue of pins to be added and kill requests that take too long after a given timeout. What would be an example of such code?
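A minimal sketch of what such a non-blocking pin could look like: the pin runs in its own goroutine and the caller gives up after a timeout. `Backend` and `pinWithTimeout` are hypothetical names, not `brig`'s actual API, and note that the underlying `ipfs` call cannot be cancelled this way and may keep running in the background:

```go
package repin

import (
	"fmt"
	"time"
)

// Backend is a stand-in for the real pinning backend; Pin(hash) is a
// hypothetical signature, not brig's actual API.
type Backend interface {
	Pin(hash string) error
}

// pinWithTimeout runs the pin in its own goroutine, so the caller never
// blocks for longer than the timeout. The channel is buffered so a late
// result does not leak the goroutine; the underlying ipfs call itself
// cannot be cancelled and may keep running in the background.
func pinWithTimeout(bk Backend, hash string, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() {
		done <- bk.Pin(hash)
	}()

	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("pin of %s timed out after %s", hash, timeout)
	}
}
```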
Just to make sure: did you check that `brig` actually blocks in the `Pin()` call? I also wanted to mention that you can always kill a blocked process with `CTRL+\` (i.e. SIGQUIT). This will make the Go runtime print out a stack trace of all goroutines, which makes it easier to see where it blocks at that moment.
> `pin add` is fundamental for `brig`; this is how it gets files into the backend cache. But it is a dangerous operation if the pin is missing, and on top of that the repinner must not block. Any ideas how to achieve this? Ideally, our repinner should create a queue of pins to be added and kill requests that take too long after a given timeout. What would be an example of such code?
Were you pinning files that are not available locally? Or, asked differently: what changed that made the repinner block?
But generally, yes, we should make sure that all pins are actually carried through. This is a classic queuing pattern: in our application logic we should treat pinning as a non-blocking operation. Internally, the repinner would add each pin/unpin request to a persistent queue that is then worked on by a number of worker go-routines. On a `brig` restart we would check for items that are still in the queue and retry them. I'd recommend BadgerDB as a store for the queue, since we're using it anyway. It's a key-value store that you can iterate quickly in lexicographic order, so you can just use a timestamp as the key to make sure requests are processed in insertion order.
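To make the idea concrete, here is a minimal sketch of such a persistent queue on top of BadgerDB. The `PinQueue` type and its methods are made up for illustration (this is not `brig`'s actual code), and it assumes the `github.com/dgraph-io/badger/v2` import path:

```go
package repin

import (
	"fmt"
	"time"

	badger "github.com/dgraph-io/badger/v2"
)

// PinQueue is a persistent FIFO of pin requests. Keys are fixed-width
// nanosecond timestamps, so badger's lexicographic iteration yields
// the requests in insertion order.
type PinQueue struct {
	db *badger.DB
}

func OpenPinQueue(path string) (*PinQueue, error) {
	db, err := badger.Open(badger.DefaultOptions(path))
	if err != nil {
		return nil, err
	}
	return &PinQueue{db: db}, nil
}

// Enqueue records a pin request and returns immediately.
func (q *PinQueue) Enqueue(hash string) error {
	// Zero-padded to a fixed width, so lexicographic order == insertion order.
	key := []byte(fmt.Sprintf("%020d", time.Now().UnixNano()))
	return q.db.Update(func(txn *badger.Txn) error {
		return txn.Set(key, []byte(hash))
	})
}

// Next returns the oldest queued request; hash is "" if the queue is empty.
func (q *PinQueue) Next() (key []byte, hash string, err error) {
	err = q.db.View(func(txn *badger.Txn) error {
		it := txn.NewIterator(badger.DefaultIteratorOptions)
		defer it.Close()

		it.Rewind()
		if !it.Valid() {
			return nil // queue is empty
		}

		item := it.Item()
		key = item.KeyCopy(nil)
		return item.Value(func(v []byte) error {
			hash = string(v)
			return nil
		})
	})
	return key, hash, err
}

// Ack deletes a request once its pin has actually been carried through.
func (q *PinQueue) Ack(key []byte) error {
	return q.db.Update(func(txn *badger.Txn) error {
		return txn.Delete(key)
	})
}
```

A worker goroutine would then loop: take `Next()`, attempt the pin with a timeout, and `Ack()` the key only on success, so anything unfinished survives a restart and is retried.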
As a workaround for now, we can start each pin of the repinner in a separate go-routine. That will not handle things correctly when `brig` crashes or restarts, but that way you can focus on your fuse work before jumping topics.
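A sketch of that workaround, reusing the hypothetical `Backend` and `pinWithTimeout` from the sketch above (plus the standard library `log` and `time` imports):

```go
// Fire-and-forget dispatch: the repin loop itself can no longer freeze,
// but failed pins are only logged; nothing is retried after a crash or
// restart.
func repinAll(bk Backend, hashes []string) {
	for _, h := range hashes {
		go func(hash string) {
			if err := pinWithTimeout(bk, hash, 30*time.Second); err != nil {
				log.Printf("repin of %s failed: %v", hash, err)
			}
		}(h)
	}
}
```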
> Just to make sure: did you check that `brig` actually blocks in the `Pin()` call?
Yes, I am quite sure. I sprinkled debug info around https://github.com/sahib/brig/blob/c29348532553a533edee7e249574b48b8cb51f23/catfs/repin.go#L111. The call to `fs.bk.Pin` was made, but it never came back.
> I also wanted to mention that you can always kill a blocked process with `CTRL+\` (i.e. SIGQUIT). This will make the Go runtime print out a stack trace of all goroutines, which makes it easier to see where it blocks at that moment.
This is a nice feature. I did not know about it.
> Were you pinning files that are not available locally? Or, asked differently: what changed that made the repinner block?
I unpinned the underlying `ipfs` hash and ran `ipfs repo gc` to clear hard drive space. But the reference was still in the repinner's list of pins it needs to have, so it kept calling for it.
> But generally, yes, we should make sure that all pins are actually carried through. This is a classic queuing pattern: in our application logic we should treat pinning as a non-blocking operation. Internally, the repinner would add each pin/unpin request to a persistent queue that is then worked on by a number of worker go-routines. On a `brig` restart we would check for items that are still in the queue and retry them. I'd recommend BadgerDB as a store for the queue, since we're using it anyway. It's a key-value store that you can iterate quickly in lexicographic order, so you can just use a timestamp as the key to make sure requests are processed in insertion order.
>
> As a workaround for now, we can start each pin of the repinner in a separate go-routine. That will not handle things correctly when `brig` crashes or restarts, but that way you can focus on your fuse work before jumping topics.
Well, I was clearly messing with hashes, so this is unlikely to be triggered in normal use, unless different instances have different pin depths. I will put this bug on hold and focus on the fuse layer.
Will be tackled in #91.
I see strange freezes in the current `develop` branch (add27490). How it looks: after `brig` starts, everything works fine. But after about 25 minutes, I see in the log that the size of the connection is scarily big; it is unclear what triggers it. No `brig` clients were running at that moment.