sahib / brig

File synchronization on top of ipfs with a git-like interface & web-based UI
https://brig.readthedocs.io
GNU Affero General Public License v3.0

Freeze in current develop branch #87

Closed: evgmik closed this issue 3 years ago

evgmik commented 3 years ago

I see strange freezes in the current develop branch (commit add27490).

How it looks: after the brig daemon starts, everything works fine. But after about 25 minutes, I see in the log:

17.01.2021/12:26:14 ⚠ net/server.go:369: failed to authenticate connection: Auth package is oversized: 8316306196091711251
17.01.2021/12:34:37 ⚠ net/server.go:369: failed to authenticate connection: Auth package is oversized: 8316306196091711251

The reported auth package size is scarily big, but it is unclear what triggers it. No brig clients were running at that moment.

sahib commented 3 years ago

Interesting. What do you mean by "freeze", i.e. what freezes? Complete daemon handling or high system load?

evgmik commented 3 years ago

Yep, nothing responds, in particular the fuse mounts. If I kill the daemon, it does a proper shutdown, and after a restart everything is back to normal.

evgmik commented 3 years ago

It seems to be triggered by repin. I have a file (/trand2) which is not present in ipfs, i.e. I removed its pin from ipfs and there is no replica of it anywhere. So the freeze starts when

17.01.2021/13:21:57 ⚐ catfs/repin.go:250: repin started (min=1 max=10 quota=5GB)
17.01.2021/13:21:58 ⚠ catfs/repin.go:110: The <file /trand2:2W9rNbfuMjsWRTjeDoqU7hRNSy9ormEpcLLvsjj1gk1ASeqk8b:9> should be cached, but it is not. Recaching

As soon as I see Recaching in the log, everything freezes. About 4 minutes later (I guess it is the timeout) we see failed to authenticate connection: Auth package is oversized: ..., but everything is still frozen.

This is weird: since recaching should be equivalent to an ipfs pin add of the hash, it should not be blocking.

evgmik commented 3 years ago

Confirming: the freeze starts after repin kicks in. I reduced the repin timeout to 10s and saw brig freeze about 10s after start. Note that the freeze requires a hash that is missing from ipfs, as described above.

evgmik commented 3 years ago

After some digging through the ipfs docs, I found that ipfs pin add is synchronous. There is a potential to add async capabilities, but that is far down the ipfs backlog.

Consequently, we should be very careful when we request a pin which is not present anywhere in ipfs globally: ipfs pin add apparently has no timeout and will never return, and thus the repinner will never finish either.

pin add is fundamental for brig: this is how it gets files into the backend cache. But it is a dangerous operation if the pinned content is missing. On top of that, the repinner must not block. Any ideas how to achieve this? Ideally, our repinner should create a queue of pins to be added, and kill requests which take longer than a given timeout. What would be an example of such code?
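
Roughly, I imagine something like the sketch below. This is not actual brig code; pinWithTimeout and pinFn are made-up names, and pinFn stands in for whatever blocking call the backend exposes (e.g. fs.bk.Pin):

```go
package main

import (
	"fmt"
	"time"
)

// pinWithTimeout runs a blocking pin call and gives up after `timeout`.
// Since ipfs offers no way to cancel a running `pin add`, the goroutine
// doing the actual call is leaked if it never returns.
func pinWithTimeout(pinFn func() error, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() {
		done <- pinFn()
	}()

	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("pin did not finish within %s", timeout)
	}
}

func main() {
	// Simulate a pin of a missing hash that hangs forever.
	hangingPin := func() error { select {} }
	fmt.Println(pinWithTimeout(hangingPin, 2*time.Second))
}
```

A timed-out pin would still have to be remembered somewhere so it can be retried later, otherwise the file silently never reaches the cache.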

sahib commented 3 years ago

Just to make sure: did you check that brig actually blocks in the Pin() call? I also wanted to mention that you can always kill a blocked process with CTRL+\. This will make it print a stack trace of all goroutines, which makes it easier to see where it blocks at that moment.

pin add is fundamental for brig: this is how it gets files into the backend cache. But it is a dangerous operation if the pinned content is missing. On top of that, the repinner must not block. Any ideas how to achieve this? Ideally, our repinner should create a queue of pins to be added, and kill requests which take longer than a given timeout. What would be an example of such code?

Were you pinning files that are not available locally? Or, asked differently: what changed that made the repinner block?

But generally, yes, we should make sure that all pins are actually carried through. This is a classic queuing pattern: in our application logic we should treat pinning as a non-blocking operation. Internally the repinner would add each pin/unpin request to a persistent queue that is then worked on by a number of worker go-routines. On a brig restart we would check for items that are still in the queue and retry them. I'd recommend BadgerDB as a store for the queue, since we're using it anyways. It's a key-value store that you can iterate quickly in lexicographic key order, so you can just use a timestamp as the key to make sure requests are processed in the order they were added.
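
To make that concrete, here is a rough sketch of such a queue. The type, key layout and method names are invented for illustration and assume the badger v3 API; the real implementation would have to live alongside brig's existing BadgerDB usage:

```go
package pinqueue

import (
	"fmt"
	"time"

	badger "github.com/dgraph-io/badger/v3"
)

// PinQueue is a sketch of a persistent pin request queue backed by BadgerDB.
// Keys are zero-padded nanosecond timestamps, so iterating in the store's
// lexicographic key order yields requests oldest-first.
type PinQueue struct {
	db *badger.DB
}

// Enqueue records a pin request and returns immediately, so callers can
// treat pinning as a non-blocking operation.
func (q *PinQueue) Enqueue(hash string) error {
	key := []byte(fmt.Sprintf("%020d", time.Now().UnixNano()))
	return q.db.Update(func(txn *badger.Txn) error {
		return txn.Set(key, []byte(hash))
	})
}

// Next returns the oldest queued request, or ok=false if the queue is empty.
func (q *PinQueue) Next() (key []byte, hash string, ok bool, err error) {
	err = q.db.View(func(txn *badger.Txn) error {
		it := txn.NewIterator(badger.DefaultIteratorOptions)
		defer it.Close()
		it.Rewind()
		if !it.Valid() {
			return nil
		}
		item := it.Item()
		key = item.KeyCopy(nil)
		val, verr := item.ValueCopy(nil)
		if verr != nil {
			return verr
		}
		hash, ok = string(val), true
		return nil
	})
	return key, hash, ok, err
}

// Done removes a finished request from the queue.
func (q *PinQueue) Done(key []byte) error {
	return q.db.Update(func(txn *badger.Txn) error {
		return txn.Delete(key)
	})
}

// Worker drains the queue forever; pinFn stands in for the real backend
// pin call. Failed or still-pending items stay in the queue and are
// retried on the next pass, which also covers the restart case.
func (q *PinQueue) Worker(pinFn func(hash string) error) {
	for {
		key, hash, ok, err := q.Next()
		if err != nil || !ok {
			time.Sleep(time.Second)
			continue
		}
		if err := pinFn(hash); err != nil {
			time.Sleep(time.Second)
			continue
		}
		_ = q.Done(key)
	}
}
```

Note that in this naive form several Worker goroutines would grab the same oldest item; a real version would mark items as in-flight (or shard by key range) before handing them out.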

As a workaround for now, we can start each pin of the repinner in a separate go-routine. That will not handle things correctly when brig crashes or restarts, but that way you can focus on your fuse work before jumping topics.
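
For the quick workaround, something as small as this hypothetical helper would do, with pinFn again standing in for fs.bk.Pin and onErr for whatever logging we use:

```go
// asyncPin fires the (possibly blocking) pin call off in its own
// goroutine so the repinner loop can continue immediately.
func asyncPin(pinFn func() error, onErr func(error)) {
	go func() {
		if err := pinFn(); err != nil {
			onErr(err)
		}
	}()
}
```

The obvious downside, as said, is that nothing remembers the pin if the daemon dies before the call finishes.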

evgmik commented 3 years ago

Just to make sure: did you check that brig actually blocks in the Pin() call?

Yes, I am quite sure. I sprinkled debug output around https://github.com/sahib/brig/blob/c29348532553a533edee7e249574b48b8cb51f23/catfs/repin.go#L111. fs.bk.Pin was called, but never returned.

I also wanted to mention that you can always kill a blocked process with CTRL+\. This will make it print a stack trace of all goroutines, which makes it easier to see where it blocks at that moment.

This is a nice feature. I did not know about it.

Were you pinning files that are not available locally? Or, asked differently: what changed that made the repinner block?

I unpinned the underlying ipfs hash and ran ipfs repo gc to clear hard drive space. But the reference was still in the repinner's list of hashes that need to be cached, so it kept requesting it.

But generally, yes, we should make sure that all pins are actually carried through. This is a classic queuing pattern: in our application logic we should treat pinning as a non-blocking operation. Internally the repinner would add each pin/unpin request to a persistent queue that is then worked on by a number of worker go-routines. On a brig restart we would check for items that are still in the queue and retry them. I'd recommend BadgerDB as a store for the queue, since we're using it anyways. It's a key-value store that you can iterate quickly in lexicographic key order, so you can just use a timestamp as the key to make sure requests are processed in the order they were added.

As a workaround for now, we can start each pin of the repinner in a separate go-routine. That will not handle things correctly when brig crashes or restarts, but that way you can focus on your fuse work before jumping topics.

Well, I was clearly messing with the hashes manually, so this is not likely to be triggered otherwise, unless different instances have different pin depths.

I will put this bug on hold and focus on the fuse layer.

sahib commented 3 years ago

Will be tackled in #91.