Open ericvicenti opened 1 year ago
the problem with that document is that the gateway does not provide it according to the DHT: Only one peer provides it (12D3KooWSTPg42qfgeHdjqxLM3mWSaXVUefkPNXrm1knfaB568dW
) and it's probably the offline Dev Bot
ipfs dht findprovs bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu
12D3KooWSTPg42qfgeHdjqxLM3mWSaXVUefkPNXrm1knfaB568dW
Since there is One peer claiming to have it maybe peerswap does not ask direct connected peers.
const defaultReprovideInterval = 12 * time.Hour
but it seems unused now.What if we fail to find it in the DHT we fallback to ask the gateway instead of all connected peers? But we only want that document not all the documents the provider has so we sill have to pass initialObjects
here https://github.com/MintterHypermedia/mintter/blob/main/backend/syncing/discovery.go#L60
WDYT @burdiyan ?
I think the reproviding problem has to do with WHO reprovides rather than WHAT is being reprovided. Here is the experiment. Reprovide as the gateway:
bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu
(which correspond to hm://d/FHD735zUfVznrESN5s9JMz
)bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu
CID to the reproviding system.
2.2.I also see in DHT logs that it's providing keys (does not show what keys), so its consuming CIDs, bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu
is the first CID sent to reprovideipfs dht findprovs bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu
with no luck. empty responseReprovide as a new peer
bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu
) is copiedbafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu
CID to the reproviding system.
2.2.I also see in DHT logs that it's providing keys (does not show what keys), so its consuming CIDs, bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu
is the first CID sent to reprovideipfs dht findprovs bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu
but this time, I can instantly see the peerID of the second nodeFor this reason, I think DHT is banning gateway peerID maybe they think we are a bad actor, constantly flooding the DHT network? WDYT @burdiyan ?
@juligasa Thanks for the feedback!
I would hope that if a DHT node would not accept the providing record we’d get some error logged, but we don’t see any. That is what concerns me and seems weird.
Have you seen any weird errors when providing as a gateway? Maybe try redirecting stdout logs to a file, because DHT logs like crazy and the terminal buffer overflows too quickly to be able to see anything.
@burdiyan After inspecting the logs I only see one type of logs coming from the provider.batched
module:
2023-11-03T09:54:21.752+0100 DEBUG provider.batched provider/reprovider.go:334 starting provide of 1 keys
2023-11-03T09:54:25.111+0100 DEBUG provider.batched provider/reprovider.go:350 finished providing of 1 keys. It took 3.358339984s with an average of 3.358339984s per provide
2023-11-03T09:54:25.111+0100 DEBUG provider.batched provider/reprovider.go:334 starting provide of 1 keys
2023-11-03T09:54:30.372+0100 DEBUG provider.batched provider/reprovider.go:350 finished providing of 1 keys. It took 5.261156921s with an average of 5.261156921s per provide
@juligasa I was talking about DHT logs, not the providing logs. I'm hoping that if DHT providing request fails for some reason we'd get a log about it.
@burdiyan Now I repeated the experiment again and the CID has been successfully provided by both the gateway and the fresh new peer :disappointed: But it wasn't until I used the gateway database that the CID appeared in the DHT so for some reason, the Real gateway still does not provide it (in local tests, I tweaked the code so bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu
is provided first)
As for the dht logs, I see a lot of these:
go-libp2p-kad-dht@v0.25.1/dht.go:707 peer found {"peer": "12D3KooWNqG696UHw4scxw9k8jCLBNfckC2o7Sm9VmD8N3eNK7MA"}
2023-11-08T12:44:26.798+0100 DEBUG dht go-libp2p-kad-dht@v0.25.1/dht.go:720 peer stopped dht {"peer": "12D3KooWBJkb4cduPj8nBHZhcyjYUQf3stNe1BkkTiFhHF6rsXgq"}
and even in the runs that providing ended up working I also saw a lot of these traces:
2023-11-08T12:43:11.856+0100 DEBUG dht go-libp2p-kad-dht@v0.25.1/query.go:545 error connecting: failed to dial: failed to dial 12D3KooWSEiUJ375NJwQKu7P5Ayc8AV927mxJemFNDiJ2EK123M1: all dials failed
* [/ip4/185.92.223.211/udp/4001/quic] QUIC draft-29 has been removed, QUIC (RFC 9000) is accessible with /quic-v1
* [/ip4/185.92.223.211/udp/4001/quic-v1] conn-1263: system: cannot reserve outbound connection: resource limit exceeded
* [/ip4/185.92.223.211/tcp/4001] conn-1305: system: cannot reserve outbound connection: resource limit exceeded
*
This resource limit exceeded comes from the resource manager But I already set:
cfg := rcmgr.PartialLimitConfig{
System: rcmgr.ResourceLimits{
Streams: rcmgr.Unlimited,
Conns: rcmgr.Unlimited,
FD: rcmgr.Unlimited,
Memory: rcmgr.Unlimited64,
},
Transient: rcmgr.ResourceLimits{
Streams: rcmgr.Unlimited,
Conns: rcmgr.Unlimited,
FD: rcmgr.Unlimited,
Memory: rcmgr.Unlimited64,
},
// Everything else is default. The exact values will come from `scaledDefaultLimits` above.
}
I don't know what is causing that resource limit exceeded. and if that is causing dht providing is sues. Actually, I don't even know why dht needs to connect peer to peer to thousands of peers. Leterally within 5 minutes, I get 94386 traces saying resource limit exceeded
Are you setting the resource limit in both mintterd
and mintter-site
?
I think we should open an issue with: https://github.com/libp2p/go-libp2p-kad-dht
That part is in the common code shared by both mintterd
and mintter-site
, however, in the latest code it's only
cfg := rcmgr.PartialLimitConfig{
System: rcmgr.ResourceLimits{
Streams: rcmgr.Unlimited,
Conns: rcmgr.Unlimited,
}
Which should be enough for the system: cannot reserve outbound connection: resource limit exceeded
since it points to system
which is the top level conf and connection
that points to Conns
conf.
But now I don't know if my initial test failed bc of the resource manager issue or for who is reprorviding, They should point us in the right direction although is a bit messy. I will open an issue go-libp2p-kad-dht
anyway
Now the document seems to be provided by a lot of peers:
[centos@ip-172-26-14-128 ~]$ ipfs dht findprovs bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfk
nhdk4zzjjgxu
12D3KooWDZzdDc7RJyRmeMB6jDM4k72zs4L1GWqQz8cNTmLziZro
12D3KooWCTPq7hQ19PK5Zu1oh5ctEfrNuFUPAerrc6reasmGxEcc
12D3KooWEAmX8t2aUyZyEkMVppHapC6qqzy8uuBMyeGtiB863jm4
12D3KooWHZ7buw751nfBka2UPZi9UB5yPJD6U1N5hVhfCCGVpLYk
12D3KooWNDou8c2opXkruc1tii1nREyGRyVcSv39oaKVPgpTefjp
12D3KooWQQF9sSMEGKtf8HsAEiRzGhNgJAvknoL3Dv2E2kRniSbX
12D3KooWLo1YAp2wbLH65H7L1R7GSthzycgqJoWQHC1UemNWwKhz
12D3KooWCEBca769aPZD3aTaCE7YN6VLUoQdKRtDZwwHLYwUFN15
but when putting it on the quick switcher of a fresh new app (ghost town) we can't open it and the errors are:
[DESKTOP:APP] 09:56:08.774 › 2023-11-10T09:56:08.773+0100 DEBUG mintter/network mttnet/connect.go:45 ConnectFinished {"peer": "12D3KooWDZzdDc7RJyRmeMB6jDM4k72zs4L1GWqQz8cNTmLziZro", "error": "failed to handshake with peer 12D3KooWDZzdDc7RJyRmeMB6jDM4k72zs4L1GWqQz8cNTmLziZro: rpc error: code = DeadlineExceeded desc = context deadline exceeded", "Info": "{12D3KooWDZzdDc7RJyRmeMB6jDM4k72zs4L1GWqQz8cNTmLziZro: []}"}
[DESKTOP:APP] 09:56:08.774 › 2023-11-10T09:56:08.773+0100 DEBUG mintter/syncing syncing/discovery.go:63 FinishedSyncingWithProvider {"entity": "hm://d/FHD735zUfVznrESN5s9JMz", "CID": "bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu", "peer": "{12D3KooWDZzdDc7RJyRmeMB6jDM4k72zs4L1GWqQz8cNTmLziZro: []}", "error": "failed to handshake with peer 12D3KooWDZzdDc7RJyRmeMB6jDM4k72zs4L1GWqQz8cNTmLziZro: rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
[DESKTOP:APP] 09:56:08.774 › 2023-11-10T09:56:08.774+0100 DEBUG mintter/network mttnet/connect.go:45 ConnectFinished {"peer": "12D3KooWHZ7buw751nfBka2UPZi9UB5yPJD6U1N5hVhfCCGVpLYk", "error": "failed to handshake with peer 12D3KooWHZ7buw751nfBka2UPZi9UB5yPJD6U1N5hVhfCCGVpLYk: rpc error: code = DeadlineExceeded desc = context deadline exceeded", "Info": "{12D3KooWHZ7buw751nfBka2UPZi9UB5yPJD6U1N5hVhfCCGVpLYk: [/ip4/52.22.139.174/tcp/4002/p2p/12D3KooWGvsbBfcbnkecNoRBM7eUTiuriDqUyzu87pobZXSdUUsJ/p2p-circuit /ip4/23.20.24.146/udp/4002/quic-v1/p2p/12D3KooWNmjM4sMbSkDEA6ShvjTgkrJHjMya46fhZ9PjKZ4KVZYq/p2p-circuit /ip4/23.20.24.146/tcp/4002/p2p/12D3KooWNmjM4sMbSkDEA6ShvjTgkrJHjMya46fhZ9PjKZ4KVZYq/p2p-circuit /ip4/52.22.139.174/udp/4002/quic/p2p/12D3KooWGvsbBfcbnkecNoRBM7eUTiuriDqUyzu87pobZXSdUUsJ/p2p-circuit /ip4/52.22.139.174/udp/4002/quic-v1/p2p/12D3KooWGvsbBfcbnkecNoRBM7eUTiuriDqUyzu87pobZXSdUUsJ/p2p-circuit]}"}
[DESKTOP:APP] 09:56:08.774 › 2023-11-10T09:56:08.774+0100 DEBUG mintter/syncing syncing/discovery.go:63 FinishedSyncingWithProvider {"entity": "hm://d/FHD735zUfVznrESN5s9JMz", "CID": "bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu", "peer": "{12D3KooWHZ7buw751nfBka2UPZi9UB5yPJD6U1N5hVhfCCGVpLYk: []}", "error": "failed to handshake with peer 12D3KooWHZ7buw751nfBka2UPZi9UB5yPJD6U1N5hVhfCCGVpLYk: rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
[DESKTOP:APP] 09:56:08.775 › 2023-11-10T09:56:08.774+0100 DEBUG mintter/network mttnet/connect.go:45 ConnectFinished {"peer": "12D3KooWEAmX8t2aUyZyEkMVppHapC6qqzy8uuBMyeGtiB863jm4", "error": "failed to connect to peer 12D3KooWEAmX8t2aUyZyEkMVppHapC6qqzy8uuBMyeGtiB863jm4: routing: not found", "Info": "{12D3KooWEAmX8t2aUyZyEkMVppHapC6qqzy8uuBMyeGtiB863jm4: []}"}
[DESKTOP:APP] 09:56:08.775 › 2023-11-10T09:56:08.774+0100 DEBUG mintter/syncing syncing/discovery.go:63 FinishedSyncingWithProvider {"entity": "hm://d/FHD735zUfVznrESN5s9JMz", "CID": "bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu", "peer": "{12D3KooWEAmX8t2aUyZyEkMVppHapC6qqzy8uuBMyeGtiB863jm4: []}", "error": "failed to connect to peer 12D3KooWEAmX8t2aUyZyEkMVppHapC6qqzy8uuBMyeGtiB863jm4: routing: not found"}
No matter how many times I retry. Seems like the problem is in the handshake
process which is a remote libp2p call that times out, but we already have a connection between them bc we just did the ´dial´ a few lines before. Any clue @burdiyan ?
Two problems here:
I think we should focus first on providing and discovering providers. The fact that we can't connect to peers after we've discovered them I think is either a separate issue, or something that we could tackle ourselves. Could be something related to relays or hole-punching, etc. The first and the most mysterious issues IMO is that DHT providing/discovery works unreliably.
After the user enters the app for the first time and tries to open a document by URL in the quick-switcher, the document cannot load
This will result in bad experiences when people click "open in mintter app" from the website, even after they have the app running
For example try this document url, which loads fine in the gateway: https://hyper.media/d/FHD735zUfVznrESN5s9JMz