Cannot open document with fresh installation

ericvicenti commented 1 year ago

After the user enters the app for the first time and tries to open a document by URL in the quick-switcher, the document cannot load

This will result in bad experiences when people click "open in mintter app" from the website, even after they have the app running

For example try this document url, which loads fine in the gateway: https://hyper.media/d/FHD735zUfVznrESN5s9JMz

juligasa commented 1 year ago

the problem with that document is that the gateway does not provide it according to the DHT: Only one peer provides it (12D3KooWSTPg42qfgeHdjqxLM3mWSaXVUefkPNXrm1knfaB568dW) and it's probably the offline Dev Bot

ipfs dht findprovs bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu
12D3KooWSTPg42qfgeHdjqxLM3mWSaXVUefkPNXrm1knfaB568dW

Since there is One peer claiming to have it maybe peerswap does not ask direct connected peers.

Should we force to scan connected peers?
Should we force the gateway to re-provide more often? I see we had defined const defaultReprovideInterval = 12 * time.Hour but it seems unused now.

juligasa commented 1 year ago

What if we fail to find it in the DHT we fallback to ask the gateway instead of all connected peers? But we only want that document not all the documents the provider has so we sill have to pass initialObjects here https://github.com/MintterHypermedia/mintter/blob/main/backend/syncing/discovery.go#L60 WDYT @burdiyan ?

juligasa commented 1 year ago

I think the reproviding problem has to do with WHO reprovides rather than WHAT is being reprovided. Here is the experiment. Reprovide as the gateway:

Took the gateway folder with hundreds of CIDs to reprovide, among which we have bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu (which correspond to hm://d/FHD735zUfVznrESN5s9JMz)
Run that locally 2.1 I see in our logs to successfully send bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu CID to the reproviding system. 2.2.I also see in DHT logs that it's providing keys (does not show what keys), so its consuming CIDs, bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu is the first CID sent to reprovide
In another machine, run ipfs dht findprovs bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu with no luck. empty response

Reprovide as a new peer

Took the gateway db included inside the previous gateway folder. This time only the db (with hundreds of CIDs to reprovide, among which we have bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu) is copied
Run with a new identity different from the one before. 2.1 I see in our logs to successfully send bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu CID to the reproviding system. 2.2.I also see in DHT logs that it's providing keys (does not show what keys), so its consuming CIDs, bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu is the first CID sent to reprovide
In another machine, run ipfs dht findprovs bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu but this time, I can instantly see the peerID of the second node

For this reason, I think DHT is banning gateway peerID maybe they think we are a bad actor, constantly flooding the DHT network? WDYT @burdiyan ?

burdiyan commented 1 year ago

@juligasa Thanks for the feedback!

I would hope that if a DHT node would not accept the providing record we’d get some error logged, but we don’t see any. That is what concerns me and seems weird.

Have you seen any weird errors when providing as a gateway? Maybe try redirecting stdout logs to a file, because DHT logs like crazy and the terminal buffer overflows too quickly to be able to see anything.

juligasa commented 1 year ago

@burdiyan After inspecting the logs I only see one type of logs coming from the provider.batched module:

2023-11-03T09:54:21.752+0100    DEBUG   provider.batched        provider/reprovider.go:334      starting provide of 1 keys
2023-11-03T09:54:25.111+0100    DEBUG   provider.batched        provider/reprovider.go:350      finished providing of 1 keys. It took 3.358339984s with an average of 3.358339984s per provide
2023-11-03T09:54:25.111+0100    DEBUG   provider.batched        provider/reprovider.go:334      starting provide of 1 keys
2023-11-03T09:54:30.372+0100    DEBUG   provider.batched        provider/reprovider.go:350      finished providing of 1 keys. It took 5.261156921s with an average of 5.261156921s per provide

burdiyan commented 1 year ago

@juligasa I was talking about DHT logs, not the providing logs. I'm hoping that if DHT providing request fails for some reason we'd get a log about it.

juligasa commented 1 year ago

@burdiyan Now I repeated the experiment again and the CID has been successfully provided by both the gateway and the fresh new peer :disappointed: But it wasn't until I used the gateway database that the CID appeared in the DHT so for some reason, the Real gateway still does not provide it (in local tests, I tweaked the code so bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu is provided first) As for the dht logs, I see a lot of these:

go-libp2p-kad-dht@v0.25.1/dht.go:707    peer found      {"peer": "12D3KooWNqG696UHw4scxw9k8jCLBNfckC2o7Sm9VmD8N3eNK7MA"}
2023-11-08T12:44:26.798+0100    DEBUG   dht     go-libp2p-kad-dht@v0.25.1/dht.go:720    peer stopped dht        {"peer": "12D3KooWBJkb4cduPj8nBHZhcyjYUQf3stNe1BkkTiFhHF6rsXgq"}

and even in the runs that providing ended up working I also saw a lot of these traces:

2023-11-08T12:43:11.856+0100    DEBUG   dht     go-libp2p-kad-dht@v0.25.1/query.go:545  error connecting: failed to dial: failed to dial 12D3KooWSEiUJ375NJwQKu7P5Ayc8AV927mxJemFNDiJ2EK123M1: all dials failed
  * [/ip4/185.92.223.211/udp/4001/quic] QUIC draft-29 has been removed, QUIC (RFC 9000) is accessible with /quic-v1
  * [/ip4/185.92.223.211/udp/4001/quic-v1] conn-1263: system: cannot reserve outbound connection: resource limit exceeded
  * [/ip4/185.92.223.211/tcp/4001] conn-1305: system: cannot reserve outbound connection: resource limit exceeded
  *

This resource limit exceeded comes from the resource manager But I already set:

cfg := rcmgr.PartialLimitConfig{
    System: rcmgr.ResourceLimits{
        Streams: rcmgr.Unlimited,
        Conns:   rcmgr.Unlimited,
        FD:      rcmgr.Unlimited,
        Memory:  rcmgr.Unlimited64,
    },
    Transient: rcmgr.ResourceLimits{
        Streams: rcmgr.Unlimited,
        Conns:   rcmgr.Unlimited,
        FD:      rcmgr.Unlimited,
        Memory:  rcmgr.Unlimited64,
    },
    // Everything else is default. The exact values will come from `scaledDefaultLimits` above.
}

I don't know what is causing that resource limit exceeded. and if that is causing dht providing is sues. Actually, I don't even know why dht needs to connect peer to peer to thousands of peers. Leterally within 5 minutes, I get 94386 traces saying resource limit exceeded

burdiyan commented 1 year ago

Are you setting the resource limit in both mintterd and mintter-site?

I think we should open an issue with: https://github.com/libp2p/go-libp2p-kad-dht

juligasa commented 1 year ago

That part is in the common code shared by both mintterd and mintter-site, however, in the latest code it's only

cfg := rcmgr.PartialLimitConfig{
    System: rcmgr.ResourceLimits{
    Streams: rcmgr.Unlimited,
        Conns:   rcmgr.Unlimited,
}

Which should be enough for the system: cannot reserve outbound connection: resource limit exceeded since it points to system which is the top level conf and connection that points to Conns conf. But now I don't know if my initial test failed bc of the resource manager issue or for who is reprorviding, They should point us in the right direction although is a bit messy. I will open an issue go-libp2p-kad-dht anyway

juligasa commented 1 year ago

Now the document seems to be provided by a lot of peers:

[centos@ip-172-26-14-128 ~]$ ipfs dht findprovs bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfk
nhdk4zzjjgxu
12D3KooWDZzdDc7RJyRmeMB6jDM4k72zs4L1GWqQz8cNTmLziZro
12D3KooWCTPq7hQ19PK5Zu1oh5ctEfrNuFUPAerrc6reasmGxEcc
12D3KooWEAmX8t2aUyZyEkMVppHapC6qqzy8uuBMyeGtiB863jm4
12D3KooWHZ7buw751nfBka2UPZi9UB5yPJD6U1N5hVhfCCGVpLYk
12D3KooWNDou8c2opXkruc1tii1nREyGRyVcSv39oaKVPgpTefjp
12D3KooWQQF9sSMEGKtf8HsAEiRzGhNgJAvknoL3Dv2E2kRniSbX
12D3KooWLo1YAp2wbLH65H7L1R7GSthzycgqJoWQHC1UemNWwKhz
12D3KooWCEBca769aPZD3aTaCE7YN6VLUoQdKRtDZwwHLYwUFN15

but when putting it on the quick switcher of a fresh new app (ghost town) we can't open it and the errors are:

[DESKTOP:APP] 09:56:08.774 › 2023-11-10T09:56:08.773+0100       DEBUG   mintter/network mttnet/connect.go:45    ConnectFinished {"peer": "12D3KooWDZzdDc7RJyRmeMB6jDM4k72zs4L1GWqQz8cNTmLziZro", "error": "failed to handshake with peer 12D3KooWDZzdDc7RJyRmeMB6jDM4k72zs4L1GWqQz8cNTmLziZro: rpc error: code = DeadlineExceeded desc = context deadline exceeded", "Info": "{12D3KooWDZzdDc7RJyRmeMB6jDM4k72zs4L1GWqQz8cNTmLziZro: []}"}
[DESKTOP:APP] 09:56:08.774 › 2023-11-10T09:56:08.773+0100       DEBUG   mintter/syncing syncing/discovery.go:63 FinishedSyncingWithProvider     {"entity": "hm://d/FHD735zUfVznrESN5s9JMz", "CID": "bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu", "peer": "{12D3KooWDZzdDc7RJyRmeMB6jDM4k72zs4L1GWqQz8cNTmLziZro: []}", "error": "failed to handshake with peer 12D3KooWDZzdDc7RJyRmeMB6jDM4k72zs4L1GWqQz8cNTmLziZro: rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
[DESKTOP:APP] 09:56:08.774 › 2023-11-10T09:56:08.774+0100       DEBUG   mintter/network mttnet/connect.go:45    ConnectFinished {"peer": "12D3KooWHZ7buw751nfBka2UPZi9UB5yPJD6U1N5hVhfCCGVpLYk", "error": "failed to handshake with peer 12D3KooWHZ7buw751nfBka2UPZi9UB5yPJD6U1N5hVhfCCGVpLYk: rpc error: code = DeadlineExceeded desc = context deadline exceeded", "Info": "{12D3KooWHZ7buw751nfBka2UPZi9UB5yPJD6U1N5hVhfCCGVpLYk: [/ip4/52.22.139.174/tcp/4002/p2p/12D3KooWGvsbBfcbnkecNoRBM7eUTiuriDqUyzu87pobZXSdUUsJ/p2p-circuit /ip4/23.20.24.146/udp/4002/quic-v1/p2p/12D3KooWNmjM4sMbSkDEA6ShvjTgkrJHjMya46fhZ9PjKZ4KVZYq/p2p-circuit /ip4/23.20.24.146/tcp/4002/p2p/12D3KooWNmjM4sMbSkDEA6ShvjTgkrJHjMya46fhZ9PjKZ4KVZYq/p2p-circuit /ip4/52.22.139.174/udp/4002/quic/p2p/12D3KooWGvsbBfcbnkecNoRBM7eUTiuriDqUyzu87pobZXSdUUsJ/p2p-circuit /ip4/52.22.139.174/udp/4002/quic-v1/p2p/12D3KooWGvsbBfcbnkecNoRBM7eUTiuriDqUyzu87pobZXSdUUsJ/p2p-circuit]}"}
[DESKTOP:APP] 09:56:08.774 › 2023-11-10T09:56:08.774+0100       DEBUG   mintter/syncing syncing/discovery.go:63 FinishedSyncingWithProvider     {"entity": "hm://d/FHD735zUfVznrESN5s9JMz", "CID": "bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu", "peer": "{12D3KooWHZ7buw751nfBka2UPZi9UB5yPJD6U1N5hVhfCCGVpLYk: []}", "error": "failed to handshake with peer 12D3KooWHZ7buw751nfBka2UPZi9UB5yPJD6U1N5hVhfCCGVpLYk: rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
[DESKTOP:APP] 09:56:08.775 › 2023-11-10T09:56:08.774+0100       DEBUG   mintter/network mttnet/connect.go:45    ConnectFinished {"peer": "12D3KooWEAmX8t2aUyZyEkMVppHapC6qqzy8uuBMyeGtiB863jm4", "error": "failed to connect to peer 12D3KooWEAmX8t2aUyZyEkMVppHapC6qqzy8uuBMyeGtiB863jm4: routing: not found", "Info": "{12D3KooWEAmX8t2aUyZyEkMVppHapC6qqzy8uuBMyeGtiB863jm4: []}"}
[DESKTOP:APP] 09:56:08.775 › 2023-11-10T09:56:08.774+0100       DEBUG   mintter/syncing syncing/discovery.go:63 FinishedSyncingWithProvider     {"entity": "hm://d/FHD735zUfVznrESN5s9JMz", "CID": "bafkqahlinu5c6l3ef5deqrbxgm2xuvlgkz5g44sfknhdk4zzjjgxu", "peer": "{12D3KooWEAmX8t2aUyZyEkMVppHapC6qqzy8uuBMyeGtiB863jm4: []}", "error": "failed to connect to peer 12D3KooWEAmX8t2aUyZyEkMVppHapC6qqzy8uuBMyeGtiB863jm4: routing: not found"}

No matter how many times I retry. Seems like the problem is in the handshake process which is a remote libp2p call that times out, but we already have a connection between them bc we just did the ´dial´ a few lines before. Any clue @burdiyan ?

juligasa commented 1 year ago

Two problems here:

Why provided CID's does not seem to be actually provided?. HDT problem maybe it has to do with either spamming the network or connection manager issues. Strange since dht should not be aware of the connection manager according to this issue https://github.com/libp2p/go-libp2p-kad-dht/issues/806
Even if it's provided, we cannot connect to the peer providing it bc either is offline or we have handshake issues. I think this is in fact an internal problem of our app. Should we split those two and tackle one in-hose and the other in an issue to dht? @burdiyan ?

burdiyan commented 1 year ago

I think we should focus first on providing and discovering providers. The fact that we can't connect to peers after we've discovered them I think is either a separate issue, or something that we could tackle ourselves. Could be something related to relays or hole-punching, etc. The first and the most mysterious issues IMO is that DHT providing/discovery works unreliably.

seed-hypermedia / mintter

Cannot open document with fresh installation #1482