ipfs / js-ipfs

IPFS implementation in JavaScript
https://js.ipfs.tech

intermittent ERR_PUTTING_TO_ROUTING on IPFS.name.publish #4267

Closed algorista closed 1 year ago

algorista commented 1 year ago

the problem

This exception occurs almost every time I try to publish to IPNS, and it makes IPNS unusable unless there is a reasonable workaround.

the debug

I debugged the source within the limits of my time and found the root of the problem: the line below, where a .put on the DHTDatastore forwards the .put to all peers in the libp2p DHT.

https://github.com/ipfs/js-ipfs/blob/6be59068cc99c517526bfa123ad475ae05fcbaef/packages/ipfs-core/src/ipns/routing/dht-datastore.js#L27

What happens is that almost always one peer or another times out, which triggers an exception. That exception is propagated, without any handling, to the upper layer, which generalizes the error as ERR_DB_WRITE_FAILED in Errors.dbWriteFailedError, and it is generalized again one layer higher as ERR_PUTTING_TO_ROUTING:

https://github.com/ipfs/js-datastore-core/blob/6851f672c3c60b5fff8131a7c450164aef50ffe9/src/tiered.js#L58

 async put (key, value, options) {
    try {
      await Promise.all(this.stores.map(store => store.put(key, value, options)))
    } catch (err) {
      throw Errors.dbWriteFailedError()
    }
  }

Below is my crazy monkey-style debug code (I don't have time, I need this working now):

  async put (key, value, options) {
    try {
      //await drain(this._dht.put(key, value, options))

      // iterate the DHT put events one by one instead of drain() so every event gets logged
      console.log("\n\n\n\n\nLOG: DHTDatastore.put->item START");
      const source = this._dht.put(key, value, options);
      for await (const item of source) {
        // this.log_count is a debug counter added to the class just for these logs
        console.log("\nLOG: DHTDatastore.put->item", this.log_count++, "->", item, "\n----------------------------------");
      }
      console.log("\nLOG: DHTDatastore.put->item END");
    } catch (/** @type {any} */ err) {
      console.log("EXCEPTION: DHTDatastore.put", err.stack)
      log.error(err)
      console.log("\\===========================/")

      throw err
    }
  }
LOG: DHTDatastore.put->item 957 -> {
  from: Ed25519PeerIdImpl [PeerId(12D3KooWBCNW4SjEjSYy5KgMDMPRp2eAs2PFSnMfASf6REh2DgAJ)] {
    type: 'Ed25519',
    multihash: Digest {
      code: 0,
      size: 36,
      digest: [Uint8Array],
      bytes: [Uint8Array]
    },
    privateKey: undefined,
    publicKey: Uint8Array(36) [
        8,   1,  18,  32,  20, 126, 215,  94, 102,
      237, 135,  58, 179,  38, 127,  31, 137, 206,
       63,  57, 203, 216, 240,  94, 238, 108, 115,
       56,  93,  38, 162,  58,  10, 252,   4, 199
    ]
  },
  error: [AggregateError: All promises were rejected] {
    code: 'ERR_TIMEOUT',
    [errors]: [ [AbortError], [Error], [Error] ]
  },
  name: 'QUERY_ERROR',
  type: 3
} 
----------------------------------
EXCEPTION: DHTDatastore.put Error: Query aborted
    at AbortSignal.<anonymous> (file:///home/algorista/prj-xxxxxx/xxxxxx-poc/node_modules/@libp2p/kad-dht/dist/src/query/query-path.js:145:37)
    at AbortSignal.[nodejs.internal.kHybridDispatch] (node:internal/event_target:736:20)
    at AbortSignal.dispatchEvent (node:internal/event_target:678:26)
    at abortSignal (node:internal/abort_controller:292:10)
    at AbortController.abort (node:internal/abort_controller:323:5)
    at AbortSignal.onAbort (/home/algorista/prj-xxxxxx/xxxxxx-poc/node_modules/any-signal/index.js:11:16)
    at AbortSignal.[nodejs.internal.kHybridDispatch] (node:internal/event_target:736:20)
    at AbortSignal.dispatchEvent (node:internal/event_target:678:26)
    at abortSignal (node:internal/abort_controller:292:10)
    at AbortController.abort (node:internal/abort_controller:323:5)
\===========================/
Uncaught:
Error: ipns record for f5uxa3ttf4ackcacciqqgqlnqgn647kdrz3alsdluldttcxd6orqsepev64b2zgmnnt36zfh could not be stored in the routing - Error: Write failed
    at Module.dbWriteFailedError (file:///home/algorista/prj-xxxxxx/xxxxxx-poc/node_modules/datastore-core/src/errors.js:23:16)
    at TieredDatastore.put (file:///home/algorista/prj-xxxxxx/xxxxxx-poc/node_modules/datastore-core/src/tiered.js:58:20)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at runNextTicks (node:internal/process/task_queues:65:3)
    at processTimers (node:internal/timers:499:9)
    at async IpnsPublisher._publishEntry (file:///home/algorista/prj-xxxxxx/xxxxxx-poc/node_modules/ipfs-core/src/ipns/publisher.js:90:19)
    at async IpnsPublisher._putRecordToRouting (file:///home/algorista/prj-xxxxxx/xxxxxx-poc/node_modules/ipfs-core/src/ipns/publisher.js:77:5)
    at async IPNS.publish (file:///home/algorista/prj-xxxxxx/xxxxxx-poc/node_modules/ipfs-core/src/ipns/index.js:47:7)
    at async publish (file:///home/algorista/prj-xxxxxx/xxxxxx-poc/node_modules/ipfs-core/src/components/name/publish.js:92:20)
    at IpnsPublisher._publishEntry (file:///home/algorista/prj-xxxxxx/xxxxxx-poc/node_modules/ipfs-core/src/ipns/publisher.js:99:21)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at runNextTicks (node:internal/process/task_queues:65:3)
    at processTimers (node:internal/timers:499:9)
    at async IpnsPublisher._putRecordToRouting (file:///home/algorista/prj-xxxxxx/xxxxxx-poc/node_modules/ipfs-core/src/ipns/publisher.js:77:5)
    at async IPNS.publish (file:///home/algorista/prj-xxxxxx/xxxxxx-poc/node_modules/ipfs-core/src/ipns/index.js:47:7)
    at async publish (file:///home/algorista/prj-xxxxxx/xxxxxx-poc/node_modules/ipfs-core/src/components/name/publish.js:92:20)
    at async REPL215:1:33 {
  code: 'ERR_PUTTING_TO_ROUTING'
}

I tried to identify whether the error happens on a specific item of the 'drain', but that's not how it manifests: the exception appears at the end, just before the last item (I suppose it is the last one, but I can't guarantee that).

other potential implications

I suspect this same bug also introduces a memory leak, a big one, but that is just a guess based on my experience with JS.

other 'issues' presenting the same problem

I will list below two other issues that report the same problem. The palliative solutions the users found were either a retry loop (a sketch of it follows the links below) or restarting the node, so that for a few moments all the peers are freshly connected and unlikely to time out. I tested this second case and confirmed that right after a restart the DHT is not connected to any peer, so the publish does nothing, not even a timeout; it is not a solution.

https://github.com/ipfs/js-ipfs/issues/4202

https://github.com/ipfs/js-ipfs/issues/4023
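For reference, the retry-loop palliative looks roughly like this (a sketch only, with a hypothetical helper name and retry count; it is not a fix):

  async function publishWithRetry (ipfs, value, attempts = 5) {
    let lastError
    for (let i = 0; i < attempts; i++) {
      try {
        return await ipfs.name.publish(value)
      } catch (err) {
        // retry only on the routing error this issue is about
        if (err.code !== 'ERR_PUTTING_TO_ROUTING') throw err
        lastError = err
      }
    }
    throw lastError
  }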

the solution

I made a fork and introduced these modifications. It's not an elegant solution, but it does the trick and I'll be able to get my app live later today.

https://github.com/algorista/js-ipfs/commit/0ee74f20f7ef681592ca19a3a7784dc95686c3c6

  async put (key, value, options) {
    try {
      //REMOVED: await drain(this._dht.put(key, value, options))
      // iterate the DHT put events manually so that individual "Query aborted" errors
      // (peer timeouts) are ignored instead of failing the whole publish
      const responses = this._dht.put(key, value, options);
      while (true) {
        let item;
        try {
          item = await responses.next();
        } catch (e) {
          // swallow aborted sub-queries; rethrow anything else
          if (e.message !== "Query aborted") throw e;
        }
        if (item && item.done) break;
      }
    } catch (/** @type {any} */ err) {
      log.error(err)
      throw err
    }
  }

I believe the elegant solution would be to properly handle the abort events in the layers closest to libp2p, ensure that at least some peers replicated the IPNS record, and only throw if none were reached. I also think there's room for more parallelization in that .put. With a little time I will study this calmly.
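Something along these lines is what I have in mind (an untested sketch; treating every event other than QUERY_ERROR as progress, and the final error message, are my own assumptions):

  async put (key, value, options) {
    let progressed = 0
    try {
      for await (const event of this._dht.put(key, value, options)) {
        // QUERY_ERROR events (like the timeout shown above) should not fail the whole publish
        if (event.name !== 'QUERY_ERROR') {
          progressed++
        }
      }
    } catch (/** @type {any} */ err) {
      // only give up if nothing at all was achieved before the query died
      if (progressed === 0) {
        log.error(err)
        throw err
      }
    }
    if (progressed === 0) {
      throw new Error('IPNS record could not be put to any peer')
    }
  }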

final considerations

If anyone arrives here like I did, looking for an urgent solution, you might consider temporarily installing from my fork, or monkeypatching from there.

npm install 'https://github.com/algorista/js-ipfs.git#algorista'
welcome[bot] commented 1 year ago

Thank you for submitting your first issue to this repository! A maintainer will be here shortly to triage and review. In the meantime, please double-check that you have provided all the necessary information to make this process easy! Any information that can help save additional round trips is useful! We currently aim to give initial feedback within two business days. If this does not happen, feel free to leave a comment. Please keep an eye on how this issue will be labeled, as labels give an overview of priorities, assignments and additional actions requested by the maintainers.

Finally, remember to use https://discuss.ipfs.io if you just need general support.

algorista commented 1 year ago

I want to add that I couldn't get the GitHub fork to work via npm, nor could I build it through Lerna. So all that's left is to monkeypatch directly in the node_modules folder, which solves the problem on the server but does not solve it in the bundle that goes to the web user.

It's still a problem.

  Lerna (powered by Nx)   Cannot read properties of undefined (reading 'data')
algorista commented 1 year ago

I also want to add that the warning below must come from the same bug in the .put to the libp2p DHT, because each timeout adds an abort listener and overflows Node's default limit. After all, there are almost 300 connections open, so dozens of timeouts can occur at any moment.

(node:14920) MaxListenersExceededWarning: Possible EventTarget memory leak detected. 11 abort listeners added to [AbortSignal]. Use events.setMaxListeners() to increase limit
(Use `node --trace-warnings ...` to show where the warning was created)
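As a stopgap for the warning itself (not for the underlying listener build-up), Node's events.setMaxListeners can raise the limit, as the warning suggests:

  // silences the warning only; it does not fix the abort-listener build-up
  import { setMaxListeners } from 'events'

  setMaxListeners(100) // default is 10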
RangerMauve commented 1 year ago

Was running into this too and ended up switching to Kubo with ipfs-http-client (which introduced another set of bugs 😛)

achingbrain commented 1 year ago

js-IPFS is in the process of being deprecated; the replacement is Helia - please see the State of IPFS in JS post for a bit of background and the migration guide for how to port your app over.

When running a network query to publish IPNS records, a number of sub-queries are run to find KAD-close peers - individual sub-queries can fail with a timeout or other error without the whole query being aborted, but if they all fail then the whole query will fail.
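Roughly, the semantics are like this (an illustrative sketch only, not the actual kad-dht code):

  async function runQuery (subQueries) {
    const results = await Promise.allSettled(subQueries)
    const ok = results.filter(r => r.status === 'fulfilled')

    // individual sub-queries may time out; the overall query only fails when all of them do
    if (ok.length === 0) {
      throw new AggregateError(results.map(r => r.reason), 'All promises were rejected')
    }

    return ok.map(r => r.value)
  }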

Unfortunately the set of peers you will query depends on the KAD-ID of the key you are trying to publish so it's very hard to replicate in a deterministic manner.

What will happen though, is that some peers will be harder to dial than others as they might be publishing private (e.g. un-dialable) addresses or they might not support the right set of network transports.

Helia ships with the latest transports and is where fixes will be deployed from now on, so could you please upgrade your app and see if the problem persists?