Closed AshtonStephens closed 7 months ago
@AshtonStephens @wileyj GM. I can check this issue as well, since I've worked on this event-replay implementation. Thanks!
It's not the end of the world that the event replay takes 16 GB, but it would be really nice if it didn't. I think you could have pandas write the files out incrementally as it goes, as opposed to in one fell swoop.
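The incremental-write suggestion could look something like the sketch below. This is illustrative only, not the API's actual code: the function name, the row-dict input shape, and the CSV output format are all assumptions; the point is that each chunk is flushed to disk and dropped, so peak memory is bounded by the chunk size rather than the full dataset.

```python
# Hypothetical sketch: stream rows to disk in fixed-size chunks instead of
# building one giant DataFrame in memory and writing it all at once.
import pandas as pd


def export_in_chunks(row_iter, out_path, chunk_size=100_000):
    """Append rows from an iterable of dicts to a CSV, flushing every chunk_size rows."""
    first = True  # write the header only for the first chunk
    buffer = []
    for row in row_iter:
        buffer.append(row)
        if len(buffer) >= chunk_size:
            pd.DataFrame(buffer).to_csv(out_path, mode="a", header=first, index=False)
            first = False
            buffer.clear()  # free the chunk before reading more rows
    if buffer:  # flush the final partial chunk
        pd.DataFrame(buffer).to_csv(out_path, mode="a", header=first, index=False)
```

The same pattern works for parquet via `pyarrow.parquet.ParquetWriter`; either way, memory stays proportional to `chunk_size` instead of the whole archive.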
The goal behind this event-replay implementation was to make it fast, so there is a tradeoff between compute-resource usage and speed. Previous versions took days to finish.
Some improvements that were made:
- The insert batch size for the `txs` table was reduced. This reduces the number of parameters being passed to PostgreSQL.

To validate those changes, the file https://archive.hiro.so/testnet/stacks-blockchain-api/testnet-stacks-blockchain-api-latest.gz was used, and the event-replay process finished successfully on an Apple M1 Max with 64 GB of RAM.
The suggestions above will be taken into consideration for future improvements to the event-replay process. Thanks.
@AshtonStephens @wileyj please feel free to reach out if anything else is needed.
Describe the bug
The Stacks API event-replay procedure cannot complete.
To Reproduce
Steps to reproduce the behavior:
Below is a script that does the majority of what I did, minus some initial installation. I did not check whether this script runs as-is, but it is fully representative of what I did, including the environment. The only difference is that I used a separate fork with two changes, which I elaborate on below.
What you'll likely see:
One bug I found in this process is here: https://github.com/hirosystems/stacks-blockchain-api/blob/develop/src/event-replay/parquet-based/importers/new-block-importer.ts#L87, where the API should not be batching 1400 txs. Each tx appears to expand into more than 46 parameters to the SQL database, so the batch exceeds PostgreSQL's maximum parameter count of 65534. Below is the error; changing the batch size on the line I highlighted to 500 fixes it for the time being.
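As a sanity check on that arithmetic, assuming 47 parameters per tx row (one more than the "46" above, which is enough to overflow):

```python
# PostgreSQL caps a single prepared statement at 65534 bind parameters
# (the parameter count is a signed 16-bit field in the wire protocol).
PG_MAX_PARAMS = 65534
params_per_row = 47  # assumed: "more than 46" parameters per inserted tx

# Largest batch that still fits under the cap.
max_safe_batch = PG_MAX_PARAMS // params_per_row
print(max_safe_batch)         # 1394 -- so a batch of 1400 cannot fit
print(1400 * params_per_row)  # 65800, over the 65534 cap
print(500 * params_per_row)   # 23500, comfortably under it
```

This is why dropping the batch size to 500 makes the error go away: any batch of at most 1394 rows would fit, and 500 leaves headroom even if the per-row parameter count grows.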
There are some other issues as well, some with duckdb, but those went away after upgrading to version 0.10.0 (latest) of duckdb. Maybe that's still a problem, but the program progressed after I upgraded, so I suspect it's fine. But once all the other errors went away, we now get this error:
The program has 29 GB of RAM available to it on a 32 GB machine. I could get a 64 GB machine going, but at this point I suspect something else is going wrong. It might make sense to include the event-replay procedure in some local testing steps so that this is easier to run when Nakamoto releases.
Expected behavior
We should be able to run the API and ingest the event archive in the way listed in the docs.
Additional context
This is needed to run part of a potential Nakamoto debugging environment, and is the only current part of the network that is failing to start up. It would be great if we could get this fixed in the very near future.