keep-starknet-strange / raito

Bitcoin ZK client written in Cairo.
https://raito.wtf
MIT License
40 stars 34 forks

[feat] Create a script to split, index and access per block utxo data dump #208

Closed maciejka closed 1 month ago

maciejka commented 1 month ago

Context

The script generate_data.py requires information about the previous UTXOs referenced by transaction inputs. Currently this is implemented as a series of getrawtransaction and getblockheader requests to a Bitcoin node. This is inefficient: it takes >30s to process a single block, which means that accessing the full history would take weeks.

In order to speed this process up, a query against the Google BigQuery public Bitcoin dataset was created:

SELECT 
  inputs.block_number block_number,
  array_agg(
    struct(
      outputs.transaction_hash as txid, 
      outputs.index as vout,
      outputs.value,
      outputs.script_hex as pk_script,
      outputs.block_number as block_height,
      txs.is_coinbase
    )
  ) as outputs
FROM `bigquery-public-data.crypto_bitcoin.inputs` as inputs
JOIN `bigquery-public-data.crypto_bitcoin.outputs` as outputs 
  ON outputs.transaction_hash = inputs.spent_transaction_hash
  AND outputs.index = inputs.spent_output_index
JOIN `bigquery-public-data.crypto_bitcoin.transactions` as txs
  ON txs.hash = inputs.spent_transaction_hash
JOIN `bigquery-public-data.crypto_bitcoin.blocks` as blocks
  ON blocks.number = outputs.block_number
GROUP BY block_number
ORDER BY block_number

This gives us the per-block information required by the script.

The data, one block's data in JSON format per line, was exported to Cloud Storage. There are 772143 rows spread across 1732 files, each under 1 GB. Example: 000000000000.json; the last file name is 000000001731.json.
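The dump file names above are just a zero-padded twelve-digit counter with a `.json` suffix; a tiny helper to generate them (a sketch, the function name is hypothetical) could look like:

```python
def dump_file_name(index: int) -> str:
    # Dump files use a zero-padded 12-digit index, e.g. 000000000000.json.
    return f"{index:012d}.json"

print(dump_file_name(0))     # 000000000000.json
print(dump_file_name(1731))  # 000000001731.json
```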

The task

Your task is to create a Python script that implements the steps described below.

Details

Download and Split

For each data dump file:

  1. download the file from GCS
  2. create a directory with a name corresponding to the name of the file
  3. use the unix split command to break the file into chunks; the chunks should be placed in the directory created in the previous step, split by number of lines (the number of lines should be an easily changeable parameter), e.g.: split -l 10 utxos_000000000049.json

Processing all files is not in scope. Just make sure it works on a couple of files. We will run it on a machine in the cloud.
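The download-and-split steps above can be sketched in Python. The bucket path below is hypothetical (the issue does not state the GCS location), and `gsutil` is assumed to be installed and authenticated:

```python
import subprocess
from pathlib import Path

# Hypothetical bucket path; the real GCS location is not given in the issue.
BUCKET = "gs://example-bucket/utxo-dump"

def download(file_name: str) -> None:
    # Requires an installed, authenticated gsutil.
    subprocess.run(["gsutil", "cp", f"{BUCKET}/{file_name}", file_name], check=True)

def split_file(file_name: str, lines_per_chunk: int = 10) -> Path:
    """Split a dump file into chunks of `lines_per_chunk` lines, placed in a
    directory named after the file (mirrors `split -l 10 <file>`)."""
    out_dir = Path(file_name).with_suffix("")  # directory named after the file
    out_dir.mkdir(exist_ok=True)
    # Pass the directory as the chunk-name prefix so chunks land inside it.
    subprocess.run(
        ["split", "-l", str(lines_per_chunk), file_name, f"{out_dir}/"],
        check=True,
    )
    return out_dir
```

Keeping `lines_per_chunk` as a parameter makes the chunk size easy to change, as the issue asks.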

Index

You need to:

  1. load each chunk created in the previous step
  2. update an in-memory block number -> chunk name map
  3. save the map as a JSON file
  4. add consistency checks (there should be only one chunk per block; any other ideas are welcome)
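A minimal sketch of the indexing step, assuming each chunk line is a JSON object with a `block_number` field as in the query output above (function and file names are illustrative):

```python
import json
from pathlib import Path

def build_index(dump_dirs: list, index_path: Path = Path("index.json")) -> dict:
    """Build a block_number -> chunk file map and save it as JSON.
    Consistency check: no block may appear in more than one chunk."""
    index = {}
    for dump_dir in dump_dirs:
        for chunk in sorted(Path(dump_dir).iterdir()):
            for line in chunk.read_text().splitlines():
                block_number = json.loads(line)["block_number"]
                # there should be only one chunk per block
                assert block_number not in index, f"duplicate block {block_number}"
                index[block_number] = str(chunk)
    index_path.write_text(json.dumps(index))
    return index
```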

get_utxo_set function

Given a block number:

  1. use the index file to locate the corresponding chunk; assume the index file is available in the filesystem
  2. if the chunk is not present, execute download and split
  3. locate the corresponding line in the chunk
  4. return the parsed JSON data
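The lookup steps above can be sketched as follows, under the same assumptions about the data layout (note that integer map keys become strings after a JSON round-trip):

```python
import json
from pathlib import Path

def get_utxo_set(block_number: int, index_path: Path = Path("index.json")) -> dict:
    """Return the parsed per-block UTXO data for `block_number`.
    Assumes the index file is available in the filesystem."""
    index = json.loads(index_path.read_text())
    chunk = Path(index[str(block_number)])  # JSON keys are strings
    # If the chunk were missing here, the download-and-split step would be rerun.
    for line in chunk.read_text().splitlines():
        data = json.loads(line)
        if data["block_number"] == block_number:
            return data
    raise KeyError(block_number)
```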

Time constraints

This task is on the critical path and is blocking other tasks; please do not volunteer if you can't complete it in 2-3 days.

fishonamos commented 1 month ago

Kindly assign @maciejka. Would love to take it up.

maciejka commented 1 month ago

@fishonamos please note that processing all files is not in scope. Just make sure it works on a couple of files. We will run it on a machine in the cloud.