Script generate_data.py requires information about the previous UTXOs referenced by tx inputs. Currently this is implemented as a series of getrawtransaction and getblockheader requests to a Bitcoin node. This is inefficient: it takes >30s to process a single block, which means that processing the full history would take weeks.
In order to speed this process up, a query against the public Google BigQuery Bitcoin dataset (bigquery-public-data.crypto_bitcoin) was created:
SELECT
  inputs.block_number block_number,
  array_agg(
    struct(
      outputs.transaction_hash as txid,
      outputs.index as vout,
      outputs.value,
      outputs.script_hex as pk_script,
      outputs.block_number as block_height,
      txs.is_coinbase
    )
  ) as outputs
FROM `bigquery-public-data.crypto_bitcoin.inputs` as inputs
JOIN `bigquery-public-data.crypto_bitcoin.outputs` as outputs
  ON outputs.transaction_hash = inputs.spent_transaction_hash
  AND outputs.index = inputs.spent_output_index
JOIN `bigquery-public-data.crypto_bitcoin.transactions` as txs
  ON txs.hash = inputs.spent_transaction_hash
JOIN `bigquery-public-data.crypto_bitcoin.blocks` as blocks
  ON blocks.number = outputs.block_number
GROUP BY block_number
ORDER BY block_number
This gives us the per-block information required by the script.
The data, one block's data in JSON format per line, was exported to Cloud Storage. There are 772,143 rows spread across 1,732 files, each under 1 GB. Example: the first file is named 000000000000.json; the last is 000000001731.json.
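Each line therefore carries one block's spent outputs in the shape implied by the query above; for illustration only (the values are made up, and the exact field encoding depends on the BigQuery JSON export):

{"block_number": 170, "outputs": [{"txid": "...", "vout": 0, "value": "5000000000", "pk_script": "...", "block_height": 9, "is_coinbase": true}]}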
The task
Your task is to create a Python script which will:
download and split the files into chunks of manageable size
create a block number -> chunk name index which allows locating a chunk quickly
create a Python function which, given a block number, returns the corresponding UTXO set
Details
Download and Split
For each data dump file:
download the file from GCS
create a directory with a name corresponding to the name of the file
use the Unix split command to break the file into chunks; the chunks should be placed in the directory created in the previous step, split by number of lines (the line count should be an easily changeable parameter), e.g. split -l 10 utxos_000000000049.json (a sketch follows the note below)
Processing all files is not in scope. Just make sure it works on a couple of files. We will run it on a machine in the cloud.
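A minimal sketch of this step, assuming the google-cloud-storage client, a hypothetical bucket name, and a local utxos_ prefix following the split example above (the real export location may differ):

import os
import subprocess
from google.cloud import storage

CHUNK_LINES = 10          # split size; easy to change in one place
BUCKET = "utxo-dumps"     # assumption: replace with the real export bucket

def download_and_split(file_index: int, lines: int = CHUNK_LINES) -> str:
    """Download one dump file from GCS and split it into line-based chunks."""
    name = f"{file_index:012d}.json"       # e.g. 000000000049.json
    local = f"utxos_{name}"                # e.g. utxos_000000000049.json
    out_dir = local.rsplit(".", 1)[0]      # directory named after the file
    os.makedirs(out_dir, exist_ok=True)

    # Fetch the blob; relies on application default credentials.
    storage.Client().bucket(BUCKET).blob(name).download_to_filename(local)

    # Equivalent of: split -l 10 utxos_000000000049.json, writing the
    # chunks into the per-file directory with a "chunk_" prefix.
    subprocess.run(
        ["split", "-l", str(lines), local, os.path.join(out_dir, "chunk_")],
        check=True,
    )
    return out_dir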
Index
You need to:
load each chunk created in the previous step
update an in-memory block number -> chunk name map
save the map as a JSON file
add consistency checks (there should be only one chunk per block; other ideas are welcome) — see the sketch below
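A sketch of the index build over the chunk layout produced above, assuming each line's block_number field identifies the block (per the query's output shape):

import glob
import json
import os

def build_index(root: str = ".", index_path: str = "index.json") -> dict:
    """Scan every chunk once and map block number -> chunk path."""
    index = {}
    for chunk in sorted(glob.glob(os.path.join(root, "utxos_*", "chunk_*"))):
        with open(chunk) as f:
            for line in f:
                block = json.loads(line)["block_number"]
                # Consistency check: each block must live in exactly one chunk.
                if block in index:
                    raise ValueError(f"block {block} found in both {index[block]} and {chunk}")
                index[block] = chunk
    with open(index_path, "w") as f:
        json.dump(index, f)   # integer keys are serialized as strings
    return index

Further checks could verify that block numbers are contiguous and that the total number of indexed blocks matches the 772,143 exported rows.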
get_utxo_set function
Given a block number:
use the index file to locate the corresponding chunk; assume the index file is available in the filesystem
if the chunk is not present, execute download and split
locate the corresponding line in the chunk
return the parsed JSON data (a sketch follows this list)
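A possible shape for the lookup, reusing download_and_split from the earlier sketch; recovering the dump file number from the chunk's directory name is an assumption about the layout above:

import json
import os

def get_utxo_set(block_number: int, index_path: str = "index.json") -> dict:
    """Return the parsed per-block UTXO data for block_number."""
    with open(index_path) as f:   # index file assumed present on disk
        index = json.load(f)      # json.dump stringified the integer keys
    chunk = index[str(block_number)]

    if not os.path.exists(chunk):
        # Re-create the missing chunk: the dump file number is encoded in
        # the chunk's directory name, e.g. utxos_000000000049/chunk_aa -> 49.
        file_index = int(os.path.basename(os.path.dirname(chunk)).split("_")[1])
        download_and_split(file_index)  # from the download-and-split sketch

    # Chunks are small (split -l), so a linear scan of the lines is cheap.
    with open(chunk) as f:
        for line in f:
            data = json.loads(line)
            if data["block_number"] == block_number:
                return data
    raise KeyError(f"block {block_number} not found in {chunk}")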
Time constraints
This task is on the critical path and is blocking other tasks; please do not volunteer if you can't complete it in 2-3 days.
@fishonamos please note that processing all files is not in scope. Just make sure it works on a couple of files. We will run it on a machine in the cloud.