Script generate_data.py requires information about the previous UTXOs referenced by tx inputs. Currently this is implemented as a series of getrawtransaction and getblockheader requests to a Bitcoin node. This is inefficient: it takes >30s to process a single block, which means that processing the full history would take weeks.
In order to speed this process up, a query against the public Google BigQuery Bitcoin dataset (bigquery-public-data.crypto_bitcoin) was created:
SELECT
  inputs.block_number block_number,
  array_agg(
    struct(
      outputs.transaction_hash as txid,
      outputs.index as vout,
      outputs.value,
      outputs.script_hex as pk_script,
      outputs.block_number as block_height,
      txs.is_coinbase
    )
  ) as outputs
FROM `bigquery-public-data.crypto_bitcoin.inputs` as inputs
JOIN `bigquery-public-data.crypto_bitcoin.outputs` as outputs
  ON outputs.transaction_hash = inputs.spent_transaction_hash
  AND outputs.index = inputs.spent_output_index
JOIN `bigquery-public-data.crypto_bitcoin.transactions` as txs
  ON txs.hash = inputs.spent_transaction_hash
JOIN `bigquery-public-data.crypto_bitcoin.blocks` as blocks
  ON blocks.number = outputs.block_number
GROUP BY block_number
ORDER BY block_number
This gives us the per-block information required by the script.
The data, one block's data in JSON format per line, was exported to Cloud Storage. There are 772,143 rows spread across 1,732 files, each under 1 GB. Example: the first file is named 000000000000.json; the last is 000000001731.json.
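Each line therefore carries one block's spent outputs in the shape implied by the query above; for illustration only (the values are made up, and the exact field encoding depends on the BigQuery JSON export):

{"block_number": 170, "outputs": [{"txid": "...", "vout": 0, "value": "5000000000", "pk_script": "...", "block_height": 9, "is_coinbase": true}]}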
The task
Your task is to create a Python script which will:
download and split the files into chunks of manageable size
create a block number -> chunk name index which allows locating a chunk quickly
create a Python function which, given a block number, returns the corresponding UTXO set
Details
Download and Split
For each data dump file:
download the file from GCS
create a directory with a name corresponding to the name of the file
use the Unix split command to break the file into chunks; the chunks should be placed in the directory created in the previous step, split by number of lines (the line count should be an easily changeable parameter), e.g. split -l 10 utxos_000000000049.json (a sketch follows the note below)
Processing all files is not in scope. Just make sure it works on a couple of files. We will run it on a machine in the cloud.
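A minimal sketch of this step, assuming the google-cloud-storage client, a hypothetical bucket name, and a local utxos_ prefix following the split example above (the real export location may differ):

import os
import subprocess
from google.cloud import storage

CHUNK_LINES = 10          # split size; easy to change in one place
BUCKET = "utxo-dumps"     # assumption: replace with the real export bucket

def download_and_split(file_index: int, lines: int = CHUNK_LINES) -> str:
    """Download one dump file from GCS and split it into line-based chunks."""
    name = f"{file_index:012d}.json"       # e.g. 000000000049.json
    local = f"utxos_{name}"                # e.g. utxos_000000000049.json
    out_dir = local.rsplit(".", 1)[0]      # directory named after the file
    os.makedirs(out_dir, exist_ok=True)

    # Fetch the blob; relies on application default credentials.
    storage.Client().bucket(BUCKET).blob(name).download_to_filename(local)

    # Equivalent of: split -l 10 utxos_000000000049.json, writing the
    # chunks into the per-file directory with a "chunk_" prefix.
    subprocess.run(
        ["split", "-l", str(lines), local, os.path.join(out_dir, "chunk_")],
        check=True,
    )
    return out_dir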
Index
You need to:
load each chunk created in the previous step
update an in-memory block number -> chunk name map
save the map as a JSON file
add consistency checks (there should be only one chunk per block; other ideas are welcome) — see the sketch below
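A sketch of the index build over the chunk layout produced above, assuming each line's block_number field identifies the block (per the query's output shape):

import glob
import json
import os

def build_index(root: str = ".", index_path: str = "index.json") -> dict:
    """Scan every chunk once and map block number -> chunk path."""
    index = {}
    for chunk in sorted(glob.glob(os.path.join(root, "utxos_*", "chunk_*"))):
        with open(chunk) as f:
            for line in f:
                block = json.loads(line)["block_number"]
                # Consistency check: each block must live in exactly one chunk.
                if block in index:
                    raise ValueError(f"block {block} found in both {index[block]} and {chunk}")
                index[block] = chunk
    with open(index_path, "w") as f:
        json.dump(index, f)   # integer keys are serialized as strings
    return index

Further checks could verify that block numbers are contiguous and that the total number of indexed blocks matches the 772,143 exported rows.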
get_utxo_set function
Given a block number:
use the index file to locate the corresponding chunk; assume the index file is available in the filesystem
if the chunk is not present, execute download and split
locate the corresponding line in the chunk
return the parsed JSON data (a sketch follows this list)
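A possible shape for the lookup, reusing download_and_split from the earlier sketch; recovering the dump file number from the chunk's directory name is an assumption about the layout above:

import json
import os

def get_utxo_set(block_number: int, index_path: str = "index.json") -> dict:
    """Return the parsed per-block UTXO data for block_number."""
    with open(index_path) as f:   # index file assumed present on disk
        index = json.load(f)      # json.dump stringified the integer keys
    chunk = index[str(block_number)]

    if not os.path.exists(chunk):
        # Re-create the missing chunk: the dump file number is encoded in
        # the chunk's directory name, e.g. utxos_000000000049/chunk_aa -> 49.
        file_index = int(os.path.basename(os.path.dirname(chunk)).split("_")[1])
        download_and_split(file_index)  # from the download-and-split sketch

    # Chunks are small (split -l), so a linear scan of the lines is cheap.
    with open(chunk) as f:
        for line in f:
            data = json.loads(line)
            if data["block_number"] == block_number:
                return data
    raise KeyError(f"block {block_number} not found in {chunk}")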
Time constraints
This task is on the critical path and is blocking other tasks; please do not volunteer if you can't complete it in 2-3 days.
@fishonamos please note that processing all files is not in scope. Just make sure it works on a couple of files. We will run it on a machine in the cloud.