Problem:
Customers need the ability to perform random reads, via a file system, on large genomics reference files without downloading the entire file, which increases cost and puts pressure on the storage account.
Solution:
If any TesInput.Streamable is set to true, the TES runner should download and install blobfuse2.
It should aggregate all of the container mounts and mount only the minimum set required via blobfuse2 mount.
It should ensure that the path specified by TesInput.path works.
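A minimal sketch of what the b2.yaml referenced below might contain. This config is not part of the PR; the account name, container, caching component, and SAS mode are illustrative assumptions (which caching component fits best depends on the runner's strategy), though the key names follow blobfuse2's base config layout:

```yaml
logging:
  type: syslog

components:
  - libfuse
  - block_cache     # assumption: block-level caching for random reads
  - attr_cache
  - azstorage

azstorage:
  type: block
  account-name: <storage-account>   # assumption: filled in by the runner
  container: inputs                 # assumption: one mount per required container
  mode: sas
  sas: <sas-token>                  # supplied at runtime, never committed
```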
I confirmed that random reads in blobfuse2 work as expected:
blobfuse2 mount /ref --config-file=./b2.yaml
dd if=stLFR.split_read.1.fq.gz skip=50000000000 bs=1 count=128 iflag=skip_bytes 2>/dev/null | xxd
#!/bin/bash
# Azure Blob URL - NOTE SAS has been removed
blob_url="https://mattmcl.blob.core.windows.net/inputs/stLFR.split_read.1.fq.gz"
# Byte range to download: Example uses the range from 50000000000 to 50000000127
range_start=50000000000
range_end=50000000127
# Using curl to download the specified byte range
curl -s -o downloaded_bytes.bin -H "Range: bytes=$range_start-$range_end" "$blob_url"
echo "From REST:"
# Display downloaded bytes in hex format for comparison
xxd downloaded_bytes.bin
echo "From blobfuse:"
# Optional: Compare with bytes extracted from the local file using dd
dd if=/ref/stLFR.split_read.1.fq.gz skip=$range_start bs=1 count=$((range_end - range_start + 1)) iflag=skip_bytes,count_bytes 2>/dev/null | xxd
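The dd invocation above hinges on iflag=skip_bytes,count_bytes, which makes skip and count byte offsets rather than block counts. A self-contained illustration of the same byte-range extraction on a local file (sample.bin is a throwaway name for this example):

```shell
#!/bin/bash
# Create a 26-byte sample file, then extract the 4 bytes at offset 10.
printf 'abcdefghijklmnopqrstuvwxyz' > sample.bin
dd if=sample.bin skip=10 bs=1 count=4 iflag=skip_bytes,count_bytes 2>/dev/null
# → klmn
```

The same pattern, pointed at the blobfuse2 mount, reads an arbitrary byte range without the file ever being fully downloaded.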