microsoft / ga4gh-tes

C# implementation of the GA4GH TES API; provides distributed batch task execution on Microsoft Azure
MIT License

Use blobfuse2 for streamable TesInputs #692

Open MattMcL4475 opened 4 months ago

MattMcL4475 commented 4 months ago

Problem: Customers need to perform random reads through a file system on large genomics reference files without downloading the entire file. A full download costs more and puts pressure on the storage account.

Solution:

I confirmed that random reads in blobfuse2 work as expected:

blobfuse2 mount /ref --config-file=./b2.yaml
dd if=stLFR.split_read.1.fq.gz skip=50000000000 bs=1 count=128 iflag=skip_bytes 2>/dev/null | xxd


#!/bin/bash

# Azure Blob URL - NOTE SAS has been removed
blob_url="https://mattmcl.blob.core.windows.net/inputs/stLFR.split_read.1.fq.gz" 

# Byte range to download: Example uses the range from 50000000000 to 50000000127
range_start=50000000000
range_end=50000000127

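# Using curl to download the specified byte range.
# --fail makes curl exit non-zero on an HTTP error (for example a missing SAS)
# instead of saving the error response body as if it were blob data.
curl -s --fail -o downloaded_bytes.bin -H "Range: bytes=$range_start-$range_end" "$blob_url"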
echo "From REST:"
# Display downloaded bytes in hex format for comparison
xxd downloaded_bytes.bin
echo "From blobfuse:"
# Optional: Compare with bytes extracted from the local file using dd
dd if=/ref/stLFR.split_read.1.fq.gz skip=$range_start bs=1 count=$((range_end - range_start + 1)) iflag=skip_bytes,count_bytes 2>/dev/null | xxd
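
The comparison above is done by eye on two hex dumps. A minimal sketch of how it could be automated follows. It assumes GNU dd, as the script above already does. `read_range` and `demo_file` are hypothetical names introduced here for illustration; they are not part of blobfuse2 or ga4gh-tes.

```shell
#!/bin/bash

# read_range FILE START END
# Print bytes START..END (zero-based, inclusive) of FILE to stdout.
# Hypothetical helper; relies on GNU dd's skip_bytes/count_bytes flags so that
# skip and count are byte offsets while bs can stay large.
read_range() {
  local file=$1 start=$2 end=$3
  dd if="$file" bs=64K skip="$start" count=$((end - start + 1)) \
     iflag=skip_bytes,count_bytes 2>/dev/null
}

# Try the helper on a small local file, so no mount or network is needed.
demo_file=$(mktemp)
printf '0123456789abcdefghij' > "$demo_file"
read_range "$demo_file" 10 13   # prints: abcd
echo
rm -f "$demo_file"

# Against the mount from the script above, the byte-for-byte check could be:
#   cmp <(read_range /ref/stLFR.split_read.1.fq.gz "$range_start" "$range_end") \
#       downloaded_bytes.bin && echo "ranges match"
```

`cmp` exits non-zero on the first differing byte, so the check can gate a test script without anyone reading hex output.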