hubmapconsortium / files-api

A RESTful service for getting information about and registering files
https://files.api.hubmapconsortium.org
MIT License

Epic - Demonstration Project Request: characterize data files based on content #21

Open AlanSimmons opened 1 year ago

AlanSimmons commented 1 year ago

Request Summary

Provide API endpoints to allow high-level characterization of large data files without the need to download the entire file. The initial implementation will support two use cases: .raw files from LC-MS assays and .hd5seurat files from scRNA-Seq assays.

Request Context

Project scope

API support for HuBMAP Demonstration Projects

Requestors

Alexandra Naba and Yu (Tom) Gao, University of Illinois Chicago (UIC)

Demonstration Project

Profiling the Extracellular Matrix (ECM)/matrisome

Statement of Problem

Researchers wish to select subsets of data files in HuBMAP datasets related to LC-MS and scRNA-Seq assays, using criteria based on content at specific locations in the files. In general, these files are binary and have no published specifications (by design, for confidentiality); however, they are structured such that the data useful for high-level characterization is stored in known, specific locations--e.g., within a header block, or within the first 1028 bytes.

Currently, to characterize data files for selection, researchers manually download entire files using the Data Portal and Globus File Manager user interfaces. The image at the end of this document illustrates this workflow. The manual workflow is both laborious and bandwidth-intensive: researchers must download and process large files locally.

A better solution would involve calling an API endpoint that returns characterization information from the data files without downloading the files locally.
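
As a minimal sketch of the idea (the URL below is a placeholder, and real access would need Globus authentication), such a service could read only the bytes it needs with an HTTP Range request:

# Minimal sketch: fetch just a file's header bytes with an HTTP Range
# request instead of downloading the whole file. The URL is a
# placeholder; a real request would carry Globus auth credentials.
import requests

FILE_URL = "https://example.org/raw_data/sample.raw"  # hypothetical
HEADER_BYTES = 152  # header size used by the RAW parsing script below

response = requests.get(
    FILE_URL,
    headers={"Range": f"bytes=0-{HEADER_BYTES - 1}"},
    timeout=30,
)
response.raise_for_status()
# Status 206 (Partial Content) means the server honored the Range header.
print(f"HTTP {response.status_code}: fetched {len(response.content)} bytes")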

Solution options

This appears to be a potential enhancement of the files-api. The files-api currently searches an ElasticSearch index of file information. It may be possible to parse additional characterization information for specific file types (e.g., .raw and .hd5) and add this information to responses.
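
As an illustration (every field name below is hypothetical, not the actual files-api schema), a file document enriched with parsed characterization fields might look like:

# Illustrative only: a file-index document enriched with fields parsed
# from the file itself. All field names here are hypothetical.
enriched_file_doc = {
    "file_uuid": "ffffffff-ffff-ffff-ffff-ffffffffffff",  # placeholder
    "rel_path": "Proteomics/raw_data/sample.raw",
    "file_extension": "raw",
    # New characterization fields parsed from the file header:
    "raw_signature_valid": True,
    "raw_checksum_tag": "0A1B2C3D",
}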

Dr. Gao has agreed to provide source for Python scripts that can parse the binary data files for characterization information.

Notes

  1. We may determine that the files-api is not the right home for the endpoint.
  2. Enhancing the files-api with this endpoint would require re-generating the ElasticSearch index. Deployment may therefore take an iterative approach--i.e., first a prospective version that characterizes a single file on demand (see the sketch after this list), then a retrospective version built into the indexing.
  3. The UIC team plans to ask researchers at other institutions for additional input--i.e., whether there might be any additional ways to characterize data files based on similar forms of parsing. However, because we already have two clear and relatively constrained use cases (.raw files for LC-MS; .hd5seurat files for scRNA-Seq), we will start with the UIC team's requests.
  4. We anticipate that these kinds of characterization features will be of general application. The work to satisfy the UIC team's requests may stimulate further interest from other research groups that work with different types of assays.
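
To make note 2 concrete, here is a hedged sketch of what a prospective single-file endpoint could look like. The route, parameter names, and lookup helper are hypothetical, and the real files-api may be structured quite differently; the checksum function referenced is the one from Tom's script later in this thread.

# Hypothetical sketch of a "prospective" endpoint that characterizes one
# file on demand. Route, parameter, and helper names are illustrative.
from flask import Flask, abort, jsonify

app = Flask(__name__)


def lookup_path_by_uuid(file_uuid):
    # Placeholder: a real implementation would resolve the file's
    # storage path from the existing ElasticSearch file index.
    return None


@app.route("/files/<file_uuid>/characterization", methods=["GET"])
def characterize_file(file_uuid):
    file_path = lookup_path_by_uuid(file_uuid)
    if file_path is None:
        abort(404, description="File not found in index")
    if file_path.endswith(".raw"):
        # read_adler32_checksum is from Tom's script (raw_head_reader.py).
        from raw_head_reader import read_adler32_checksum
        return jsonify({
            "file_uuid": file_uuid,
            "file_type": "thermo_raw",
            "checksum_tag": read_adler32_checksum(file_path),
        })
    abort(400, description="Unsupported file type for characterization")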

[Image: diagram of the manual workflow researchers currently use to download and characterize data files]

AlanSimmons commented 1 year ago

RAW file header extraction code

Tom provided a prototype parsing script that obtains header information from a RAW file. The script runs on Python 3 with no external package dependencies.

As I understand it, this basic functionality would need to be called from code that also uses the files-api and integrates with Globus for authentication, etc.

From Tom's email (he provides a public RAW file as sample input):

The correct behavior is to output two identical checksums from a RAW file in under one second.

Usage: python3 raw_head_reader.py [path_to_raw_file]

Example raw file: https://app.globus.org/file-manager?origin_id=af603d86-eab9-4eec-bb1d-9d26556741bb&origin_path=%2F96cf9f5d33a48e7e61e1ee00ad282b8a%2FProteomics%2Fraw_data%2F

HuBMAP public/96cf9f5d33a48e7e61e1ee00ad282b8a/Proteomics/raw_data/VAN0027-RK-1-1_5.raw

Source:

# raw_head_extractor.raw_head_reader created by bathy at 1/3/2023
from zlib import adler32
import struct

# The checksum covers the first 10 MiB of the file: the 152-byte header
# plus this many data bytes.
BLOCKSIZE = 10485760 - 152


def calc_adler32(filename):
    """Recompute the Adler-32 checksum over the header (with its stored
    checksum field zeroed out) and the first data block."""
    asum = 0
    with open(filename, 'rb') as f:
        # Read the 152-byte header and zero its last 4 bytes, which hold
        # the stored checksum, so they do not affect the recomputation.
        header = f.read(152)
        header_list = list(header)
        header_list[-4:] = [0, 0, 0, 0]
        header = bytearray(header_list)
        asum = adler32(header, asum)
        # Continue the running checksum over the first data block.
        data = f.read(BLOCKSIZE)
        asum = adler32(data, asum)
        if asum < 0:
            # Defensive: zlib.adler32 is already unsigned on Python 3.
            asum += 2 ** 32
    # Swap the byte order so the result matches the little-endian
    # checksum stored in the file header.
    rev_asum = struct.unpack('<L', struct.pack('>L', asum))[0]
    # Format as eight uppercase hex digits.
    return hex(rev_asum)[2:10].zfill(8).upper()


def read_adler32_checksum(raw_file):
    """Return the checksum stored in the header of a Thermo RAW file,
    after verifying the file signature."""
    with open(raw_file, 'rb') as file_raw:
        file_header = file_raw.read(152)
        signature = file_header[:18]
        checksum = file_header[-4:]
    # The signature is 0x01 0xA1 followed by "Finnigan" in UTF-16LE.
    if signature == b'\x01\xA1\x46\x00\x69\x00\x6E\x00\x6E\x00\x69\x00\x67\x00\x61\x00\x6E\x00':
        return ''.join(format(n, '02X') for n in checksum)
    else:
        return 'Not Thermo Raw File'


if __name__ == '__main__':
    import sys
    import os

    if len(sys.argv) < 2:
        print("Usage: python3 raw_head_reader.py [path_to_raw_file]")
    else:
        file_path = str(sys.argv[1])
        if os.path.exists(file_path):
            print("Your input file checksum tag is: %s" % read_adler32_checksum(file_path))
            print("The checksum of your file is: %s" % calc_adler32(file_path))

AlanSimmons commented 1 year ago

After discussion with @shirey, we think that it may be possible to achieve this by enhancing the existing ElasticSearch index for files, used by the files-api. If discrete data elements from the RAW and hd5 files are needed, we can add them to the index.

Because we are nearing the limit on the number of attributes that we can include in ES indexes, we would want to keep the set of additional file attributes small.
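
As an illustration of what that could look like, here is a hedged sketch that adds a small, fixed set of characterization fields to the files index via the ElasticSearch mapping API. The cluster address, index name, and field names are all assumptions, not the actual files-api schema.

# Hedged sketch: extend the existing files index with a few
# characterization fields. Index name, field names, and cluster
# address are illustrative assumptions.
import requests

ES_URL = "http://localhost:9200"  # placeholder cluster address
INDEX = "files"                   # hypothetical index name

new_properties = {
    "properties": {
        "raw_checksum_tag": {"type": "keyword"},
        "raw_signature_valid": {"type": "boolean"},
        "hd5_data_points": {"type": "long"},
    }
}

# PUT <index>/_mapping adds new fields to an existing index mapping.
resp = requests.put(f"{ES_URL}/{INDEX}/_mapping", json=new_properties, timeout=30)
print(resp.status_code, resp.text)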

AlanSimmons commented 1 year ago

We validated that the sample script can obtain data from a local RAW file. Tom offered to provide a script that obtains specific data elements.

From Tom's email:

  1. Mass spec instrument method
  2. LC instrument method
  3. Running parameters of the raw file
  4. Some other statistics of the data
  5. For HD5 data, we will need the data structure of the file (its database schema) and the number of data points (see the sketch below)
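
For item 5, a hedged sketch of what such a script might report, assuming the .hd5seurat files are standard HDF5 containers readable with h5py (the actual extraction script is Tom's to provide):

# Hedged sketch: walk an HDF5 container and report its structure plus
# the total number of data points. Assumes .hd5seurat files are
# standard HDF5; the internal layout of the file is not assumed.
import sys

import h5py


def summarize_hdf5(path):
    structure = []    # (name, kind, shape) entries describing the layout
    total_points = 0

    def visit(name, obj):
        nonlocal total_points
        if isinstance(obj, h5py.Dataset):
            structure.append((name, "dataset", obj.shape))
            total_points += obj.size
        else:
            structure.append((name, "group", None))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return structure, total_points


if __name__ == "__main__":
    layout, n_points = summarize_hdf5(sys.argv[1])
    for name, kind, shape in layout:
        print(f"{kind:7s} {name} {shape if shape is not None else ''}")
    print(f"Total data points: {n_points}")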