Inconsistent samples between HEST-1k and HEST-bench

huangtinglin commented 3 weeks ago

I found that the number of spots for most samples in HEST-bench differs from that of the corresponding samples in HEST-1k. Take TENX141.h5, which is included in the LUNG task, as an example:

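import datasets
from huggingface_hub import snapshot_download
# Helper from the HEST codebase; this import path is an assumption and may differ between versions.
from hest.bench.utils.file_utils import read_assets_from_h5

local_dir = '/home/tinglin/hest_data'

# Download the HEST-1k copy of TENX141.
dataset = datasets.load_dataset(
    'MahmoodLab/hest',
    cache_dir=local_dir,
    patterns=['*TENX141.h5'],
)

# Download the HEST-bench copy of the same sample (LUNG task).
snapshot_download(
    repo_id='MahmoodLab/hest-bench',
    repo_type='dataset',
    local_dir='/home/tinglin/hest_data/bench',
    allow_patterns=['LUNG/patches/TENX141.h5'],
)

# Number of spots in the HEST-1k patch file.
hest1k_assets, _ = read_assets_from_h5(
    '/home/tinglin/hest_data/patches/TENX141.h5'
)
print(len(hest1k_assets['barcode']))  # 3069

# Number of spots in the HEST-bench patch file.
bench_assets, _ = read_assets_from_h5(
    '/home/tinglin/hest_data/bench/LUNG/patches/TENX141.h5'
)
print(len(bench_assets['barcode']))  # 3262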

Is this because the data has been updated? Which one should be taken for benchmarking?

pauldoucet commented 3 weeks ago

Hi @huangtinglin, the patches/*.h5 files only contain patches under tissue, which saves storage space. The patches/*.h5 files in the benchmark were generated with an older version of the tissue segmenter (see below).

For consistency with our paper, use the data from hest-bench when benchmarking. For improved tissue segmentation when training your model, prefer the data from hest (HEST-1k).

Old version of the tissue segmenter (bench data)

New version of the tissue segmenter (hest-1k data)
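To illustrate why the same sample can yield a different number of patches under two tissue masks, here is a toy sketch. It is not the HEST implementation: the function `spots_under_tissue`, the grid-based masks, and the spot coordinates are all made up for illustration, and the only assumption carried over from the explanation above is that a spot is kept when it lies under tissue.

```python
def spots_under_tissue(spots, mask):
    """Keep the (row, col) spots whose position is True in a boolean tissue mask.

    Hypothetical helper for illustration only; masks are lists of lists of bool.
    """
    kept = []
    for row, col in spots:
        inside = 0 <= row < len(mask) and 0 <= col < len(mask[0])
        if inside and mask[row][col]:
            kept.append((row, col))
    return kept


# Four made-up spot positions on a 3x3 grid.
spots = [(0, 0), (0, 2), (1, 1), (2, 2)]

# Two made-up tissue masks standing in for two segmenter versions.
mask_a = [
    [True, True, True],
    [True, True, True],
    [False, False, True],
]
mask_b = [
    [False, True, True],
    [True, True, True],
    [False, False, False],
]

print(len(spots_under_tissue(spots, mask_a)))  # 4
print(len(spots_under_tissue(spots, mask_b)))  # 2
```

The spots are identical in both calls; only the mask changes, which is enough to change the number of patches that end up in the file.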

guillaumejaume commented 3 weeks ago

@huangtinglin, small addition: the tissue segmentation (i.e., where the tissue is) is shown in green, but patching is only done on regions where transcripts were measured, which explains why not all tissue regions have a patch.
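To see which spots differ between the two files, rather than only how many, the barcodes can be compared as sets. The sketch below is a minimal example under assumptions: `normalize_barcode` and `compare_spots` are hypothetical helpers, and the `barcode` entries are assumed to be bytes or strings, possibly wrapped in single-element rows as in an (N, 1) array.

```python
def normalize_barcode(b):
    """Return a barcode as a plain str.

    Unwraps single-element containers (e.g. one row of an (N, 1) array)
    and decodes bytes. Hypothetical helper for illustration.
    """
    while not isinstance(b, (bytes, str)) and hasattr(b, "__len__") and len(b) == 1:
        b = b[0]
    if isinstance(b, bytes):
        return b.decode("utf-8")
    return str(b)


def compare_spots(barcodes_a, barcodes_b):
    """Summarize the overlap between two collections of spot barcodes."""
    a = {normalize_barcode(x) for x in barcodes_a}
    b = {normalize_barcode(x) for x in barcodes_b}
    return {
        "shared": len(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


# Made-up barcodes; in practice pass the "barcode" entries of the two files.
result = compare_spots(
    [b"AAA-1", b"CCC-1"],
    [["AAA-1"], ["GGG-1"], ["TTT-1"]],
)
print(result["shared"], len(result["only_a"]), len(result["only_b"]))
```

Spots listed under `only_a` or `only_b` are those that one file includes and the other leaves out.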

huangtinglin commented 3 weeks ago

Thanks for the clarification! That solves my problem. Do you guys plan to update the benchmark based on the updated data?

guillaumejaume commented 3 weeks ago

We may update the benchmark if we add samples to it. For the sake of simplicity and consistency, we will keep it as is for now. That said, you are welcome to use the updated samples in your own study.

Inconsistent samples between HEST-1k and HEST-bench #68
