seung-lab / cloud-volume

Read and write Neuroglancer datasets programmatically.

Creating a precomputed cloudvolume from an HDF5 file #613

Open Magic-Ludo opened 1 month ago

Magic-Ludo commented 1 month ago

I have 3620 images in .tif format, totalling 1TB.

Here's their info:

Shape: (21796, 12876)
dtype: uint16

I'm looking to use Neuroglancer to visualize this dataset. To achieve this, I stacked all the .tif files into a single .h5 file with the following script:

import os
import h5py
import tifffile as tiff
from tqdm import tqdm

tif_dir = "/path/to/tif/folder/"
output_file = "/path/to/file.h5"

image_files = sorted(
    [os.path.join(tif_dir, f) for f in os.listdir(tif_dir) if f.endswith(".tif")]
)

first_image = tiff.imread(image_files[0])
img_shape = first_image.shape

with h5py.File(output_file, "w") as h5f:
    dtype = first_image.dtype

    # Create a dataset holding the TIFF stack, with chunking and compression
    dataset = h5f.create_dataset(
        "images",
        shape=(len(image_files),) + img_shape,
        dtype=dtype,
        chunks=(1, img_shape[0] // 10, img_shape[1] // 10),  # Example chunking
        compression="gzip"
    )

    # Write each TIFF slice into the HDF5 dataset; image_files already
    # holds full .tif paths, so there is no need to re-join or re-filter here
    for i, filepath in enumerate(tqdm(image_files, desc="Processing TIFF files")):
        image_data = tiff.imread(filepath)
        dataset[i, :, :] = image_data

The generated .h5 file is very large (900 GB) and therefore cannot be loaded into RAM. From what I've read, CloudVolume can convert it into the Neuroglancer precomputed format so it can be displayed chunk by chunk, on demand. I tried this with the following script:

import tifffile as tiff
import h5py
import os
from tqdm import tqdm
from cloudvolume import CloudVolume
from cloudvolume.lib import mkdir

tif_dir = "/path/to/tif/folder/"

image_files = sorted(
    [os.path.join(tif_dir, f) for f in os.listdir(tif_dir) if f.endswith(".tif")]
)

first_image = tiff.imread(image_files[0])
img_shape = first_image.shape

input_hdf5_file = "/path/to/file.h5"
dataset_name = "images"

output_dir = "file:///path/to/Neuroglancer/folder"
mkdir(output_dir.replace("file://", ""))

with h5py.File(input_hdf5_file, "r") as h5f:
    dataset = h5f[dataset_name]
    data_shape = dataset.shape
    dtype = dataset.dtype
    print(f"Dataset shape: {data_shape}")
    print(f"Dataset dtype: {dtype}")

    # Create a CloudVolume object for the Neuroglancer precomputed format
    vol = CloudVolume(
        output_dir,
        info={
            "type": "image",
            "data_type": str(dtype),
            "num_channels": 1,
            "scales": [
                {
                    "chunk_sizes": [1, img_shape[0] // 10, img_shape[1] // 10],
                    "encoding": "raw",
                    "key": "full_resolution",
                    "resolution": [1, 1, 1],
                    "size": list(data_shape),  # Neuroglancer expects (z, y, x) order
                }
            ],
            "voxel_offset": [0, 0, 0],
        },
    )
    vol.commit_info()
    print("CloudVolume info:")
    print(vol.info)

    # Write data to the Neuroglancer precomputed format
    for z in tqdm(range(data_shape[0]), desc="Converting to Neuroglancer format"):
        slice_data = dataset[z, :, :]
        print(f"slice_data shape: {slice_data.shape}")
        slice_data = slice_data.reshape((slice_data.shape[1], slice_data.shape[0], 1))
        vol[:, :, z] = slice_data

But when writing the slices out, I run into a size mismatch:

First image shape: (21796, 12876)
Dtype: uint16
Dataset shape: (3620, 21796, 12876)
Dataset dtype: uint16
CloudVolume info:
{'type': 'image', 'data_type': 'uint16', 'num_channels': 1, 'scales': [{'chunk_sizes': [1, 2179, 1287], 'encoding': 'raw', 'key': 'full_resolution', 'resolution': [1, 1, 1], 'size': [3620, 21796, 12876]}], 'voxel_offset': [0, 0, 0]}
Converting to Neuroglancer format:   0%|          | 0/3620 [00:02<?, ?it/s]
slice_data shape: (21796, 12876)

---------------------------------------------------------------------------
AlignmentError                            Traceback (most recent call last)
Cell In[18], line 65
     61         print(f"slice_data shape: {slice_data.shape}")
     62         slice_data = slice_data.reshape(
     63             (slice_data.shape[1], slice_data.shape[0], 1)
     64         )
---> 65         vol[:, :, z] = slice_data
     67 print("Conversion to Neuroglancer precomputed format completed.")

File ~/.conda/envs/SEG/lib/python3.10/site-packages/cloudvolume/frontends/precomputed.py:962, in CloudVolumePrecomputed.__setitem__(self, slices, img)
    959   imgshape = imgshape + [ self.num_channels ]
    961 if not np.array_equal(imgshape, slice_shape):
--> 962   raise exceptions.AlignmentError("""
    963     Input image shape does not match slice shape.
    964 
    965     Image Shape: {}  
    966     Slice Shape: {}
    967   """.format(imgshape, slice_shape))
    969 if self.autocrop:
    970   if not self.bounds.contains_bbox(bbox):

AlignmentError: 
        Input image shape does not match slice shape.

        Image Shape: [12876, 21796, 1, 1]  
        Slice Shape: [3620, 21796, 1, 1]

I've tried changing dimensions or reversing the values for each slice, but I always end up with a problem like this...
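
For reference, printing the volume shape that CloudVolume derives from my info (vol being the CloudVolume object from the script above) seems to show where that [3620, 21796, 1, 1] slice shape comes from:

# vol.shape is taken straight from the "size" field of the info, so
# CloudVolume treats 3620 (the number of slices) as the x extent,
# which is why vol[:, :, z] expects a (3620, 21796, 1, 1) block.
print(vol.shape)  # -> (3620, 21796, 12876, 1)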

Thanks for your help!

william-silversmith commented 1 month ago

Hi!

Thanks for writing in. The first thing I would try, since you seem to be inserting these slices in the XY plane, is to set the chunk size to [1024, 1024, 1]; currently it is a single pixel thick in the x direction. Second, maybe try a transpose (slice_data.T) instead of a reshape, followed by an np.squeeze. CloudVolume expects 4D arrays, so add two trivial dimensions to the end:

import numpy as np

slice_data = np.squeeze(slice_data.T)
while slice_data.ndim < 4:
    slice_data = slice_data[..., np.newaxis]
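
Putting both together, a rough, untested sketch of the whole write loop might look like the following. It builds the info with CloudVolume.create_new_info, assumes the volume is sized x=12876 (width), y=21796 (height), z=3620 (number of slices), and reuses your paths and the "images" dataset name:

import h5py
import numpy as np
from tqdm import tqdm
from cloudvolume import CloudVolume

# Info with x/y/z ordering and chunks suited to XY-plane writes.
info = CloudVolume.create_new_info(
    num_channels=1,
    layer_type="image",
    data_type="uint16",
    encoding="raw",
    resolution=[1, 1, 1],              # placeholder; use your real voxel size in nm
    voxel_offset=[0, 0, 0],
    volume_size=[12876, 21796, 3620],  # x (width), y (height), z (slice count)
    chunk_size=[1024, 1024, 1],
)

vol = CloudVolume("file:///path/to/Neuroglancer/folder", info=info)
vol.commit_info()

with h5py.File("/path/to/file.h5", "r") as h5f:
    dataset = h5f["images"]  # stored as (z, y, x)
    for z in tqdm(range(dataset.shape[0]), desc="Converting to precomputed"):
        slice_data = dataset[z, :, :].T                        # (y, x) -> (x, y)
        slice_data = slice_data[..., np.newaxis, np.newaxis]   # -> (x, y, 1, 1)
        vol[:, :, z] = slice_data

Since each chunk is one voxel thick in z and every write covers the full XY extent, each vol[:, :, z] assignment should stay chunk-aligned.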

Let me know if this helps.