girder / large_image

Python modules to work with large multiresolution images.
http://girder.github.io/large_image/
Apache License 2.0

Enhance TileSource API to support analysis of all tiles of a whole-slide image #75

Closed cdeepakroy closed 7 years ago

cdeepakroy commented 7 years ago

This issue involves enhancing the interface of the TileSource class to support the analysis of all tiles in a whole-slide image with a simplified interface.

To start a discussion, I have summarized below the requirements I have in mind:

cdeepakroy commented 7 years ago

Below is a code snippet that segments nuclei and extracts features from all tiles using OpenSlide and HistomicsTK, to give an idea of where things can be simplified:

from histomicstk.preprocessing import color_deconvolution as cd
from histomicstk.preprocessing import color_normalization as cn
from histomicstk import features as ft
from histomicstk.segmentation import label as lb
from histomicstk import segmentation as sg
import histomicstk.utils.TilingSchedule as ts
import histomicstk.utils.ConvertSchedule as cs
import histomicstk.utils as ut

import nucleicuts as nc
import numpy as np
import openslide as os
import os.path as pth
import scipy as sp
import scipy.io
import scipy.misc
import skimage.io as io
import sys
import time

def process_slide(Input, Output, Magnification=20, T=4096, Sigma=25,
                  SigmaMin=4*(2**0.5), SigmaMax=7*(2**0.5), r=10,
                  MinArea=20, MinWidth=5, Background=1e-4,
                  Smoothness=1e-4):
    """
    """

    #define color normalization and deconvolution parameters
    W = np.array([[0.650, 0.072, 0],
                  [0.704, 0.990, 0],
                  [0.286, 0.105, 0]])
    TargetMu = np.array([8.63234435, -0.11501964, 0.03868433])
    TargetSigma = np.array([0.57506023, 0.10403329, 0.01364062])

    # parse filename of slide into path and slide name
    Path, Slide = pth.split(Input)
    Path = Path + "/" 
    Sample = Slide[0:Slide.index('.')]

    # check if slide can be opened
    try:
        I = os.OpenSlide(Input)
    except os.OpenSlideError:
        sys.stdout.write("Cannot find file '" + Input + "'")
        # return
    except os.OpenSlideUnsupportedFormatError:
        sys.stdout.write("Slide format not supported. "
                         "Consult OpenSlide documentation")
        # return

    # generate tiling of slide
    Schedule = ts.TilingSchedule(Input, Magnification, T)

    # convert tiling schedule to low-resolution for tissue mapping
    lrSchedule = cs.ConvertSchedule(Schedule, 1.25)
    lrHeight = I.level_dimensions[lrSchedule.Level][1]
    lrWidth = I.level_dimensions[lrSchedule.Level][0]
    LR = I.read_region((0, 0), lrSchedule.Level, (lrWidth, lrHeight))
    LR = np.asarray(LR)
    LR = LR[:, :, :3]
    if lrSchedule.Factor != 1.0:
        LR = sp.misc.imresize(LR, lrSchedule.Factor)
        lrHeight = LR.shape[0]
        lrWidth = LR.shape[1]
    Mask = sg.SimpleMask(LR)    
    MaskHeight = lrSchedule.Tout * lrSchedule.Y.shape[0]
    MaskWidth = lrSchedule.Tout * lrSchedule.X.shape[1]
    LRMask = np.zeros((MaskHeight, MaskWidth), dtype=np.uint8)
    LRMask[0:lrHeight, 0:lrWidth] = Mask

    # update console
    sys.stdout.write("Processing %s at %f magnification, resizing factor %f\n"
                     % (Slide, Magnification, Schedule.Factor))

    # define normalization parameters
    SourceColor = cn.ReinhardSample(Input, Magnification, 0.05, T)

    # initialize containers to store results
    Features = []  # features
    bX = []  # boundary coordinates
    bY = []
    cX = []  # centroids
    cY = []
    Elapsed = []  # timing data

    # iterate through tiles, normalizing each and segmenting them
    for i in np.arange(Schedule.X.shape[0]):
        for j in np.arange(Schedule.X.shape[1]):

            # check tissue coverage
            Mask = LRMask[
                i * lrSchedule.Tout:(i + 1) * lrSchedule.Tout,
                j * lrSchedule.Tout:(j + 1) * lrSchedule.Tout].astype(np.uint8)
            TissueCount = sum(Mask.flatten().astype(np.float)) / \
                (lrSchedule.Tout**2)
            if TissueCount < 0.1:
                continue

            # update console
            sys.stdout.write("\tProcessing tile (%d, %d) " +
                             "(X = %06.0f, Y = %06.0f), %d of %d\n"
                             % (i, j, Schedule.X[i, j], Schedule.Y[i, j],
                                i*Schedule.X.shape[1]+j+1, Schedule.X.size))

            # start preprocessing timer
            Start = time.time()

            # read in color tile
            Tile = I.read_region((int(Schedule.X[i, j]),
                                  int(Schedule.Y[i, j])),
                                 Schedule.Level,
                                 (Schedule.Tout, Schedule.Tout))
            Tile = np.asarray(Tile)
            Tile = Tile[:, :, :3]

            # resize tile if necessary
            if Schedule.Factor != 1.0:
                Tile = sp.misc.imresize(Tile, Schedule.Factor,
                                        interp='bicubic')

            # normalize color of tile
            Tile = cn.ReinhardNorm(Tile, TargetMu, TargetSigma,
                                    SourceColor.Mu, SourceColor.Sigma)

            # deconvolve tile into hematoxylin and eosin
            Unmixed = cd.ColorDeconvolution(Tile, W)
            Hematoxylin = Unmixed.Stains[:, :, 0].astype(np.uint8)
            Eosin = Unmixed.Stains[:, :, 1].astype(np.uint8)

            # get pre-processing time
            StopPreprocess = time.time()

            # perform minimum model segmentation on hematoxylin channel
            Label, Refined, Seeds = nc.nuclear_cut(Hematoxylin, Sigma=Sigma,
                                                   SigmaMin=SigmaMin,
                                                   SigmaMax=SigmaMax, r=r,
                                                   MinArea=MinArea,
                                                   MinWidth=MinWidth,
                                                   Background=Background,
                                                   Smoothness=Smoothness)

            # condense label image
            Refined = lb.CondenseLabel(Refined)

            # get segmentation time
            StopSegmentation = time.time()              

            # extract features from label image
            TileFeatures = ft.FeatureExtraction(Refined, Hematoxylin, Eosin,
                                                K=128, Fs=6, Delta=8)

            # extract names from tile features
            Names = list(TileFeatures.columns)

            # extract contours from label image and add offsets
            X, Y = lb.TraceLabel(Refined, Connectivity=8)
            X = [x+Schedule.X[i, j] for x in X]
            Y = [y+Schedule.Y[i, j] for y in Y]
            bX += X
            bY += Y

            # extract centroids and add offsets
            cX.append(TileFeatures.values[:,0] + Schedule.X[i, j])
            cY.append(TileFeatures.values[:,1] + Schedule.Y[i, j])

            # get feature extraction time
            StopExtraction = time.time()

            # append features, centroids, boundaries to lists
            Features.append(TileFeatures.values[:, 2:])

            # record elapsed time and update console
            Elapsed.append(np.array([StopPreprocess-Start,
                                     StopSegmentation-StopPreprocess,
                                     StopExtraction-StopSegmentation]))
            sys.stdout.write("\t\t\tPreprocessing: %f sec, segmentation:" + \
                             " %f sec, feature extraction: %f sec\n"
                             % (StopPreprocess-Start,
                                StopSegmentation-StopPreprocess,
                                StopExtraction-StopSegmentation))

            # write out tile image with boundaries embedded
            Bounds = sg.EmbedBounds(Tile, lb.LabelPerimeter(Refined),
                                    Color=[0, 255, 255])
            io.imsave(Output + Sample + ".%06.0f.%06.0f.tif" % \
                      (Schedule.X[i, j], Schedule.Y[i, j]), Bounds)

    # concatenate tile results for feature arrays, centroids and timings
    Features = np.vstack(Features)
    cX = np.hstack(cX)
    cY = np.hstack(cY)
    Elapsed = np.vstack(Elapsed)

    # perform qc on features to remove nan, inf values
    Discard = np.unique(np.nonzero(~np.isfinite(Features))[0])
    Features = np.delete(Features, Discard, 0)
    cX = np.delete(cX, Discard)
    cY = np.delete(cY, Discard)
    bX = [x for index, x in enumerate(bX) if index not in Discard]
    bY = [y for index, y in enumerate(bY) if index not in Discard]

    # calculate ranges over features and number of objects
    N = Features.shape[0]
    Mins = np.min(Features, axis=0)
    Maxes = np.max(Features, axis=0)

    # merge colinear points for boundary compression
    for i in np.arange(len(bX)):
        bX[i], bY[i] = ut.MergeColinear(bX[i], bY[i])

    # save features, boundaries as text
    SegFile = open(Output + Slide[0:-4] + '.seg.txt', mode='w')
    SegFile.write("Slide\tX\tY\tScanMag\tAnalysisMag\t")
    for i in np.arange(2, len(Names)):
        SegFile.write(Names[i] + "\t")
    SegFile.write("Boundaries: x1,y1;x2,y2;...\n")
    for i in np.arange(Features.shape[0]):
        Row = Slide[0:-4] + "\t%.2f\t%.2f\t%.0f\t%.0f\t" % \
            (cX[i], cY[i], Schedule.Magnification /
             Schedule.Scale, Magnification)
        Row += ''.join(['%f\t' % x for x in Features[i,:]])
        Row += ''.join(['%d,%d;' % (x, y) for x,y in zip(bX[i], bY[i])])
        Row += '\n'
        SegFile.write(Row)
    SegFile.close()

    # convert types and write outputs to matlab file
    sp.io.savemat(Output + Slide[0:-4] + '.features.mat',
                  {'bX': np.array(bX, dtype=object),
                   'bY': np.array(bY, dtype=object),
                   'Features': Features, 'N': N, 'Mins': Mins, 'Maxes': Maxes,
                   'Names': Names, 'Elapsed': Elapsed})

We want to simplify this code as much as possible.
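
For instance, if the TileSource exposed an iterator that yields one dictionary per tile, most of the bookkeeping above (TilingSchedule/ConvertSchedule, the low-resolution mask alignment, the manual read_region offsets and resizing) could collapse to something like the sketch below. The getTileSource/tileIterator names, their parameters, and the dictionary keys are hypothetical and only illustrate the shape of the interface under discussion:

# Hypothetical sketch: 'getTileSource', 'tileIterator', their parameters, and
# the keys of the yielded dictionary are illustrative, not an existing API.
source = large_image.getTileSource(Input)

for tile in source.tileIterator(magnification=Magnification,
                                tile_size=dict(width=T, height=T),
                                format='numpy'):
    Tile = tile['tile'][:, :, :3]   # RGB pixels, already at the requested scale
    X, Y = tile['x'], tile['y']     # tile origin in base-layer coordinates

    # the analysis itself is unchanged: normalize, deconvolve, segment, extract
    Tile = cn.ReinhardNorm(Tile, TargetMu, TargetSigma,
                           SourceColor.Mu, SourceColor.Sigma)
    Unmixed = cd.ColorDeconvolution(Tile, W)
    # ... segmentation and feature extraction proceed exactly as before,
    # using tile['x'] and tile['y'] as the offsets for boundaries and centroids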

cdeepakroy commented 7 years ago

CC: @cooperlab @manthey @jbeezley @zachmullen @dgutman

manthey commented 7 years ago
cdeepakroy commented 7 years ago

I like the idea of using a dictionary, rather than a tuple, as the return value of the iterator. It keeps things flexible.

I prefer getting the cropped image if I want to, say, detect nuclei in all tiles. But maybe there is a use case for getting the un-cropped image along with a mask indicating which pixels are valid; I can't think of one right now, and callers can always check the size and pad the image if needed.
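
As a concrete example of the "check the size and pad" option, a cropped edge tile can be padded back to the nominal tile size while keeping a boolean mask of the pixels that actually came from the slide. A minimal sketch, where tile is assumed to be the cropped numpy array returned for a right/bottom edge tile and T the nominal tile size:

import numpy as np

def pad_to_tile_size(tile, T):
    # pad a cropped edge tile with zeros up to T x T and return a boolean mask
    # marking which pixels came from the slide (True) versus padding (False)
    h, w = tile.shape[:2]
    padded = np.zeros((T, T) + tile.shape[2:], dtype=tile.dtype)
    padded[:h, :w] = tile
    valid = np.zeros((T, T), dtype=bool)
    valid[:h, :w] = True
    return padded, valid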

We may not support this now, but I want to put it out there for the sake of discussion: we might need a separate iterator for overlapping tiles. With non-overlapping tiles, a cell that falls on a tile boundary will pretty much go undetected.
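
To illustrate the overlap point, an overlapping-tile iterator only changes the stride at which tile origins are generated, so an object cut by one tile boundary shows up whole in a neighbouring tile. A minimal sketch in base-layer pixel coordinates (the function and its parameters are illustrative only):

def tile_origins(width, height, tile_size, overlap=0):
    # yield (x, y) origins of tiles that overlap their neighbours by
    # 'overlap' pixels; overlap=0 reproduces the usual non-overlapping grid
    stride = tile_size - overlap
    for y in range(0, height, stride):
        for x in range(0, width, stride):
            yield x, y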

add/delete layers: this might be useful if we want to, say, increase or decrease the number of layers in the image pyramid and re-store the images. It is not a priority now, but it would be good to have.

cooperlab commented 7 years ago

@manthey @cdeepakroy

Some general comments - enabling access by resolution/magnification instead of layer would go a long way toward making life easier. There are some tricky choices to make in this case, like how close is close enough for magnification. We should talk these through as a group. We also need to pick a convention for coordinates. In OpenSlide, (x, y) is always in reference to the base layer (the native scanned layer). This can get annoying, but the alternatives can too. No matter what, I say we stick with pixel coordinates instead of percentages or physical coordinates (microns); otherwise fractional pixels and numerical precision come into play.
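
On the "how close is close enough" question, one possible convention is to pick the native level whose magnification is nearest to the request on a log2 scale (pyramid levels usually differ by powers of two) and report the residual factor to resize by, which is essentially what TilingSchedule/ConvertSchedule in the snippet above do by hand. A minimal sketch, assuming the source can report the magnification of each level:

import numpy as np

def closest_level(level_magnifications, requested_mag):
    # level_magnifications: e.g. [40.0, 20.0, 10.0, 5.0], one entry per level
    # pick the level nearest the request on a log2 scale and return the
    # residual factor the returned pixels should be resized by
    mags = np.asarray(level_magnifications, dtype=float)
    level = int(np.argmin(np.abs(np.log2(mags) - np.log2(requested_mag))))
    return level, requested_mag / mags[level]

For example, a 25x request against levels at [40, 20, 10, 5] maps to the 20x level with a resize factor of 1.25.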

Histomics will almost always want buffers. That's not to say we should not enable PIL. I just don't have a use for it.

re iterator: for tiles at the right/bottom edge, OpenSlide has some internal definition of a blank value that it uses to fill pixels outside the image, out to the dimensions requested. I'm agnostic on how this should work - if you don't return the out-of-bounds area, then we need to do some additional math when indexing the image linearly. HistomicsTK should not assume square images anywhere, so this should be fine. If you do return these blank areas, then we need to provide some awareness of which portions are in bounds so the blank values don't bias computations.

re add/delete layers: I don't think we need to cover this now. If someone is generating a whole-slide image then there are some other problems to address, like compression and selection of format.

re tiles at each resolution: we need an easy way to request the same region at different resolutions. This alone will make life so much easier.

re regions: I agree that getRegion works well.
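
For requesting the same region at several resolutions, the convenient pattern is to give the region once in base-layer pixel coordinates and vary only the requested magnification. The keyword arguments below are hypothetical and only sketch the calling pattern, not a confirmed getRegion signature:

# Hypothetical calling pattern: 'region' and 'magnification' keyword
# arguments are illustrative, not a confirmed getRegion signature.
region = dict(left=20000, top=15000, width=4096, height=4096)

overview = source.getRegion(region=region, magnification=1.25)  # tissue map
analysis = source.getRegion(region=region, magnification=20.0)  # nuclei detection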

cooperlab commented 7 years ago

@cdeepakroy defining overlap for tiles is important for future plans. Let's bake this in now.

My feeling is that Histomics functions should never assume a square image. There may be people using this to analyze their desktop light-microscope screen captures, etc. We should probably check this in testing. If that holds, then returning a cropped image is fine.

cdeepakroy commented 7 years ago

Based on the discussion, I think we can start by implementing the following:

Those two features by themselves will make the code snippet I pasted above much shorter.

@manthey Do you want to take this up? I think you are the most familiar with the TileSource classes in large_image.

I will then modify functions in HistomicsTK to use it.

manthey commented 7 years ago

@cdeepakroy I will work on it.

manthey commented 7 years ago

Added in PR #77.