girder / large_image

Python modules to work with large multiresolution images.
http://girder.github.io/large_image/
Apache License 2.0

Enhance TileSource API to support analysis of all tiles of a whole-slide image #75

Closed cdeepakroy closed 7 years ago

cdeepakroy commented 7 years ago

This issue involves enhancing the interface of the TileSource class to support the analysis of all tiles in a whole-slide image with a simplified interface.

To start a discussion, I have summarized below the requirements I have in mind:

cdeepakroy commented 7 years ago

Below is a code snippet that segments nuclei and extracts features from all tiles using OpenSlide and HistomicsTK, to give an idea of where things can be simplified:

from histomicstk.preprocessing import color_deconvolution as cd
from histomicstk.preprocessing import color_normalization as cn
from histomicstk import features as ft
from histomicstk.segmentation import label as lb
from histomicstk import segmentation as sg
import histomicstk.utils.TilingSchedule as ts
import histomicstk.utils.ConvertSchedule as cs
import histomicstk.utils as ut

import nucleicuts as nc
import numpy as np
import openslide as os
import os.path as pth
import scipy as sp
import scipy.io
import scipy.misc
import skimage.io as io
import sys
import time

def process_slide(Input, Output, Magnification=20, T=4096, Sigma=25,
                  SigmaMin=4*(2**0.5), SigmaMax=7*(2**0.5), r=10,
                  MinArea=20, MinWidth=5, Background=1e-4,
                  Smoothness=1e-4):
    """
    """

    #define color normalization and deconvolution parameters
    W = np.array([[0.650, 0.072, 0],
                  [0.704, 0.990, 0],
                  [0.286, 0.105, 0]])
    TargetMu = np.array([8.63234435, -0.11501964, 0.03868433])
    TargetSigma = np.array([0.57506023, 0.10403329, 0.01364062])

    # parse filename of slide into path and slide name
    Path, Slide = pth.split(Input)
    Path = Path + "/" 
    Sample = Slide[0:Slide.index('.')]

    # check if slide can be opened
    try:
        I = os.OpenSlide(Input)
    except os.OpenSlideError:
        sys.stdout.write("Cannot find file '" + Input + "'")
        # return
    except os.OpenSlideUnsupportedFormatError:
        sys.stdout.write("Slide format not supported. "
                         "Consult OpenSlide documentation")
        # return

    # generate tiling of slide
    Schedule = ts.TilingSchedule(Input, Magnification, T)

    # convert tiling schedule to low-resolution for tissue mapping
    lrSchedule = cs.ConvertSchedule(Schedule, 1.25)
    lrHeight = I.level_dimensions[lrSchedule.Level][1]
    lrWidth = I.level_dimensions[lrSchedule.Level][0]
    LR = I.read_region((0, 0), lrSchedule.Level, (lrWidth, lrHeight))
    LR = np.asarray(LR)
    LR = LR[:, :, :3]
    if lrSchedule.Factor != 1.0:
        LR = sp.misc.imresize(LR, lrSchedule.Factor)
        lrHeight = LR.shape[0]
        lrWidth = LR.shape[1]
    Mask = sg.SimpleMask(LR)    
    MaskHeight = lrSchedule.Tout * lrSchedule.Y.shape[0]
    MaskWidth = lrSchedule.Tout * lrSchedule.X.shape[1]
    LRMask = np.zeros((MaskHeight, MaskWidth), dtype=np.uint8)
    LRMask[0:lrHeight, 0:lrWidth] = Mask

    # update console
    sys.stdout.write("Processing %s at %f magnification, resizing factor %f\n"
                     % (Slide, Magnification, Schedule.Factor))

    # define normalization parameters
    SourceColor = cn.ReinhardSample(Input, Magnification, 0.05, T)

    # initialize containers to store results
    Features = []  # features
    bX = []  # boundary coordinates
    bY = []
    cX = []  # centroids
    cY = []
    Elapsed = []  # timing data

    # iterate through tiles, normalizing each and segmenting them
    for i in np.arange(Schedule.X.shape[0]):
        for j in np.arange(Schedule.X.shape[1]):

            # check tissue coverage
            Mask = LRMask[
                i * lrSchedule.Tout:(i + 1) * lrSchedule.Tout,
                j * lrSchedule.Tout:(j + 1) * lrSchedule.Tout].astype(np.uint8)
            TissueCount = sum(Mask.flatten().astype(np.float)) / \
                (lrSchedule.Tout**2)
            if TissueCount < 0.1:
                continue

            # update console
            sys.stdout.write("\tProcessing tile (%d, %d) " +
                             "(X = %06.0f, Y = %06.0f), %d of %d\n"
                             % (i, j, Schedule.X[i, j], Schedule.Y[i, j],
                                i*Schedule.X.shape[1]+j+1, Schedule.X.size))

            # start preprocessing timer
            Start = time.time()

            # read in color tile
            Tile = I.read_region((int(Schedule.X[i, j]),
                                  int(Schedule.Y[i, j])),
                                 Schedule.Level,
                                 (Schedule.Tout, Schedule.Tout))
            Tile = np.asarray(Tile)
            Tile = Tile[:, :, :3]

            # resize tile if necessary
            if Schedule.Factor != 1.0:
                Tile = sp.misc.imresize(Tile, Schedule.Factor,
                                        interp='bicubic')

            # normalize color of tile
            Tile = cn.ReinhardNorm(Tile, TargetMu, TargetSigma,
                                    SourceColor.Mu, SourceColor.Sigma)

            # deconvolve tile into hematoxylin and eosin
            Unmixed = cd.ColorDeconvolution(Tile, W)
            Hematoxylin = Unmixed.Stains[:, :, 0].astype(np.uint8)
            Eosin = Unmixed.Stains[:, :, 1].astype(np.uint8)

            # get pre-processing time
            StopPreprocess = time.time()

            # perform minimum model segmentation on hematoxylin channel
            Label, Refined, Seeds = nc.nuclear_cut(Hematoxylin, Sigma=Sigma,
                                                   SigmaMin=SigmaMin,
                                                   SigmaMax=SigmaMax, r=r,
                                                   MinArea=MinArea,
                                                   MinWidth=MinWidth,
                                                   Background=Background,
                                                   Smoothness=Smoothness)

            # condense label image
            Refined = lb.CondenseLabel(Refined)

            # get segmentation time
            StopSegmentation = time.time()              

            # extract features from label image
            TileFeatures = ft.FeatureExtraction(Refined, Hematoxylin, Eosin,
                                                K=128, Fs=6, Delta=8)

            # extract names from tile features
            Names = list(TileFeatures.columns)

            # extract contours from label image and add offsets
            X, Y = lb.TraceLabel(Refined, Connectivity=8)
            X = [x+Schedule.X[i, j] for x in X]
            Y = [y+Schedule.Y[i, j] for y in Y]
            bX += X
            bY += Y

            # extract centroids and add offsets
            cX.append(TileFeatures.values[:,0] + Schedule.X[i, j])
            cY.append(TileFeatures.values[:,1] + Schedule.Y[i, j])

            # get feature extraction time
            StopExtraction = time.time()

            # append features, centroids, boundaries to lists
            Features.append(TileFeatures.values[:, 2:])

            # record elapsed time and update console
            Elapsed.append(np.array([StopPreprocess-Start,
                                     StopSegmentation-StopPreprocess,
                                     StopExtraction-StopSegmentation]))
            sys.stdout.write("\t\t\tPreprocessing: %f sec, segmentation:" + \
                             " %f sec, feature extraction: %f sec\n"
                             % (StopPreprocess-Start,
                                StopSegmentation-StopPreprocess,
                                StopExtraction-StopSegmentation))

            # write out tile image with boundaries embedded
            Bounds = sg.EmbedBounds(Tile, lb.LabelPerimeter(Refined),
                                    Color=[0, 255, 255])
            io.imsave(Output + Sample + ".%06.0f.%06.0f.tif" % \
                      (Schedule.X[i, j], Schedule.Y[i, j]), Bounds)

    # concatenate tile results for feature arrays, centroids and timings
    Features = np.vstack(Features)
    cX = np.hstack(cX)
    cY = np.hstack(cY)
    Elapsed = np.vstack(Elapsed)

    # perform qc on features to remove nan, inf values
    Discard = np.unique(np.nonzero(~np.isfinite(Features))[0])
    Features = np.delete(Features, Discard, 0)
    cX = np.delete(cX, Discard)
    cY = np.delete(cY, Discard)
    bX = [x for index, x in enumerate(bX) if index not in Discard]
    bY = [y for index, y in enumerate(bY) if index not in Discard]

    # calculate ranges over features and number of objects
    N = Features.shape[0]
    Mins = np.min(Features, axis=0)
    Maxes = np.max(Features, axis=0)

    # merge colinear points for boundary compression
    for i in np.arange(len(bX)):
        bX[i], bY[i] = ut.MergeColinear(bX[i], bY[i])

    # save features, boundaries as text
    SegFile = open(Output + Slide[0:-4] + '.seg.txt', mode='w')
    SegFile.write("Slide\tX\tY\tScanMag\tAnalysisMag\t")
    for i in np.arange(2, len(Names)):
        SegFile.write(Names[i] + "\t")
    SegFile.write("Boundaries: x1,y1;x2,y2;...\n")
    for i in np.arange(Features.shape[0]):
        Row = Slide[0:-4] + "\t%.2f\t%.2f\t%.0f\t%.0f\t" % \
            (cX[i], cY[i], Schedule.Magnification /
             Schedule.Scale, Magnification)
        Row += ''.join(['%f\t' % x for x in Features[i,:]])
        Row += ''.join(['%d,%d;' % (x, y) for x,y in zip(bX[i], bY[i])])
        Row += '\n'
        SegFile.write(Row)
    SegFile.close()

    # convert types and write outputs to matlab file
    sp.io.savemat(Output + Slide[0:-4] + '.features.mat',
                  {'bX': np.array(bX, dtype=object),
                   'bY': np.array(bY, dtype=object),
                   'Features': Features, 'N': N, 'Mins': Mins, 'Maxes': Maxes,
                   'Names': Names, 'Elapsed': Elapsed})

We want to simplify this code as much as possible.
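
For instance, if the TileSource exposed an iterator that yields one dictionary per tile, most of the bookkeeping above (TilingSchedule/ConvertSchedule, the low-resolution mask alignment, the manual read_region offsets and resizing) could collapse to something like the sketch below. The getTileSource/tileIterator names, their parameters, and the dictionary keys are hypothetical and only illustrate the shape of the interface under discussion:

# Hypothetical sketch: 'getTileSource', 'tileIterator', their parameters, and
# the keys of the yielded dictionary are illustrative, not an existing API.
source = large_image.getTileSource(Input)

for tile in source.tileIterator(magnification=Magnification,
                                tile_size=dict(width=T, height=T),
                                format='numpy'):
    Tile = tile['tile'][:, :, :3]   # RGB pixels, already at the requested scale
    X, Y = tile['x'], tile['y']     # tile origin in base-layer coordinates

    # the analysis itself is unchanged: normalize, deconvolve, segment, extract
    Tile = cn.ReinhardNorm(Tile, TargetMu, TargetSigma,
                           SourceColor.Mu, SourceColor.Sigma)
    Unmixed = cd.ColorDeconvolution(Tile, W)
    # ... segmentation and feature extraction proceed exactly as before,
    # using tile['x'] and tile['y'] as the offsets for boundaries and centroids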

cdeepakroy commented 7 years ago

CC: @cooperlab @manthey @jbeezley @zachmullen @dgutman

manthey commented 7 years ago
cdeepakroy commented 7 years ago

I like the idea of using a dictionary, rather than a tuple, as the return value of the iterator. It keeps things flexible.

I prefer getting the cropped image if I want to, say, detect nuclei in all tiles. But maybe there is a use case for getting the un-cropped image along with a mask indicating which pixels are valid; I can't think of one right now, and callers can always check the size and pad the image if needed.
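
As a concrete example of the "check the size and pad" option, a cropped edge tile can be padded back to the nominal tile size while keeping a boolean mask of the pixels that actually came from the slide. A minimal sketch, where tile is assumed to be the cropped numpy array returned for a right/bottom edge tile and T the nominal tile size:

import numpy as np

def pad_to_tile_size(tile, T):
    # pad a cropped edge tile with zeros up to T x T and return a boolean mask
    # marking which pixels came from the slide (True) versus padding (False)
    h, w = tile.shape[:2]
    padded = np.zeros((T, T) + tile.shape[2:], dtype=tile.dtype)
    padded[:h, :w] = tile
    valid = np.zeros((T, T), dtype=bool)
    valid[:h, :w] = True
    return padded, valid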

We may not support this now, but I want to put it out there for the sake of discussion: we might need a separate iterator for overlapping tiles. With non-overlapping tiles, a cell that falls on a tile boundary will pretty much go undetected.
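
To illustrate the overlap point, an overlapping-tile iterator only changes the stride at which tile origins are generated, so an object cut by one tile boundary shows up whole in a neighbouring tile. A minimal sketch in base-layer pixel coordinates (the function and its parameters are illustrative only):

def tile_origins(width, height, tile_size, overlap=0):
    # yield (x, y) origins of tiles that overlap their neighbours by
    # 'overlap' pixels; overlap=0 reproduces the usual non-overlapping grid
    stride = tile_size - overlap
    for y in range(0, height, stride):
        for x in range(0, width, stride):
            yield x, y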

add/delete layers: this might be useful if we want to, say, increase or decrease the number of layers in the image pyramid and re-store the images. It is not a priority now, but it would be good to have.

cooperlab commented 7 years ago

@manthey @cdeepakroy

Some general comments - enabling access by resolution/magnification instead of layer would go a long way toward making life easier. There are some tricky choices to make in this case, like how close is close enough for magnification. We should talk these through as a group. We also need to pick a convention for coordinates. In OpenSlide, (x, y) is always in reference to the base layer (the native scanned layer). This can get annoying, but the alternatives can too. No matter what, I say we stick with pixel coordinates instead of percentages or physical coordinates (microns); otherwise fractional pixels and numerical precision come into play.
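
On the "how close is close enough" question, one possible convention is to pick the native level whose magnification is nearest to the request on a log2 scale (pyramid levels usually differ by powers of two) and report the residual factor to resize by, which is essentially what TilingSchedule/ConvertSchedule in the snippet above do by hand. A minimal sketch, assuming the source can report the magnification of each level:

import numpy as np

def closest_level(level_magnifications, requested_mag):
    # level_magnifications: e.g. [40.0, 20.0, 10.0, 5.0], one entry per level
    # pick the level nearest the request on a log2 scale and return the
    # residual factor the returned pixels should be resized by
    mags = np.asarray(level_magnifications, dtype=float)
    level = int(np.argmin(np.abs(np.log2(mags) - np.log2(requested_mag))))
    return level, requested_mag / mags[level]

For example, a 25x request against levels at [40, 20, 10, 5] maps to the 20x level with a resize factor of 1.25.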

Histomics will almost always want buffers. That's not to say we should not enable PIL. I just don't have a use for it.

re iterator: for tiles at the right/bottom edge, OpenSlide has some internal definition of a blank value that it uses to fill pixels outside the image, out to the dimensions requested. I'm agnostic on how this should work - if you don't return the out-of-bounds area, then we need to do some additional math when indexing the image linearly. HistomicsTK should not assume square images anywhere, so this should be fine. If you do return these blank areas, then we need to provide some awareness of which portions are in bounds so the blank values don't bias computations.

re add/delete layers: I don't think we need to cover this now. If someone is generating a whole-slide image then there are some other problems to address, like compression and selection of format.

re tiles at each resolution: we need an easy way to request the same region at different resolutions. This alone will make life so much easier.

re regions: I agree that getRegion works well.
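
For requesting the same region at several resolutions, the convenient pattern is to give the region once in base-layer pixel coordinates and vary only the requested magnification. The keyword arguments below are hypothetical and only sketch the calling pattern, not a confirmed getRegion signature:

# Hypothetical calling pattern: 'region' and 'magnification' keyword
# arguments are illustrative, not a confirmed getRegion signature.
region = dict(left=20000, top=15000, width=4096, height=4096)

overview = source.getRegion(region=region, magnification=1.25)  # tissue map
analysis = source.getRegion(region=region, magnification=20.0)  # nuclei detection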

cooperlab commented 7 years ago

@cdeepakroy defining overlap for tiles is important for future plans. Let's bake this in now.

My feeling is that Histomics functions should never assume a square image. There may be people using this to analyze their desktop light-microscope screen captures, etc. We should probably check this in testing. If that holds, then returning a cropped image is fine.

cdeepakroy commented 7 years ago

Based on the discussion, I think we can start by implementing the following:

Those two features by themselves will make the code snippet I pasted above much shorter.

@manthey Do you want to take this up? I think you are the most familiar with the TileSource classes in large_image.

I will then modify functions in HistomicsTK to use it.

manthey commented 7 years ago

@cdeepakroy I will work on it.

manthey commented 7 years ago

Added in PR #77.