qupath / qupath

QuPath - Open-source bioimage analysis for research
https://qupath.github.io
GNU General Public License v3.0

Exporting measurements is too slow when there are large numbers of objects & measurements #1045

Open petebankhead opened 2 years ago

petebankhead commented 2 years ago

Bug report

Describe the bug MeasurementExporter is unacceptably slow when there are large numbers of objects and measurements.

(Although, as we shall see, it's not entirely its fault...)

To Reproduce Steps to reproduce the behavior:

  1. Run cell detection on a large region (generating >100k cells)
  2. Run an export script that should be limited to export just one measurement per cell, e.g. following https://forum.image.sc/t/qupath-extremely-slow-exporting-detection-measurements/71154/6
  3. Predict how long it should take (a second or two?)
  4. Be disappointed and confused (possibly)

Expected behavior Exporting hundreds of thousands of measurements takes a matter of seconds.


Additional context The discussion behind this is at https://forum.image.sc/t/qupath-extremely-slow-exporting-detection-measurements/71154

Investigating revealed a few issues:

The first is easy to address, although may not help much.

The second can also be addressed by excluding columns earlier. The third may be trickier, but is needed to help in cases where a full table should be exported.
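To illustrate what "excluding columns earlier" could look like, here is a minimal, self-contained Java sketch. It is not QuPath's code: the class `EarlyColumnFilterSketch`, the `export` method, the `extractors` map and the `valuesComputed` counter are all hypothetical names introduced for illustration. The idea is only that the list of columns is narrowed before any object is visited, so values for excluded columns are never computed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from QuPath.
class EarlyColumnFilterSketch {

    /** Counts how many cell values were computed, to make the cost visible. */
    static int valuesComputed = 0;

    /**
     * Build a delimited table, resolving the columns to export before touching any object.
     * 'extractors' maps every available column name to a function that computes its value.
     */
    static <T> String export(List<T> objects,
                             Map<String, Function<T, String>> extractors,
                             List<String> columnsToInclude,
                             String separator) {
        // Keep only the requested columns that actually exist, in the requested order
        List<String> columns = new ArrayList<>();
        for (String name : columnsToInclude) {
            if (extractors.containsKey(name)) {
                columns.add(name);
            }
        }

        StringBuilder sb = new StringBuilder();
        // Header row
        sb.append(String.join(separator, columns)).append('\n');
        // One row per object; only the retained columns are ever evaluated
        for (T obj : objects) {
            for (int i = 0; i < columns.size(); i++) {
                if (i > 0) {
                    sb.append(separator);
                }
                sb.append(extractors.get(columns.get(i)).apply(obj));
                valuesComputed++;
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Each "cell" here is just {area, perimeter}; a stand-in for a real object
        Map<String, Function<double[], String>> extractors = new LinkedHashMap<>();
        extractors.put("Area", v -> String.valueOf(v[0]));
        extractors.put("Perimeter", v -> String.valueOf(v[1]));

        List<double[]> cells = List.of(new double[] {1.5, 2.0}, new double[] {3.0, 4.0});
        System.out.print(export(cells, extractors, List.of("Area"), "\t"));
        System.out.println("Values computed: " + valuesComputed);
    }
}
```

With this structure the work grows with the number of requested columns rather than the number of available ones, which is the property the export script in the forum thread was expecting.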

petebankhead commented 2 years ago

Using OS-2.ndpi with ~150k cells, the following script requires 12-15 seconds on a Mac Studio:

import qupath.lib.gui.tools.MeasurementExporter
import qupath.lib.objects.PathCellObject

def project = getProject()
def imagesToExport = [getProjectEntry()]
def separator = "\t"

def columnsToInclude = new String[]{"Name", "Class", "Nucleus: Area"}
def exportType = PathCellObject.class
def outputPath = buildFilePath(PROJECT_BASE_DIR, getProjectEntry().getImageName() + ".tsv")
def outputFile = new File(outputPath)

def exporter  = new MeasurementExporter()
                  .imageList(imagesToExport)            // Images from which measurements will be exported
                  .separator(separator)                 // Character that separates values
                  .includeOnlyColumns(columnsToInclude) // Columns are case-sensitive
                  .exportType(exportType)               // Type of objects to export
                  .exportMeasurements(outputFile)        // Start the export process

print "Done!"

By contrast, the following exports something similar but takes 0.6-0.7 seconds:

// Some kind of file path for the current image
def name = getProjectEntry().getImageName()
name = GeneralTools.getNameWithoutExtension(name)
def path = buildFilePath(PROJECT_BASE_DIR, name + '.tsv')

def cells = getCellObjects()
def measurements = ['Nucleus: Area']

try (def writer = new PrintWriter(path)) {

    // Write header
    def sb = new StringBuilder()
    sb.append('Class')
    for (def measurementName in measurements) {
        sb.append('\t')
        sb.append(measurementName)
    }
    writer.println(sb.toString())

    // Write measurements
    for (def cell in cells) {
        sb.setLength(0)
        sb.append(cell.getPathClass())
        for (def measurementName in measurements) {
            sb.append('\t')
            sb.append(cell.getMeasurementList().getMeasurementValue(measurementName))
        }
        writer.println(sb.toString())
    }

}
println "Written to $path"

Some overhead is expected when using MeasurementExporter, but it should be reduced.

petebankhead commented 2 years ago

So the lack of a buffered stream is probably unimportant: digging deeper, I see that a PrintWriter is used, which involves some buffering (as far as I can tell). That may explain why I didn't spot any clear improvement when using a BufferedOutputStream.
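One way to check the buffering claim outside QuPath is the small Java sketch below. It assumes the writer is created with `new PrintWriter(File)`, which may not be the constructor MeasurementExporter actually uses, and the class and method names are hypothetical. It writes a short string and compares the file size on disk before and after the writer is closed; if the writer buffers, the first size should still be zero.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch; assumes a PrintWriter created directly from a File.
class PrintWriterBufferingSketch {

    /** Returns {bytes on disk before close, bytes on disk after close}. */
    static long[] sizesBeforeAndAfterClose(String text) throws IOException {
        Path path = Files.createTempFile("printwriter-sketch", ".tsv");
        try {
            long before;
            try (PrintWriter writer = new PrintWriter(path.toFile())) {
                writer.print(text);
                // No flush has been requested yet, so a buffering writer
                // should not have pushed this small write to disk
                before = Files.size(path);
            }
            // Leaving the try block closed the writer, which flushes any buffered text
            long after = Files.size(path);
            return new long[] {before, after};
        } finally {
            Files.deleteIfExists(path);
        }
    }

    public static void main(String[] args) throws IOException {
        long[] sizes = sizesBeforeAndAfterClose("Class\tNucleus: Area\n");
        System.out.println("Before close: " + sizes[0] + " bytes, after close: " + sizes[1] + " bytes");
    }
}
```

If the "before" size stays at zero for small writes, wrapping the underlying stream in a BufferedOutputStream would add a second buffer on top of one that already exists, which is consistent with seeing no clear speed-up.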

petebankhead commented 2 years ago

Upon further investigation, it's probably worth revising this command. The following methods do much the same thing:

For maintainability, we should try to figure out a way to reuse the same code.