spacetelescope / jwst

Python library for science observations from the James Webb Space Telescope
https://jwst-pipeline.readthedocs.io/en/latest/

High memory usage for Detector1Pipeline #8667

Open TheSkyentist opened 1 month ago

TheSkyentist commented 1 month ago

Summary:

The memory usage of the Detector1Pipeline is exceptionally high, nearly two orders of magnitude larger than the input image size, regardless of instrument/mode or input file size. I am using the JWST pipeline installed from HEAD and testing this on macOS and Linux.

Context:

I have been running on NIRISS IMAGE/WFSS data, which have input file sizes of ~100MB. I frequently use multiprocessing in cluster environments and regularly receive Out Of Memory job cancellations on nodes with 250GB of RAM. To see what was going on, I ran the Stage 1 Pipeline on a subset of my NIRISS data and examined the memory usage. I was surprised to find memory footprints approaching 8GB, nearly two orders of magnitude larger than the input file size. I decided to test this more robustly and see whether it holds for all JWST modes or just NIRISS. I have therefore done the following:

Investigation:

1) I queried a single uncal file for each unique JWST mode from MAST, sorting by most recent and restricting to publicly available data (a query sketch follows the file list below). All of my results use the following files:

MIRI-CORON: "jw01618063001_02101_00001_mirimage_uncal.fits"
MIRI-IFU: "jw03375021001_03104_00001-seg001_mirifushort_uncal.fits"
MIRI-IMAGE: "jw03730009001_03101_00001-seg004_mirimage_uncal.fits"
MIRI-SLIT: "jw04490001001_04101_00045_mirimage_uncal.fits"
MIRI-SLITLESS: "jw04498014001_03102_00001-seg001_mirimage_uncal.fits"
MIRI-TARGACQ: "jw03375021001_02101_00001-seg001_mirimage_uncal.fits"
NIRCAM-CORON: "jw03989010001_03106_00005_nrca2_uncal.fits"
NIRCAM-GRISM: "jw03538013004_02101_00002_nrcblong_uncal.fits"
NIRCAM-IMAGE: "jw03990558001_02201_00008_nrca2_uncal.fits"
NIRCAM-TARGACQ: "jw03989010001_03102_00001_nrcalong_uncal.fits"
NIRISS-AMI: "jw04498026001_03108_00001_nis_uncal.fits"
NIRISS-IMAGE: "jw04681444001_02201_00002_nis_uncal.fits"
NIRISS-SOSS: "jw04476003001_02101_00004_nis_uncal.fits"
NIRISS-WFSS: "jw04681446001_06201_00002_nis_uncal.fits"
NIRSPEC: "jw01465088001_02101_00010_nrs2_uncal.fits"
NIRSPEC-IFU: "jw06641003001_0218v_00001_nrs2_uncal.fits"
NIRSPEC-IMAGE: "jw03045018001_05101_00001_nrs2_uncal.fits"
NIRSPEC-MSA: "jw04291007001_03101_00001_nrs2_uncal.fits"
NIRSPEC-SLIT: "jw06456003001_04101_00001-seg003_nrs2_uncal.fits"

These also span quite a wide range of input sizes, from 172 KB to 1.4 GB.
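In case anyone wants to reproduce the sample selection, here is a minimal sketch using astroquery.mast (illustrative only, not the exact query I ran; the instrument_name value, column names, and filters are assumptions that may need adjusting per mode):

from astroquery.mast import Observations

# Find public JWST observations for one mode, newest release first
# (instrument_name strings follow the MAST convention, e.g. 'NIRISS/IMAGE', 'MIRI/IFU')
obs = Observations.query_criteria(
    obs_collection='JWST',
    instrument_name='NIRISS/IMAGE',
    dataRights='PUBLIC',
)
obs.sort('t_obs_release', reverse=True)

# List the products of the most recent observation and keep only the uncal files
products = Observations.get_product_list(obs[0])
uncal = Observations.filter_products(products, productSubGroupDescription='UNCAL')

# Download a single uncal file
Observations.download_products(uncal[:1])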

2) I ran the Detector 1 Pipeline on all of the files listed above. I kept track of the process ID and used the command line utility ps to track the memory usage of the JWST pipeline over time with a granularity of 1s. By comparing against the JWST pipeline log output, it is possible to attribute memory usage to each pipeline step, and I can create the following summary graph:

NIRISS-WFSS_short.pdf

The problem becomes even more severe for large input files; for example, this MIRI/IFU observation requires 80GB of RAM to run the Stage 1 Pipeline:

MIRI-IFU_short.pdf

Regardless of the mode tried, the memory requirement grows to nearly two orders of magnitude larger than the input model size.

After discussion with @jdavies-st, it appears that this is quite a bit higher than one might intuitively expect given the operations conducted within the Stage 1 Pipeline. Are there any strategies for mitigating the memory footprint? I am running the JWST pipeline over thousands of files and this is my main limitation.

I have attached the resultant plots from all of my runs in a .zip archive: short.zip

Code

#! /usr/bin/env python

import argparse
from jwst.pipeline import Detector1Pipeline

# Define the command line arguments
parser = argparse.ArgumentParser(description='Run the JWST pipeline on a FITS file')
parser.add_argument('file', type=str, help='Path to the FITS file')
args = parser.parse_args()
file = args.file

# Configure Detector 1 steps (don't save the trapsfilled file from the persistence step)
steps = dict(
    persistence=dict(
        save_trapsfilled=False  # Don't save trapsfilled file
    ),
)

# Run the pipeline
Detector1Pipeline.call(file, steps=steps)
#!/bin/bash

if [ "$#" -lt 2 ]; then
  echo "Usage: $0 <output_log_file> <command> [args...]"
  exit 1
fi

# Extract the output log file
OUTPUT_LOG_FILE=$1
shift

# Remove the output log file if it exists
if [ -f "$OUTPUT_LOG_FILE" ]; then
  rm "$OUTPUT_LOG_FILE"
fi

# Run the command in the background
echo "$@"
"$@" &
CMD_PID=$!

# File to store the memory usage data
MEM_FILE="$OUTPUT_LOG_FILE"
echo "#Timestamp,Memory_Usage" > "$MEM_FILE"

# Function to get memory usage of the process
get_memory_usage() {
  ps -o rss= -p $CMD_PID 2>/dev/null
}

# Capture memory usage at regular intervals
while kill -0 $CMD_PID 2>/dev/null; do
  TIMESTAMP=$(date "+%s.%N") # %N (nanoseconds) is a GNU date extension; BSD/macOS date prints a literal N
  MEM_USAGE=$(get_memory_usage)
  if [ -n "$MEM_USAGE" ]; then
    echo "$TIMESTAMP,$MEM_USAGE" >> "$MEM_FILE"
  fi
  sleep 1 # Adjust the interval as needed for finer or coarser granularity
done

wait $CMD_PID
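For completeness, a minimal plotting sketch (not the script that produced the attached PDFs) that turns the log written by the monitor above into a memory-vs-time curve; it assumes the two-column #Timestamp,Memory_Usage format with RSS in kibibytes from ps -o rss=:

#! /usr/bin/env python
import sys
import numpy as np
import matplotlib.pyplot as plt

# The header line starts with '#', so loadtxt treats it as a comment
t, rss_kb = np.loadtxt(sys.argv[1], delimiter=',', unpack=True)

# Plot seconds since start vs. resident memory in GiB
plt.plot(t - t[0], rss_kb / 1024**2)
plt.xlabel('Time since start [s]')
plt.ylabel('Resident memory [GiB]')
plt.title(sys.argv[1])
plt.savefig(sys.argv[1] + '.pdf')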
braingram commented 1 month ago

Thanks for the excellent detective work!

Similar to https://github.com/spacetelescope/jwst/issues/8668, I think one issue is a bug in the stpipe dependency; fixing it should (for these examples) bring the ending memory usage down. Would you be willing to test the stpipe change with one or a few of the examples here (perhaps the MIRI-IFU example is most useful)?

However, one thing to note is the (sometimes very) large size of the reference files. To use one example, running the jw04681446001_06201_00002_nis_uncal.fits file results in a large ~4 to 5 GB jump in memory during the dark step. This is largely explained by the ~4GB dark reference file https://jwst-crds.stsci.edu/browse/jwst_niriss_dark_0179.fits used for that step. The stpipe change linked above will have no impact on these transient peaks, but it should prevent the memory from steadily increasing.
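For anyone who wants to check this for their own data, a rough sketch (assuming a configured CRDS cache; note that this call may download the reference file) of looking up the selected dark reference and its size via stpipe's Step.get_reference_file:

import os
from jwst import datamodels
from jwst.dark_current import DarkCurrentStep

# Ask CRDS which dark reference applies to this exposure and check its size on disk
with datamodels.open('jw04681446001_06201_00002_nis_uncal.fits') as model:
    dark_path = DarkCurrentStep().get_reference_file(model, 'dark')

print(dark_path, f'{os.path.getsize(dark_path) / 1e9:.1f} GB')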

TheSkyentist commented 1 month ago

I have promising results. I am now pulling stpipe and jwst from HEAD since https://github.com/spacetelescope/stpipe/pull/171 was merged.

The upgrade to stpipe does reduce the overall memory usage. In many cases it brings the usage in line with expectations, with the Dark Current step representing the largest usage, for example in MIRI slit observations: MIRI-SLIT_short.pdf

However, in other cases the usage is still incredibly high. The MIRI-IFU usage is now down to 40GB (from 80GB), which is a significant improvement, but the Jump step still accounts for the majority of the usage. Perhaps this is still in line with expectations, but I would expect closer to a 5-10X memory footprint relative to the input image, rather than 30X. MIRI-IFU_short.pdf
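As a back-of-the-envelope check (my own rough arithmetic, not profiled pipeline internals): the uncal SCI array is stored as uint16 on disk but is promoted to float32 in memory, and the ramp carries a same-shaped GROUPDQ array, so a single in-memory copy is already several times the file size before any step makes additional copies.

from astropy.io import fits

# Read the ramp dimensions (nints, ngroups, ny, nx) from the SCI header
with fits.open('jw03375021001_03104_00001-seg001_mirifushort_uncal.fits') as hdul:
    hdr = hdul['SCI'].header
    nx, ny, ngroups, nints = (hdr[f'NAXIS{i}'] for i in (1, 2, 3, 4))

pixels = nints * ngroups * ny * nx
print(f'SCI on disk (uint16): ~{pixels * 2 / 1e9:.1f} GB')
print(f'One float32 copy of the ramp: ~{pixels * 4 / 1e9:.1f} GB')
print(f'GROUPDQ (uint8, same shape): ~{pixels * 1 / 1e9:.1f} GB')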

I've again attached a zip file of all of the single-run tests: short.zip

x12red commented 3 weeks ago

I can confirm that the memory usage of the pipeline is extremely high in Stage 1 at the moment. I am reducing NIRISS WFSS data right now and all of my processes get killed (I run the pipeline in parallel with a bash script). I tried running it over one single uncal file and the memory usage went up to 65 GB (for one single uncal file!), which makes parallelization impossible on my cluster. In fact, it also makes it impossible to use the pipeline on a normal machine. How should I tackle this?

TheSkyentist commented 3 weeks ago

Are you pulling stpipe and jwst from HEAD? 65GB is quite large for NIRISS WFSS, though I have seen it get that large when the input is large (>1GB). Luckily, the Stage 1 Pipeline currently runs relatively fast, so hopefully even running in serial is okay for your workflow.

I agree that this makes it untenable on a "normal" machine, so hopefully there are still improvements in memory usage coming soon.
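In the meantime, a minimal sketch of the workaround I would suggest (my own, not an official recommendation): cap the number of parallel workers based on the peak per-process usage you measure for your mode and the RAM available on the node.

#! /usr/bin/env python
import sys
from concurrent.futures import ProcessPoolExecutor
from jwst.pipeline import Detector1Pipeline

PEAK_GB_PER_PROCESS = 8   # assumption: measured peak for your mode (far higher for MIRI/IFU)
NODE_RAM_GB = 250         # assumption: RAM available on the node

def run_one(path):
    # Same call as in the script above
    Detector1Pipeline.call(path, steps={'persistence': {'save_trapsfilled': False}})

if __name__ == '__main__':
    files = sys.argv[1:]
    # Keep the budget conservative; transient peaks (e.g. large reference files) need extra headroom
    workers = max(1, NODE_RAM_GB // PEAK_GB_PER_PROCESS)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(run_one, files))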