jump-cellpainting / 2024_Chandrasekaran_NatureMethods

BSD 3-Clause "New" or "Revised" License


Image-based features from the cell images were extracted using CellProfiler and assembled as single cell profiles, which were aggregated, annotated, normalized and feature selected using pycytominer. Image-based features were also extracted using DeepProfiler which were annotated and spherized. The resulting profiles were analyzed using the notebooks in this repo. Steps for reproducing the data in this repository are outlined below.

Step 1: Download cell images

Cell images are available on an S3 bucket. The images can be downloaded using the command:

batch=<BATCH NAME>
aws s3 sync \
  --no-sign-request \
  s3://cellpainting-gallery/cpg0000-jump-pilot/source_4/images/${batch}/ . 

Here, ${batch} is one of the six batches listed below.
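As a minimal sketch, the sync command above can also be assembled from Python, one batch at a time. The helper name `build_sync_command` and its `dest` argument are illustrative and not part of this repository. The batch names are the six listed under Batch and Plate metadata.

```python
# Batch names as listed in this README.
BATCHES = [
    "2020_11_04_CPJUMP1",
    "2020_11_18_CPJUMP1_TimepointDay1",
    "2020_11_19_TimepointDay4",
    "2020_12_02_CPJUMP1_2WeeksTimePoint",
    "2020_12_07_CPJUMP1_4WeeksTimePoint",
    "2020_12_08_CPJUMP1_Bleaching",
]

S3_PREFIX = "s3://cellpainting-gallery/cpg0000-jump-pilot/source_4/images"


def build_sync_command(batch, dest="."):
    """Return the argv list for `aws s3 sync` of one batch (hypothetical helper)."""
    if batch not in BATCHES:
        raise ValueError(f"unknown batch: {batch!r}")
    return ["aws", "s3", "sync", "--no-sign-request", f"{S3_PREFIX}/{batch}/", dest]


if __name__ == "__main__":
    # Only prints the commands. To download, pass each list to
    # subprocess.run(cmd, check=True) after `import subprocess`.
    for batch in BATCHES:
        print(" ".join(build_sync_command(batch)))
```

Validating the batch name up front avoids starting a sync against a prefix that does not exist.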

You can test the download with a single file:

suffix=images/2020_11_04_CPJUMP1/images/BR00117010__2020-11-08T18_18_00-Measurement1/Images/r01c01f01p01-ch1sk1fk1fl1.tiff

aws s3 cp \
  --no-sign-request \
  s3://cellpainting-gallery/cpg0000-jump-pilot/source_4/${suffix} \
  .

Note: If you'd like to just browse the data, it's a lot easier to do so using a storage browser.

The following are various kinds of image-related or experiment-related metadata.

Batch and Plate metadata

There are six batches of data: 2020_11_04_CPJUMP1, 2020_11_18_CPJUMP1_TimepointDay1, 2020_11_19_TimepointDay4, 2020_12_02_CPJUMP1_2WeeksTimePoint, 2020_12_07_CPJUMP1_4WeeksTimePoint and 2020_12_08_CPJUMP1_Bleaching. Each batch contains either a single experiment or multiple experiments. Details about all the experimental conditions are available in the associated manuscript.

experimental-metadata.tsv contains all the experimental metadata for each plate in each batch. The following is the description of each of the columns in the file

Image metadata

The folder for each 384-well plate typically contains images from nine sites for each well (for some wells, 7, 8 or 16 sites were imaged). The (x,y) coordinates of the sites are available in the Metadata_PositionX and Metadata_PositionY columns of the load_data.csv.gz files in the load_data_csv folder. There are eight images per site (five from the fluorescent channels and three brightfield images). The names of the image files follow the naming convention rXXcXXfXXp01-chXXsk1fk1fl1.tiff where

Cell bounding boxes and segmentation masks have not been provided.
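The image file naming convention described above can be parsed with a regular expression. The sketch below is illustrative only. It assumes that r, c, f, p and ch stand for row, column, field (site), plane and channel, and that row 1 corresponds to plate row A. Neither assumption is spelled out in this section, and `parse_image_name` is a hypothetical helper.

```python
import re

# Assumed field meanings: r = row, c = column, f = field (site),
# p = plane, ch = channel. One or more digits are accepted per field,
# since the example file above uses "ch1" rather than two digits.
FILENAME_RE = re.compile(
    r"^r(?P<row>\d+)c(?P<col>\d+)f(?P<site>\d+)p(?P<plane>\d+)"
    r"-ch(?P<channel>\d+)sk\d+fk\d+fl\d+\.tiff$"
)


def parse_image_name(name):
    """Split an image file name into integer fields plus a well label."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    fields = {key: int(value) for key, value in match.groupdict().items()}
    # Assumption: row 1 maps to 'A', so r01c01 becomes well A01.
    fields["well"] = f"{chr(ord('A') + fields['row'] - 1)}{fields['col']:02d}"
    return fields


print(parse_image_name("r01c01f01p01-ch1sk1fk1fl1.tiff"))
```

Grouping the parsed names by (row, col, site) should yield the eight images per site mentioned above, which is a quick check that a plate downloaded completely.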

Plate map and Perturbation Metadata

Plate map and Metadata are available in the metadata/ folder and also from https://github.com/jump-cellpainting/JUMP-Target.

Step 2: Extract features using CellProfiler and DeepProfiler

After downloading the images, use the CellProfiler pipelines in pipelines/2020_11_04_CPJUMP1 and follow the instructions in the profiling handbook up until chapter 5.3 to generate the well-level aggregated CellProfiler profiles.

Instead of regenerating the CellProfiler features, they can also be downloaded from the S3 bucket

batch=<BATCH NAME>
aws s3 cp \
  --no-sign-request \
  --recursive \
  s3://cellpainting-gallery/cpg0000-jump-pilot/source_4/workspace/backend/${batch}/ . 

where ${batch} is one of the six batches mentioned above.

The .sqlite files contain single-cell image-based profiles while the .csv files contain the well-level aggregated profiles.
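Before loading a single-cell .sqlite file in full, it can help to see which tables it holds and how large they are. The following is a minimal sketch using Python's built-in sqlite3 module. `summarize_tables` is a hypothetical helper, and the `Cells` table in the demo is a stand-in, not a statement about the actual schema of these files.

```python
import sqlite3


def summarize_tables(conn):
    """Return {table_name: row_count} for every table in an open SQLite connection."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    counts = {}
    for name in names:
        # Table names come from the database itself; quote them defensively.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts


if __name__ == "__main__":
    # Stand-in database; for real data, connect to a downloaded .sqlite file instead.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Cells (ObjectNumber INTEGER)")
    conn.executemany("INSERT INTO Cells VALUES (?)", [(1,), (2,), (3,)])
    print(summarize_tables(conn))
    conn.close()
```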

See this blog post for the meaning of (CellProfiler-derived) Cell Painting features. Sample Cell Painting images can be found in the example_images folder.

To extract features with a pretrained neural network using DeepProfiler, follow the README.md instructions, which create well-level profiles.

Step 3: Process the profiles using pycytominer

After generating the well-level CellProfiler-based features, use Pycytominer to add metadata from metadata/moa, normalize the profiles to the whole plate and to the negative controls, separately, and then filter out invariant and redundant features.
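To illustrate the two normalization modes for a single feature, here is a plain-Python sketch of median/MAD scaling. It shows the idea only and is not Pycytominer's implementation. The function name, the `epsilon` guard value and the toy data are assumptions made for the example.

```python
from statistics import median


def mad_robustize(values, reference, epsilon=1e-6):
    """Center and scale values by the median and MAD of a reference population."""
    center = median(reference)
    mad = median(abs(x - center) for x in reference)
    # 1.4826 makes the MAD comparable to a standard deviation for normal data;
    # epsilon (arbitrary here) guards against division by zero.
    scale = 1.4826 * mad + epsilon
    return [(x - center) / scale for x in values]


# Toy feature values for five wells on one plate; the first three are negative controls.
feature = [1.0, 2.0, 3.0, 4.0, 10.0]
is_negcon = [True, True, True, False, False]
negcon_values = [v for v, flag in zip(feature, is_negcon) if flag]

to_whole_plate = mad_robustize(feature, feature)      # reference: every well
to_negcon = mad_robustize(feature, negcon_values)     # reference: negative controls only
print(to_whole_plate)
print(to_negcon)
```

The only difference between the two modes is the reference population used to estimate the center and spread.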

To regenerate all the profiles, first install Miniconda, then clone this repo, download the files, and create and activate the conda environment:

git clone https://github.com/jump-cellpainting/2024_Chandrasekaran_NatureMethods
cd 2024_Chandrasekaran_NatureMethods
git lfs pull
git submodule update --init --recursive
conda env create --force --file environment.yml
conda activate profiling

Then run the pycytominer workflow with the command

./run.sh

This creates the profiles in the profiles/ folder for all the plates in each batch. The folder for each plate contains the following files

| File name | Description |
| --- | --- |
| <plate_ID>.csv.gz | Aggregated profiles |
| <plate_ID>_augmented.csv | Metadata annotated profiles |
| <plate_ID>_normalized.csv.gz | MAD robustized to whole plate profiles |
| <plate_ID>_normalized_negcon.csv.gz | MAD robustized to negative control profiles |
| <plate_ID>_normalized_feature_select_plate.csv.gz | Feature selected normalized to whole plate profiles |
| <plate_ID>_normalized_feature_select_negcon_plate.csv.gz | Feature selected normalized to negative control profiles |
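After the workflow finishes, a quick check can confirm that each plate folder holds the files listed above. This is a sketch only. `missing_profiles` is a hypothetical helper, the suffixes are copied from the table, and the location of each plate folder inside profiles/ is left to the caller.

```python
from pathlib import Path

# Suffixes copied from the file table above.
PROFILE_SUFFIXES = [
    ".csv.gz",
    "_augmented.csv",
    "_normalized.csv.gz",
    "_normalized_negcon.csv.gz",
    "_normalized_feature_select_plate.csv.gz",
    "_normalized_feature_select_negcon_plate.csv.gz",
]


def missing_profiles(plate_dir, plate_id):
    """Return the expected profile file names that are absent from plate_dir."""
    plate_dir = Path(plate_dir)
    expected = [f"{plate_id}{suffix}" for suffix in PROFILE_SUFFIXES]
    return [name for name in expected if not (plate_dir / name).is_file()]
```

An empty list means every expected file for that plate is present.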

Annotated DeepProfiler profiles are spherized using this notebook.

Step 4: Run the benchmark script

The benchmark scripts compute Average Precision (AP) for various retrieval tasks, such as retrieving replicates against negative controls, retrieving perturbation pairs against non-pairs, and retrieving gene-compound pairs against non-pairs. AP was calculated using the well-level profiles that were normalized to the negative controls and then feature selected.
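For readers unfamiliar with the metric, the sketch below computes AP for one ranked list of matches using the standard textbook definition. It is not the code used in the benchmark notebooks, and the function names are illustrative. A label of 1 marks a correct match (for example, a replicate) and 0 marks anything else (for example, a negative control), ordered from most to least similar to the query profile.

```python
def average_precision(relevance):
    """Average precision of a ranked list of 0/1 labels, best match first."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank, counted only where a correct match occurs.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """Mean of the per-query AP values."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


# Correct matches at ranks 1 and 3: AP = (1/1 + 2/3) / 2
print(average_precision([1, 0, 1, 0]))
```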

To run the benchmark script, create and activate the conda environment in benchmark/:

conda env create --force --file benchmark/environment.yml
conda activate analysis

Then run the Jupyter notebooks (benchmark/1.calculate-map-cp.ipynb, benchmark/2.calculate-map-dp.ipynb, and benchmark/3.generate-map-figure.ipynb) to create the figures in benchmark/figures/.

Data Organization

The following is the description of relevant files and contents of the relevant folders in this repo.

Maintenance plan

We have provided our maintenance plan in maintenance_plan.md.

Compute resources

For segmentation and feature extraction by CellProfiler, each plate of images took on average 30 minutes to process, using a fleet of 200 m4.xlarge spot instances (800 vCPUs), which cost approximately $10 per plate. Aggregation into mean profiles takes 12-18 hours, though it can be parallelized on a single large machine, at a total cost of <$1 per plate. For profile processing with pycytominer, each plate took under two minutes on a local machine (Intel Core i9 with 16 GB memory).

DeepProfiler took around 8 hours to extract features from ~280,000 images on a p3.2xlarge with a single Tesla V100-SXM2 GPU. Note that cell locations were precomputed with the CellProfiler segmentation pipeline.

Running the benchmark notebooks took an hour on a local machine (Intel Core i9 with 16 GB memory).
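For planning purposes, the figures above can be turned into rough per-unit rates. The short calculation below uses only the numbers quoted in this section. The derived rates are back-of-the-envelope estimates, not measured values.

```python
# Figures quoted in the text above.
instances = 200
total_vcpus = 800
hours_per_plate = 0.5          # 30 minutes on average
cost_per_plate_usd = 10.0      # approximate

instance_hours = instances * hours_per_plate                  # instance-hours per plate
vcpu_hours = instance_hours * (total_vcpus / instances)       # vCPU-hours per plate
implied_rate_usd = cost_per_plate_usd / instance_hours        # USD per instance-hour

images = 280_000               # approximate
deepprofiler_hours = 8         # approximate
images_per_second = images / (deepprofiler_hours * 3600)

print(f"{instance_hours:.0f} instance-hours, {vcpu_hours:.0f} vCPU-hours per plate")
print(f"implied spot rate: ${implied_rate_usd:.2f} per instance-hour")
print(f"DeepProfiler throughput: {images_per_second:.1f} images per second")
```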

License

We use a dual license in this repository. We license the source code as BSD 3-Clause, and license the data, results, and figures as CC0 1.0.

Manuscript

A manuscript describing the contents of this repository is available on bioRxiv.