jump-cellpainting / 2024_Chandrasekaran_NatureMethods

BSD 3-Clause "New" or "Revised" License

DeepProfiler availability #47

Closed wongdanr closed 2 years ago

wongdanr commented 2 years ago

Hello @niranjchandrasekaran, Are the (well-level) profiles from DeepProfiler available for download? Wondering where I can find them. Thank you!

niranjchandrasekaran commented 2 years ago

Hi @wongdanr, DeepProfiler features are currently available only for ten plates.

wongdanr commented 2 years ago

Thanks @niranjchandrasekaran appreciate it. Are there any plans to generate the rest of the profiles for the other plates? I'm trying to generate them myself, but I'm new to DeepProfiler and I'm learning that it is not that easy to do.

niranjchandrasekaran commented 2 years ago

Hi @wongdanr, at the moment we are not planning to generate the DeepProfiler features for the rest of the plates. Please continue to ask your questions either in the repo for the handbook or in the slack channel and I am sure the DeepProfiler users will be able to help you.

wongdanr commented 2 years ago

Thanks @niranjchandrasekaran. In preparation for applying DeepProfiler to this dataset, I need single cell locations and their corresponding well and site. In the sqlite files from the S3 bucket listed in the Step 2 README, I don't see any mapping from the single cell features to the corresponding well and site (only mappings from single cell features to things like index, TableNumber, ImageNumber, etc.). Is this mapping to well/site available? Thanks!

niranjchandrasekaran commented 2 years ago

I am tagging in @johnarevalo, who will likely know how to extract the single cell locations from the SQLite file.

bethac07 commented 2 years ago

@wongdanr There should be a bunch of Metadata columns, including Metadata_Well and Metadata_Site; there should also be a couple of different Location_Center_X and Location_Center_Y columns (not entirely equivalent, but sufficiently so for these purposes).
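
A minimal sketch of how that lookup might go, assuming the common CellProfiler-style SQLite layout: an Image table holding Metadata_Well and Metadata_Site, and a Cells table holding Cells_Location_Center_X and Cells_Location_Center_Y, joined on TableNumber and ImageNumber. The table names, the Cells_ prefix, the ObjectNumber column, and the toy rows are all assumptions for illustration; check the schema of the actual file before relying on them.

```python
import sqlite3


def cell_locations(conn):
    """Return (well, site, x, y) for every cell by joining Cells to Image."""
    query = """
        SELECT i.Metadata_Well, i.Metadata_Site,
               c.Cells_Location_Center_X, c.Cells_Location_Center_Y
        FROM Cells AS c
        JOIN Image AS i
          ON c.TableNumber = i.TableNumber
         AND c.ImageNumber = i.ImageNumber
        ORDER BY i.Metadata_Well, i.Metadata_Site, c.ObjectNumber
    """
    return conn.execute(query).fetchall()


# Toy in-memory database standing in for a real plate file (assumed schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Image (TableNumber TEXT, ImageNumber INTEGER,
                        Metadata_Well TEXT, Metadata_Site INTEGER);
    CREATE TABLE Cells (TableNumber TEXT, ImageNumber INTEGER,
                        ObjectNumber INTEGER,
                        Cells_Location_Center_X REAL,
                        Cells_Location_Center_Y REAL);
    INSERT INTO Image VALUES ('t1', 1, 'A01', 1);
    INSERT INTO Image VALUES ('t1', 2, 'A01', 2);
    INSERT INTO Cells VALUES ('t1', 1, 1, 10.5, 20.0);
    INSERT INTO Cells VALUES ('t1', 1, 2, 30.0, 40.5);
    INSERT INTO Cells VALUES ('t1', 2, 1, 5.0, 6.0);
""")

rows = cell_locations(conn)
for row in rows:
    print(row)
```

For a real plate, the in-memory connection would be replaced with sqlite3.connect() on the downloaded file.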

wongdanr commented 2 years ago

Thanks @bethac07! Exactly what I was looking for. Does anyone know why there are 16 "sites" when there are only 9 image fields? How do you map sites to fields? Aren't these synonymous?

bethac07 commented 2 years ago

Most plates had 9 sites, but some had more and some fewer. Can you clarify which plate? But yes, typically "site" and "field" are used interchangeably.

wongdanr commented 2 years ago

I see, thank you very much!

wongdanr commented 2 years ago

Hi @niranjchandrasekaran where is the model you used to generate the DeepProfiler embeddings? Thanks!

niranjchandrasekaran commented 2 years ago

Hi @wongdanr, John used EfficientNet with pretrained features. More details here and in the "Deep learning feature extraction" section of the manuscript.

wongdanr commented 2 years ago

Thank you @niranjchandrasekaran and @johnarevalo. The README says to download the pretrained model, but I don't see it in the deep_profiles/ directory. Can this be shared? Appreciate it!

johnarevalo commented 2 years ago

Hi @wongdanr,

In the latest version of DeepProfiler, setting profile: checkpoint to "None" (as a string) in the jump.json file makes DeepProfiler download the pretrained weights automatically.

I'll open an issue to update the README.
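
A minimal sketch of that config change, done with Python's json module rather than by hand. The helper name and the toy config fragment are hypothetical; a real jump.json has many more keys, and only the profile: checkpoint setting comes from the comment above.

```python
import json


def use_default_weights(config):
    """Return a copy of config with profile.checkpoint set to the string "None"."""
    updated = json.loads(json.dumps(config))  # deep copy via a JSON round trip
    updated.setdefault("profile", {})["checkpoint"] = "None"
    return updated


# Hypothetical fragment standing in for the real jump.json contents.
config = {"profile": {"checkpoint": "some_weights.hdf5"}}
updated = use_default_weights(config)

# The value is serialized as the JSON string "None", not as JSON null.
print(json.dumps(updated, indent=2))
```

Note that the value is the four-character string "None", so it must stay quoted in the file; Python's None would be written out as null instead.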

wongdanr commented 2 years ago

Thank you @johnarevalo. Is it possible to provide the updated jump.json file used to generate the deep profiles? I tried using the one in deep_profiles/inputs/config/ but I think this might be out of date? Thank you!

johnarevalo commented 2 years ago

We used the file you mentioned in this GitHub repo. Could you paste the output of the profile command after setting checkpoint: "None" in jump.json?

wongdanr commented 2 years ago

Thanks @johnarevalo, sorry it's working now actually. It looks like a new key needs to be added to the json file called 'label_smoothing' ('train':'model':'params':'label_smoothing').
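
A minimal sketch of adding such a nested key without overwriting anything already in the config. The thread does not say which value was used, so the 0.0 below is only a placeholder assumption, and the helper name and the learning_rate entry are hypothetical.

```python
def ensure_key(config, path, default):
    """Set config[path[0]][path[1]]... to default unless that key already exists."""
    node = config
    for key in path[:-1]:
        node = node.setdefault(key, {})  # create missing intermediate dicts
    node.setdefault(path[-1], default)
    return config


# Hypothetical fragment of the train section of a config file.
config = {"train": {"model": {"params": {"learning_rate": 0.005}}}}

# 0.0 is a placeholder value, not one confirmed in this thread.
ensure_key(config, ["train", "model", "params", "label_smoothing"], 0.0)
print(config)
```

Because the helper uses setdefault, running it against a config that already defines the key leaves the existing value in place.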

johnarevalo commented 2 years ago

Thanks for debugging this @wongdanr. DeepProfiler is still under development, and backward-incompatible changes can be introduced without notice.

wongdanr commented 2 years ago

Thank you very much @johnarevalo, I was able to profile the CPJUMP1 compound data. From the README, I wasn't sure how the various profiles in outputs/results/features/ were aggregated. Did you simply take the median of the various profiles within a well to get a well-level aggregation vector, and then apply Pycytominer to the well-level median vectors to get the final deep profiles reported in the repo? Thanks!

johnarevalo commented 2 years ago

We used the build_profiles.py script to aggregate the extracted features using mean: https://github.com/jump-cellpainting/2021_Chandrasekaran_submitted/blob/58583b45e01e06da7a642dd92b7f955e2fe37226/deep_profiles/utils/build_profiles.py#L25
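
A minimal sketch of mean aggregation to well level with pandas. This is not the build_profiles.py code itself; the function name, the toy feature columns, and the assumption that each input row is one finer-grained feature vector tagged with Metadata_Plate and Metadata_Well are for illustration only.

```python
import pandas as pd


def aggregate_by_well(features, keys=("Metadata_Plate", "Metadata_Well")):
    """Average every numeric feature column over rows sharing a plate and well."""
    return features.groupby(list(keys), as_index=False).mean(numeric_only=True)


# Toy input: two feature vectors in well A01 and one in well A02.
features = pd.DataFrame({
    "Metadata_Plate": ["P1", "P1", "P1"],
    "Metadata_Well": ["A01", "A01", "A02"],
    "feature_0": [1.0, 3.0, 5.0],
    "feature_1": [0.0, 2.0, 4.0],
})

profiles = aggregate_by_well(features)
print(profiles)
```

Swapping .mean() for .median() would give the median aggregation asked about above, which is a different choice from the one described for the published profiles.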

wongdanr commented 2 years ago

Thank you @johnarevalo. Once the .parquet file is created, how do I create the normalized profiles of all the plates using pycytominer? I cloned the repo neurips_cpjump1. It seems like the neurips_cpjump1/run.sh script processes only 10 of the plates. I'm not sure where I can specify the created .parquet file as an argument. Thanks!

johnarevalo commented 2 years ago

It's great that you have generated the features in the .parquet file! This file contains 2 metadata columns (Metadata_Plate, Metadata_Well) and 6400 feature columns, as described in the README. The PLATE_ID.csv.gz files are just subsets of the parquet file, split by Metadata_Plate.

You can obtain such splits with pandas:

import pandas as pd

# Load the aggregated profiles, then write one gzipped CSV per plate.
df = pd.read_parquet('profiles.parquet')
groups = df.groupby('Metadata_Plate')
for plate_id, group in groups:
    group.to_csv(f'{plate_id}.csv.gz', compression='gzip', index=False)

I haven't tested it, but I guess you get the idea.

wongdanr commented 2 years ago

Great thank you! I'm wondering more about how to generate the normalized versions of those though, such as the "augmented" and "spherized" versions that are included in the repo. @johnarevalo

johnarevalo commented 2 years ago

You can follow the profiling recipe repo to generate the augmented profiles and to run the rest of the pipeline.

wongdanr commented 2 years ago

Thanks @johnarevalo, where can I find details about the pre-trained model that gets automatically downloaded? Was this model trained to classify drug perturbation type or plate? I see in the jump.json file that "label_field": "Treatment" but I also see that "targets": ["Metadata_Plate"] and I'm not quite sure which one is used for the classification label. Also was this model trained on just the JUMP1 compound data?

johnarevalo commented 2 years ago

The train section in jump.json is considered only when the train operation runs. In this case we ran the profile operation to extract features using a model pretrained on ImageNet. So any value set in train, including the ones you mention, is ignored by DeepProfiler.

DeepProfiler relies on the efficientnet library. You can check the list of available models and the details of each. DeepProfiler uses EfficientNetB0 by default.