jump-cellpainting / JUMP-Target

Lists and 384-well plate maps of compounds and genetic perturbations designed to assess connectivity in profiling assays
MIT License

Update JUMP-Target-2_compound_metadata.tsv to include standardized SMILES and related identifiers #32

Closed: shntnu closed this 3 months ago

shntnu commented 3 months ago

Partially addresses #9

JUMP-Target-2_compound_metadata.tsv was created in https://github.com/jump-cellpainting/datasets-private/pull/86


```sh
# Fix Target2
wget \
  https://raw.githubusercontent.com/jump-cellpainting/JUMP-Target/09f11aefd6b550cfb2d2074e38a65c3be39ddc39/JUMP-Target-2_compound_metadata.tsv \
  -O ~/Downloads/JUMP-Target-2_compound_metadata.tsv

git clone git@github.com:jump-cellpainting/compound-annotator.git
cd compound-annotator
git checkout 79596e4

python StandardizeMolecule.py run \
  --input="/Users/shsingh/Downloads/JUMP-Target-2_compound_metadata.tsv" \
  --output="/Users/shsingh/Downloads/JUMP-Target-2_compound_metadata_standardized.csv" \
  --num_cpu=14 \
  --augment
```

``` r
library(tidyverse)
```

    ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
    ✔ dplyr     1.1.4     ✔ readr     2.1.4
    ✔ forcats   1.0.0     ✔ stringr   1.5.1
    ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
    ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
    ✔ purrr     1.0.2
    ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ✖ dplyr::filter() masks stats::filter()
    ✖ dplyr::lag()    masks stats::lag()
    ℹ Use the conflicted package () to force all conflicts to become errors

``` r
df1 <-
  read_tsv("~/Downloads/JUMP-Target-2_compound_metadata.tsv",
           show_col_types = FALSE,
           guess_max = Inf)

df2 <-
  read_csv("~/Downloads/JUMP-Target-2_compound_metadata_standardized.csv",
           show_col_types = FALSE,
           guess_max = Inf)

df3 <-
  df2 %>%
  rename(smiles = SMILES_original) %>%
  select(
    broad_sample,
    InChIKey,
    pert_iname,
    pubchem_cid,
    target,
    pert_type,
    control_type,
    smiles,
    smiles_standardized = SMILES_standardized,
    InChI_standardized,
    InChIKey_standardized
  )
```

``` r
compare::compare(
  df1,
  df3 %>%
    select(-smiles_standardized, -InChI_standardized, -InChIKey_standardized),
  allowAll = TRUE
)
```

    TRUE
      dropped attributes

``` r
df3 %>%
  write_tsv("output/JUMP-Target-2_compound_metadata_updated.tsv")

compound <-
  read_csv("https://raw.githubusercontent.com/jump-cellpainting/datasets/70c3e3554d6982480606c8cdd55cb808c1a796c1/metadata/compound.csv.gz",
           show_col_types = FALSE,
           guess_max = 10000)

well <-
  read_csv("https://raw.githubusercontent.com/jump-cellpainting/datasets/70c3e3554d6982480606c8cdd55cb808c1a796c1/metadata/well.csv.gz",
           show_col_types = FALSE,
           guess_max = 10000)

plate <-
  read_csv("https://raw.githubusercontent.com/jump-cellpainting/datasets/70c3e3554d6982480606c8cdd55cb808c1a796c1/metadata/plate.csv.gz",
           show_col_types = FALSE)

target2_cpg0016_InChIKey <-
  plate %>%
  filter(Metadata_PlateType == "TARGET2") %>%
  slice_head(n = 1) %>%
  inner_join(well) %>%
  inner_join(compound) %>%
  distinct(Metadata_InChIKey) %>%
  rename(InChIKey = Metadata_InChIKey)
```

    Joining with `by = join_by(Metadata_Source, Metadata_Plate)`
    Joining with `by = join_by(Metadata_JCP2022)`

There are no inconsistencies as noted in https://github.com/jump-cellpainting/datasets/issues/80#issuecomment-1787924639

``` r
compare::compare(
  df3 %>% select(x = InChIKey_standardized) %>% distinct(),
  target2_cpg0016_InChIKey %>% select(x = InChIKey),
  allowAll = TRUE
)
```

    TRUE
      sorted

New

``` r
df3 %>%
  distinct(InChIKey_standardized) %>%
  count() %>%
  knitr::kable()
```

|   n |
|----:|
| 302 |

Old

``` r
df3 %>%
  distinct(InChIKey) %>%
  count() %>%
  knitr::kable()
```

|   n |
|----:|
| 307 |

``` r
df3 %>%
  filter(pert_iname != "DMSO") %>%
  distinct(InChIKey_standardized) %>%
  count() %>%
  knitr::kable()
```

|   n |
|----:|
| 301 |
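The drop from 307 distinct original InChIKeys to 302 distinct standardized ones suggests that some original keys map to the same standardized key. The following is a minimal sketch of how those groups could be listed. It is not part of the original analysis: the `toy` table, its compound names, and its key values are hypothetical stand-ins for `df3`, and `n_original` is a column name introduced here for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for df3; names and keys are made up for illustration.
toy <- tribble(
  ~pert_iname,   ~InChIKey, ~InChIKey_standardized,
  "cmpd_a",      "AAAA-1",  "AAAA-S",
  "cmpd_a_salt", "AAAA-2",  "AAAA-S",
  "cmpd_b",      "BBBB-1",  "BBBB-S",
  "DMSO",        "DMSO-1",  "DMSO-S"
)

# Keep one row per (original, standardized) pair, count how many original
# keys share each standardized key, and keep only keys shared by more than one.
collapsed <-
  toy %>%
  distinct(InChIKey, InChIKey_standardized) %>%
  add_count(InChIKey_standardized, name = "n_original") %>%
  filter(n_original > 1) %>%
  arrange(InChIKey_standardized, InChIKey)

print(collapsed)
```

Running the same pipeline on `df3` instead of `toy` should list the original InChIKeys that collapse together. If standardization left any `InChIKey_standardized` values missing, those rows would also be grouped together and would need to be checked separately.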