Closed afermg closed 6 months ago
This perturbation may have been dropped at some point. I doubt that this is due to missing plates: even if one plate is missing for some reason, for that compound to not be present in well.csv.gz, four other plates from four other sources must also be missing.
While running your analysis, if you come across other such compounds, please do let us know.
Wrote a quick script that uses jump_portrait to just fetch them all.
#!/usr/bin/env jupyter
"""JCP ids in {crispr, orf, compound} dataset but not on well dataset."""
from jump_portrait.fetch import get_table

# JCP ids that appear in at least one well
well_jcp = set(get_table("well")["Metadata_JCP2022"])
datasets = ("compound", "crispr", "orf")
d = {}
for dataset in datasets:
    dataset_jcp = get_table(dataset)["Metadata_JCP2022"]
    # ids listed for this perturbation type that never appear in a well
    d[dataset] = set(dataset_jcp) - well_jcp
print(d)
Produces this list, which includes our "smoking gun" compound.
Yielding these numbers for compound, crispr and orf respectively:
[len(x) for x in d.values()]
[957, 0, 10]
LMK if you think we should put this somewhere.
Or should we remove them from their respective X.csv.gz?
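If removal is the route taken, a minimal sketch of the filtering step could look like the following. It is not the repo's actual process: `drop_unprofiled` is a hypothetical helper, the toy CSV text and ids are made up, and only the `Metadata_JCP2022` column name comes from the script above.

```python
import csv
import io

JCP_COL = "Metadata_JCP2022"


def drop_unprofiled(table_csv, well_ids):
    """Return (filtered CSV text, dropped ids), keeping rows whose id is in well_ids."""
    reader = csv.DictReader(io.StringIO(table_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    dropped = []
    for row in reader:
        if row[JCP_COL] in well_ids:
            writer.writerow(row)
        else:
            dropped.append(row[JCP_COL])
    return out.getvalue(), dropped


# Toy stand-ins for compound.csv.gz and the set of ids found in well.csv.gz
toy_compound = (
    "Metadata_JCP2022,Metadata_Note\n"
    "JCP2022_000001,profiled\n"
    "JCP2022_000002,never plated\n"
)
toy_well_ids = {"JCP2022_000001"}

filtered, dropped = drop_unprofiled(toy_compound, toy_well_ids)
print(dropped)
```

Returning the dropped ids alongside the filtered table keeps a record of what was removed, which would also cover the "put this somewhere" option.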
FYI: this compound is present in the harmony cell painting features that @johnarevalo produced -- how is that possible if it got dropped? Unless there are two JCP_IDs for the same SMILES (I matched with john's features based on SMILES)
CCC(=O)OC(CC(=O)[O-])C[N+](C)(C)C
> FYI: this compound is present in the harmony cell painting features that @johnarevalo produced -- how is that possible if it got dropped? Unless there are two JCP_IDs for the same SMILES (I matched with john's features based on SMILES)
Must be the case. If you look at JCP2022_088779, it seems to be pretty much the same as JCP2022_088778 (based on their InChIKey) and that one is present in well.csv.gz
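One way to look for other such near-duplicates is to group ids by the first block of the InChIKey, which hashes the connectivity and ignores stereochemistry and protonation. This is only a sketch: `group_by_connectivity` is a hypothetical helper and the ids and keys below are invented toy values, not entries from compound.csv.gz.

```python
from collections import defaultdict


def group_by_connectivity(records):
    """Group JCP ids by the first (connectivity) block of their InChIKey.

    records: iterable of (jcp_id, inchikey) pairs.
    Returns only the blocks shared by more than one id.
    """
    groups = defaultdict(set)
    for jcp_id, inchikey in records:
        groups[inchikey.split("-")[0]].add(jcp_id)
    return {block: sorted(ids) for block, ids in groups.items() if len(ids) > 1}


# Toy records: the first two differ only in the last (protonation) character
toy_records = [
    ("JCP2022_900001", "AAAAAAAAAAAAAA-UHFFFAOYSA-N"),
    ("JCP2022_900002", "AAAAAAAAAAAAAA-UHFFFAOYSA-M"),
    ("JCP2022_900003", "BBBBBBBBBBBBBB-UHFFFAOYSA-N"),
]

print(group_by_connectivity(toy_records))
```

Matching on the full key instead would only catch exact duplicates, so the choice depends on how strict "the same compound" needs to be.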
More than the compounds, I am currently more interested in the ORF reagents that are missing. Perhaps that will help us figure out what's happening with the compounds. I couldn't find the missing ORFs in the metadata file (internal link) that I have been using for ORFs, but I can find them in another file (internal link). So, the answer must be in the source file that was used to create the metadata files in this repo. Maybe @shntnu already knows why those compounds and ORFs are missing. So I will wait for his thoughts before diving deeper into this.
jump_portrait uses the current versions of https://github.com/jump-cellpainting/datasets/tree/main/metadata.
If we want to be precise, the source code is here:
METADATA_LOCATION = (
    "https://github.com/jump-cellpainting/datasets/raw/"
    "baacb8be98cfa4b5a03b627b8cd005de9f5c2e70/metadata/"
    "{}.csv.gz"
)
IIRC the hash is the same as the current master. I use permalinks for reproducibility.
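For reference, this is how the template above expands into a pinned URL for a given table. `metadata_url` is a hypothetical wrapper added for illustration; how jump_portrait uses the constant internally is not shown in this thread.

```python
METADATA_LOCATION = (
    "https://github.com/jump-cellpainting/datasets/raw/"
    "baacb8be98cfa4b5a03b627b8cd005de9f5c2e70/metadata/"
    "{}.csv.gz"
)


def metadata_url(table_name):
    """Fill the table name into the commit-pinned permalink template."""
    return METADATA_LOCATION.format(table_name)


for name in ("well", "compound", "crispr", "orf"):
    print(metadata_url(name))
```

Because the commit hash is part of the URL, the same table name resolves to the same file contents until the hash is changed deliberately.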
> Must be the case. If you look at JCP2022_088779, it seems to be pretty much the same as JCP2022_088778 (based on their InChIKey) and that one is present in well.csv.gz
Yep – it is certainly possible that there are a few more listed unique JCP2022s in compound.csv.gz that are present in well.csv.gz. I think it is wise to remove these 957. I'll note that all these 957 are present in the original (internal) source https://github.com/jump-cellpainting/jump-cellpainting/blob/master/3.standardize/standardize_ksiling_jumpmoa_jumptarget2/data/05_release/2022_10_18_JUMP-CP_compound_library_aggregated.csv so we can look up more details there.
> So, the answer must be in the source file that was used to create the metadata files in this repo.
Not all compounds that were planned were actually profiled (n=957 apparently, although some might be explained by SMILES inconsistency)
Source files: https://github.com/jump-cellpainting/jump-cellpainting/tree/master/3.standardize/standardize_ksiling_jumpmoa_jumptarget2
> I couldn't find the missing ORFs in the metadata file (internal link) that I have been using for ORFs, but I can find them in another file (internal link).
These are in Target2 plates cpg0000-jump-pilot[orf] but not in cpg0016-jump[orf]
https://github.com/jump-cellpainting/JUMP-Target/blob/master/JUMP-Target-1_orf_metadata.tsv
These should not be removed
@afermg – it will indeed be great to document these somewhere
How do you think we should approach the missing ORF entries? I don't love the idea of pipelines breaking due to these, but adding them as exceptions in my tools seems like an anti-pattern. Any thoughts @shntnu @niranjchandrasekaran?
In a related topic, could you let me know when the entries have been removed? I will need to update my tools to point to the upgraded versions. Thanks!
> How do you think we should approach the missing ORF entries? I don't love the idea of pipelines breaking due to these, but adding them as exceptions in my tools seems like an anti-pattern. Any thoughts @shntnu @niranjchandrasekaran?
They are missing in cpg0016 but present in cpg0000. JUMP comprises 4 datasets: cpg000{0,1,2} and cpg0016, so it is not missing per se.
Can you explain why pipelines would break?
> In a related topic, could you let me know when the entries have been removed? I will need to update my tools to point to the upgraded versions. Thanks!
We are versioning this repo; I suspect pegging versions would be the way to go (instead of having a process to report changes). What do you think?
> These are in Target2 plates cpg0000-jump-pilot[orf] but not in cpg0016-jump[orf]
> They are missing in cpg0016 but present in cpg0000. JUMP comprises 4 datasets: cpg000{0,1,2} and cpg0016, so it is not missing per se.
Ah, I thought I recognized the gene names from somewhere. This makes sense.
Hey @shntnu, @ashah03 was running an analysis with jump_portrait and found a compound with no plate metadata (JCP2022_088778). I checked that it is not a jump_portrait issue by manually grepping the csv.gz files.
How to reproduce
1) Present in compounds dataset
Outputs 1
Outputs 0
That means that either this perturbation was dropped or (more likely) the metadata for some plates is missing.
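The original grep commands did not survive in the text above. As a rough Python equivalent of that check, the sketch below counts rows for one id in a gzipped CSV. `count_rows_with_id` is a hypothetical helper, and the gzipped tables are tiny in-memory stand-ins built to mirror the reported counts, not the real compound.csv.gz and well.csv.gz.

```python
import csv
import gzip
import io


def count_rows_with_id(gz_bytes, jcp_id, column="Metadata_JCP2022"):
    """Count rows of a gzipped CSV whose `column` equals `jcp_id`."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", newline="") as handle:
        return sum(1 for row in csv.DictReader(handle) if row[column] == jcp_id)


# Toy stand-ins: the id is listed as a compound but never appears in a well
toy_compound_gz = gzip.compress(b"Metadata_JCP2022\nJCP2022_088778\n")
toy_well_gz = gzip.compress(
    b"Metadata_JCP2022,Metadata_Plate\nJCP2022_000001,PLATE_A\n"
)

print(count_rows_with_id(toy_compound_gz, "JCP2022_088778"))  # 1
print(count_rows_with_id(toy_well_gz, "JCP2022_088778"))  # 0
```

Comparing the exact column value avoids false positives from ids that only share a prefix, which a plain substring grep could pick up.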