jump-cellpainting / datasets

Images and other data from the JUMP Cell Painting Consortium
BSD 3-Clause "New" or "Revised" License

Missing compound (JCP2022_088778) #104

Closed afermg closed 1 month ago

afermg commented 3 months ago

Hey @shntnu, @ashah03 was running an analysis with jump_portrait and found a compound with no plate metadata (JCP2022_088778). I checked that this is not a jump_portrait problem by manually grepping the csv.gz files.

How to reproduce

1. Present in compounds dataset

    wget https://github.com/jump-cellpainting/datasets/raw/main/metadata/compound.csv.gz
    gunzip -q compound.csv.gz
    grep "JCP2022_088778" compound.csv | wc -l

    Outputs 1

2. Not present in wells

    wget https://github.com/jump-cellpainting/datasets/raw/main/metadata/well.csv.gz
    gunzip -q well.csv.gz
    grep "JCP2022_088778" well.csv | wc -l

    Outputs 0
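The same two checks can be sketched in Python without unpacking the archives. This is only an illustration: the helper name `count_rows_with_id` is made up here, and unlike `grep`, which matches the ID anywhere on a line, it compares against the `Metadata_JCP2022` column only.

```python
import csv
import gzip


def count_rows_with_id(path, jcp_id, column="Metadata_JCP2022"):
    """Count rows of a gzipped CSV whose `column` equals `jcp_id`."""
    # Text mode lets csv.DictReader read straight from the compressed file.
    with gzip.open(path, mode="rt", newline="") as handle:
        reader = csv.DictReader(handle)
        return sum(1 for row in reader if row[column] == jcp_id)
```

Called on the downloaded `compound.csv.gz` and `well.csv.gz` with `"JCP2022_088778"`, it should mirror the two `grep | wc -l` counts above, provided the ID appears only in that column.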

That means that either this perturbation was dropped or (more likely) the metadata for some plates is missing.

niranjchandrasekaran commented 3 months ago

This perturbation may have been dropped at some point. I doubt that missing plates are the cause: even if one plate were missing for some reason, four other plates from four other sources would also have to be missing for the compound to be absent from well.csv.gz.

While running your analysis, if you come across other such compounds, please do let us know.

afermg commented 3 months ago

Wrote a quick script that uses jump_portrait to just fetch them all.

#!/usr/bin/env jupyter

"""JCP ids in {crispr, orf, compound} dataset but not on well dataset."""

from jump_portrait.fetch import get_table

well_jcp = set(get_table("well")["Metadata_JCP2022"])
datasets = ("compound", "crispr", "orf")
d = {}
for dataset in datasets:
    dataset_jcp = get_table(dataset)["Metadata_JCP2022"]
    d[dataset] = set(dataset_jcp) - well_jcp
print(d)

Produces this list, which includes our "smoking gun" compound.

``` {'compound': {'JCP2022_000003', 'JCP2022_000058', 'JCP2022_000145', 'JCP2022_000225', 'JCP2022_000546', 'JCP2022_000672', 'JCP2022_000729', 'JCP2022_000738', 'JCP2022_000767', 'JCP2022_000833', 'JCP2022_000887', 'JCP2022_001084', 'JCP2022_001149', 'JCP2022_001178', 'JCP2022_001186', 'JCP2022_001283', 'JCP2022_001387', 'JCP2022_001556', 'JCP2022_001561', 'JCP2022_001829', 'JCP2022_001874', 'JCP2022_002180', 'JCP2022_002292', 'JCP2022_002294', 'JCP2022_002313', 'JCP2022_002330', 'JCP2022_002450', 'JCP2022_002534', 'JCP2022_002571', 'JCP2022_002577', 'JCP2022_002617', 'JCP2022_002669', 'JCP2022_002729', 'JCP2022_003090', 'JCP2022_003204', 'JCP2022_003350', 'JCP2022_003428', 'JCP2022_003473', 'JCP2022_003533', 'JCP2022_003562', 'JCP2022_003592', 'JCP2022_003682', 'JCP2022_003705', 'JCP2022_003716', 'JCP2022_003791', 'JCP2022_003957', 'JCP2022_003967', 'JCP2022_004010', 'JCP2022_004028', 'JCP2022_004041', 'JCP2022_004285', 'JCP2022_004590', 'JCP2022_004596', 'JCP2022_004891', 'JCP2022_005136', 'JCP2022_005247', 'JCP2022_005487', 'JCP2022_005500', 'JCP2022_005709', 'JCP2022_005903', 'JCP2022_006131', 'JCP2022_006285', 'JCP2022_006357', 'JCP2022_006370', 'JCP2022_006518', 'JCP2022_006528', 'JCP2022_006608', 'JCP2022_006839', 'JCP2022_007064', 'JCP2022_007147', 'JCP2022_007188', 'JCP2022_007207', 'JCP2022_007272', 'JCP2022_007343', 'JCP2022_007654', 'JCP2022_007718', 'JCP2022_007770', 'JCP2022_007798', 'JCP2022_007814', 'JCP2022_007818', 'JCP2022_007944', 'JCP2022_008021', 'JCP2022_008226', 'JCP2022_008247', 'JCP2022_008312', 'JCP2022_008365', 'JCP2022_008378', 'JCP2022_008680', 'JCP2022_008813', 'JCP2022_008832', 'JCP2022_008897', 'JCP2022_009577', 'JCP2022_009779', 'JCP2022_010497', 'JCP2022_010838', 'JCP2022_011158', 'JCP2022_011279', 'JCP2022_011351', 'JCP2022_011376', 'JCP2022_011380', 'JCP2022_011456', 'JCP2022_011470', 'JCP2022_011755', 'JCP2022_011783', 'JCP2022_011803', 'JCP2022_011827', 'JCP2022_011828', 'JCP2022_011889', 'JCP2022_012102', 'JCP2022_012412', 
'JCP2022_012451', 'JCP2022_012661', 'JCP2022_012931', 'JCP2022_013123', 'JCP2022_013315', 'JCP2022_013323', 'JCP2022_013338', 'JCP2022_013427', 'JCP2022_013567', 'JCP2022_013873', 'JCP2022_014017', 'JCP2022_014181', 'JCP2022_014214', 'JCP2022_014279', 'JCP2022_014290', 'JCP2022_014357', 'JCP2022_014424', 'JCP2022_014700', 'JCP2022_014703', 'JCP2022_014743', 'JCP2022_014872', 'JCP2022_014920', 'JCP2022_014943', 'JCP2022_015075', 'JCP2022_015128', 'JCP2022_015248', 'JCP2022_015598', 'JCP2022_015715', 'JCP2022_015733', 'JCP2022_015773', 'JCP2022_015814', 'JCP2022_015939', 'JCP2022_015987', 'JCP2022_016049', 'JCP2022_016130', 'JCP2022_016147', 'JCP2022_016250', 'JCP2022_016364', 'JCP2022_016371', 'JCP2022_016380', 'JCP2022_016724', 'JCP2022_016766', 'JCP2022_017156', 'JCP2022_017222', 'JCP2022_017249', 'JCP2022_017393', 'JCP2022_017532', 'JCP2022_017701', 'JCP2022_017739', 'JCP2022_018021', 'JCP2022_018248', 'JCP2022_018406', 'JCP2022_018471', 'JCP2022_018479', 'JCP2022_018513', 'JCP2022_018957', 'JCP2022_018985', 'JCP2022_018990', 'JCP2022_019057', 'JCP2022_019186', 'JCP2022_019320', 'JCP2022_019334', 'JCP2022_019486', 'JCP2022_019722', 'JCP2022_019748', 'JCP2022_019985', 'JCP2022_019997', 'JCP2022_020073', 'JCP2022_020275', 'JCP2022_020454', 'JCP2022_020595', 'JCP2022_020805', 'JCP2022_021098', 'JCP2022_021120', 'JCP2022_021237', 'JCP2022_021335', 'JCP2022_021457', 'JCP2022_021649', 'JCP2022_021667', 'JCP2022_021804', 'JCP2022_021957', 'JCP2022_022040', 'JCP2022_022503', 'JCP2022_022614', 'JCP2022_022878', 'JCP2022_023013', 'JCP2022_023197', 'JCP2022_023222', 'JCP2022_023231', 'JCP2022_023431', 'JCP2022_023582', 'JCP2022_023782', 'JCP2022_023847', 'JCP2022_023862', 'JCP2022_023886', 'JCP2022_023935', 'JCP2022_024455', 'JCP2022_024491', 'JCP2022_024620', 'JCP2022_024648', 'JCP2022_024834', 'JCP2022_025033', 'JCP2022_025155', 'JCP2022_025311', 'JCP2022_025471', 'JCP2022_025513', 'JCP2022_025581', 'JCP2022_025737', 'JCP2022_025817', 'JCP2022_025859', 'JCP2022_026053', 
'JCP2022_026063', 'JCP2022_026117', 'JCP2022_026202', 'JCP2022_026436', 'JCP2022_026439', 'JCP2022_026585', 'JCP2022_026898', 'JCP2022_027170', 'JCP2022_027266', 'JCP2022_027313', 'JCP2022_027372', 'JCP2022_027375', 'JCP2022_027644', 'JCP2022_027851', 'JCP2022_027918', 'JCP2022_027993', 'JCP2022_028042', 'JCP2022_028149', 'JCP2022_028226', 'JCP2022_028416', 'JCP2022_028507', 'JCP2022_028669', 'JCP2022_028840', 'JCP2022_029114', 'JCP2022_029263', 'JCP2022_029321', 'JCP2022_029527', 'JCP2022_029539', 'JCP2022_029644', 'JCP2022_029866', 'JCP2022_030017', 'JCP2022_030171', 'JCP2022_030410', 'JCP2022_030792', 'JCP2022_030867', 'JCP2022_031131', 'JCP2022_031253', 'JCP2022_031324', 'JCP2022_031334', 'JCP2022_031426', 'JCP2022_031460', 'JCP2022_031574', 'JCP2022_031600', 'JCP2022_031671', 'JCP2022_032059', 'JCP2022_032145', 'JCP2022_032335', 'JCP2022_032341', 'JCP2022_032443', 'JCP2022_032459', 'JCP2022_032694', 'JCP2022_032942', 'JCP2022_033028', 'JCP2022_033229', 'JCP2022_033305', 'JCP2022_033333', 'JCP2022_033367', 'JCP2022_033449', 'JCP2022_033477', 'JCP2022_033608', 'JCP2022_033960', 'JCP2022_034013', 'JCP2022_034040', 'JCP2022_034177', 'JCP2022_034327', 'JCP2022_034403', 'JCP2022_034557', 'JCP2022_034643', 'JCP2022_034656', 'JCP2022_034880', 'JCP2022_035482', 'JCP2022_035611', 'JCP2022_035789', 'JCP2022_035850', 'JCP2022_036098', 'JCP2022_036462', 'JCP2022_036572', 'JCP2022_036663', 'JCP2022_037077', 'JCP2022_037113', 'JCP2022_037115', 'JCP2022_037163', 'JCP2022_037315', 'JCP2022_037372', 'JCP2022_037424', 'JCP2022_037494', 'JCP2022_037650', 'JCP2022_037681', 'JCP2022_037818', 'JCP2022_037856', 'JCP2022_037897', 'JCP2022_037898', 'JCP2022_037961', 'JCP2022_038059', 'JCP2022_038302', 'JCP2022_038385', 'JCP2022_038712', 'JCP2022_039048', 'JCP2022_039245', 'JCP2022_039360', 'JCP2022_039404', 'JCP2022_039447', 'JCP2022_039470', 'JCP2022_039597', 'JCP2022_039732', 'JCP2022_039744', 'JCP2022_039921', 'JCP2022_040164', 'JCP2022_040722', 'JCP2022_040803', 'JCP2022_041203', 
'JCP2022_041413', 'JCP2022_041477', 'JCP2022_041480', 'JCP2022_041691', 'JCP2022_042007', 'JCP2022_042121', 'JCP2022_042234', 'JCP2022_042237', 'JCP2022_042251', 'JCP2022_042380', 'JCP2022_042397', 'JCP2022_042419', 'JCP2022_042469', 'JCP2022_042507', 'JCP2022_042638', 'JCP2022_042699', 'JCP2022_042797', 'JCP2022_042947', 'JCP2022_043248', 'JCP2022_043383', 'JCP2022_043384', 'JCP2022_043747', 'JCP2022_043786', 'JCP2022_043926', 'JCP2022_044173', 'JCP2022_044215', 'JCP2022_044244', 'JCP2022_044421', 'JCP2022_044498', 'JCP2022_044658', 'JCP2022_044804', 'JCP2022_044818', 'JCP2022_044937', 'JCP2022_044948', 'JCP2022_045002', 'JCP2022_045049', 'JCP2022_045311', 'JCP2022_045512', 'JCP2022_045621', 'JCP2022_045660', 'JCP2022_046102', 'JCP2022_046160', 'JCP2022_046492', 'JCP2022_046618', 'JCP2022_046655', 'JCP2022_046818', 'JCP2022_047065', 'JCP2022_047171', 'JCP2022_047237', 'JCP2022_047389', 'JCP2022_047597', 'JCP2022_047684', 'JCP2022_047837', 'JCP2022_047971', 'JCP2022_048474', 'JCP2022_048516', 'JCP2022_048969', 'JCP2022_049099', 'JCP2022_049279', 'JCP2022_049351', 'JCP2022_049364', 'JCP2022_049506', 'JCP2022_049518', 'JCP2022_050031', 'JCP2022_050124', 'JCP2022_050130', 'JCP2022_050284', 'JCP2022_050569', 'JCP2022_050723', 'JCP2022_050732', 'JCP2022_050799', 'JCP2022_050811', 'JCP2022_051135', 'JCP2022_051207', 'JCP2022_051235', 'JCP2022_051243', 'JCP2022_051341', 'JCP2022_051442', 'JCP2022_051528', 'JCP2022_051586', 'JCP2022_051593', 'JCP2022_051648', 'JCP2022_051658', 'JCP2022_051939', 'JCP2022_051954', 'JCP2022_051960', 'JCP2022_051967', 'JCP2022_052044', 'JCP2022_052154', 'JCP2022_052238', 'JCP2022_052383', 'JCP2022_052450', 'JCP2022_052527', 'JCP2022_052689', 'JCP2022_052692', 'JCP2022_052920', 'JCP2022_053140', 'JCP2022_053374', 'JCP2022_053483', 'JCP2022_053518', 'JCP2022_053549', 'JCP2022_053714', 'JCP2022_054038', 'JCP2022_054460', 'JCP2022_054638', 'JCP2022_054852', 'JCP2022_054950', 'JCP2022_055034', 'JCP2022_055360', 'JCP2022_055528', 'JCP2022_055596', 
'JCP2022_055707', 'JCP2022_055744', 'JCP2022_055930', 'JCP2022_056199', 'JCP2022_056214', 'JCP2022_056296', 'JCP2022_056316', 'JCP2022_056590', 'JCP2022_056834', 'JCP2022_056859', 'JCP2022_057038', 'JCP2022_057067', 'JCP2022_057081', 'JCP2022_057083', 'JCP2022_057579', 'JCP2022_057649', 'JCP2022_057666', 'JCP2022_057710', 'JCP2022_058045', 'JCP2022_058401', 'JCP2022_058436', 'JCP2022_058669', 'JCP2022_058734', 'JCP2022_058769', 'JCP2022_058856', 'JCP2022_058865', 'JCP2022_059001', 'JCP2022_059039', 'JCP2022_059104', 'JCP2022_059145', 'JCP2022_059320', 'JCP2022_059321', 'JCP2022_059445', 'JCP2022_059470', 'JCP2022_059732', 'JCP2022_059920', 'JCP2022_059925', 'JCP2022_060093', 'JCP2022_060368', 'JCP2022_060660', 'JCP2022_060685', 'JCP2022_060845', 'JCP2022_060987', 'JCP2022_061413', 'JCP2022_061444', 'JCP2022_061453', 'JCP2022_061499', 'JCP2022_061501', 'JCP2022_061705', 'JCP2022_061889', 'JCP2022_061965', 'JCP2022_062057', 'JCP2022_062110', 'JCP2022_062517', 'JCP2022_062521', 'JCP2022_062529', 'JCP2022_062637', 'JCP2022_062651', 'JCP2022_062738', 'JCP2022_062965', 'JCP2022_063012', 'JCP2022_063165', 'JCP2022_063392', 'JCP2022_063413', 'JCP2022_063501', 'JCP2022_063503', 'JCP2022_063570', 'JCP2022_063614', 'JCP2022_063900', 'JCP2022_064235', 'JCP2022_064339', 'JCP2022_064482', 'JCP2022_064601', 'JCP2022_064738', 'JCP2022_064769', 'JCP2022_065027', 'JCP2022_065092', 'JCP2022_065283', 'JCP2022_065407', 'JCP2022_065409', 'JCP2022_065555', 'JCP2022_065620', 'JCP2022_065639', 'JCP2022_065719', 'JCP2022_065729', 'JCP2022_065824', 'JCP2022_065902', 'JCP2022_066118', 'JCP2022_066504', 'JCP2022_066582', 'JCP2022_066770', 'JCP2022_066813', 'JCP2022_067008', 'JCP2022_067079', 'JCP2022_067096', 'JCP2022_067098', 'JCP2022_067204', 'JCP2022_067261', 'JCP2022_067325', 'JCP2022_067551', 'JCP2022_067835', 'JCP2022_067896', 'JCP2022_067951', 'JCP2022_067976', 'JCP2022_068021', 'JCP2022_068076', 'JCP2022_068120', 'JCP2022_068289', 'JCP2022_068526', 'JCP2022_068576', 'JCP2022_068607', 
'JCP2022_068719', 'JCP2022_068938', 'JCP2022_069158', 'JCP2022_069282', 'JCP2022_069317', 'JCP2022_069330', 'JCP2022_069464', 'JCP2022_069495', 'JCP2022_069654', 'JCP2022_069682', 'JCP2022_069705', 'JCP2022_069756', 'JCP2022_069838', 'JCP2022_069885', 'JCP2022_069920', 'JCP2022_069992', 'JCP2022_070249', 'JCP2022_070321', 'JCP2022_070328', 'JCP2022_070433', 'JCP2022_070476', 'JCP2022_070527', 'JCP2022_070681', 'JCP2022_070747', 'JCP2022_070810', 'JCP2022_070865', 'JCP2022_070924', 'JCP2022_071011', 'JCP2022_071140', 'JCP2022_071181', 'JCP2022_071351', 'JCP2022_071658', 'JCP2022_071899', 'JCP2022_071900', 'JCP2022_071932', 'JCP2022_072210', 'JCP2022_072576', 'JCP2022_072870', 'JCP2022_072969', 'JCP2022_073292', 'JCP2022_073300', 'JCP2022_073736', 'JCP2022_074078', 'JCP2022_074449', 'JCP2022_074671', 'JCP2022_074885', 'JCP2022_075038', 'JCP2022_075068', 'JCP2022_075146', 'JCP2022_075188', 'JCP2022_075527', 'JCP2022_075567', 'JCP2022_075568', 'JCP2022_075837', 'JCP2022_075926', 'JCP2022_076050', 'JCP2022_076206', 'JCP2022_076410', 'JCP2022_076424', 'JCP2022_076618', 'JCP2022_076893', 'JCP2022_077006', 'JCP2022_077037', 'JCP2022_077056', 'JCP2022_077640', 'JCP2022_077713', 'JCP2022_077715', 'JCP2022_077893', 'JCP2022_077904', 'JCP2022_077973', 'JCP2022_078010', 'JCP2022_078021', 'JCP2022_078086', 'JCP2022_078096', 'JCP2022_078125', 'JCP2022_078341', 'JCP2022_078575', 'JCP2022_078873', 'JCP2022_078906', 'JCP2022_078955', 'JCP2022_079430', 'JCP2022_079471', 'JCP2022_079560', 'JCP2022_079809', 'JCP2022_079859', 'JCP2022_079888', 'JCP2022_079951', 'JCP2022_080122', 'JCP2022_080303', 'JCP2022_080321', 'JCP2022_080530', 'JCP2022_080651', 'JCP2022_080663', 'JCP2022_080682', 'JCP2022_081255', 'JCP2022_081472', 'JCP2022_081488', 'JCP2022_081783', 'JCP2022_081787', 'JCP2022_081843', 'JCP2022_081859', 'JCP2022_081929', 'JCP2022_081961', 'JCP2022_081978', 'JCP2022_081979', 'JCP2022_082118', 'JCP2022_082286', 'JCP2022_082459', 'JCP2022_082527', 'JCP2022_082568', 'JCP2022_082831', 
'JCP2022_082890', 'JCP2022_082937', 'JCP2022_083077', 'JCP2022_083111', 'JCP2022_083142', 'JCP2022_083277', 'JCP2022_083367', 'JCP2022_083406', 'JCP2022_083517', 'JCP2022_083650', 'JCP2022_083959', 'JCP2022_083966', 'JCP2022_084049', 'JCP2022_084151', 'JCP2022_084307', 'JCP2022_084312', 'JCP2022_084354', 'JCP2022_084636', 'JCP2022_084646', 'JCP2022_084695', 'JCP2022_085204', 'JCP2022_085320', 'JCP2022_085339', 'JCP2022_085407', 'JCP2022_085421', 'JCP2022_085567', 'JCP2022_085727', 'JCP2022_085796', 'JCP2022_085856', 'JCP2022_085963', 'JCP2022_086251', 'JCP2022_086487', 'JCP2022_086728', 'JCP2022_086813', 'JCP2022_086924', 'JCP2022_087089', 'JCP2022_087147', 'JCP2022_087299', 'JCP2022_087672', 'JCP2022_087995', 'JCP2022_088002', 'JCP2022_088124', 'JCP2022_088174', 'JCP2022_088258', 'JCP2022_088295', 'JCP2022_088337', 'JCP2022_088487', 'JCP2022_088755', 'JCP2022_088778', 'JCP2022_088867', 'JCP2022_088873', 'JCP2022_089185', 'JCP2022_089213', 'JCP2022_089335', 'JCP2022_089362', 'JCP2022_089384', 'JCP2022_089484', 'JCP2022_089614', 'JCP2022_089969', 'JCP2022_090068', 'JCP2022_090306', 'JCP2022_090348', 'JCP2022_090442', 'JCP2022_090474', 'JCP2022_090475', 'JCP2022_090645', 'JCP2022_091303', 'JCP2022_091314', 'JCP2022_091372', 'JCP2022_091670', 'JCP2022_091688', 'JCP2022_091703', 'JCP2022_091732', 'JCP2022_091801', 'JCP2022_091840', 'JCP2022_091842', 'JCP2022_091994', 'JCP2022_092053', 'JCP2022_092207', 'JCP2022_092497', 'JCP2022_092500', 'JCP2022_092561', 'JCP2022_092620', 'JCP2022_092669', 'JCP2022_092742', 'JCP2022_092752', 'JCP2022_092847', 'JCP2022_092978', 'JCP2022_093068', 'JCP2022_093172', 'JCP2022_093174', 'JCP2022_093422', 'JCP2022_093506', 'JCP2022_093541', 'JCP2022_093623', 'JCP2022_093682', 'JCP2022_093742', 'JCP2022_093771', 'JCP2022_093913', 'JCP2022_094053', 'JCP2022_094082', 'JCP2022_094196', 'JCP2022_094319', 'JCP2022_094807', 'JCP2022_095016', 'JCP2022_095094', 'JCP2022_095234', 'JCP2022_095360', 'JCP2022_095389', 'JCP2022_095597', 'JCP2022_095608', 
'JCP2022_095884', 'JCP2022_096046', 'JCP2022_096525', 'JCP2022_096632', 'JCP2022_096757', 'JCP2022_096785', 'JCP2022_096947', 'JCP2022_097210', 'JCP2022_097243', 'JCP2022_097392', 'JCP2022_097395', 'JCP2022_097407', 'JCP2022_097436', 'JCP2022_097442', 'JCP2022_097679', 'JCP2022_098008', 'JCP2022_098029', 'JCP2022_098184', 'JCP2022_098510', 'JCP2022_098513', 'JCP2022_098976', 'JCP2022_099350', 'JCP2022_099405', 'JCP2022_099643', 'JCP2022_100024', 'JCP2022_100125', 'JCP2022_100137', 'JCP2022_100152', 'JCP2022_100393', 'JCP2022_100444', 'JCP2022_100514', 'JCP2022_100606', 'JCP2022_100823', 'JCP2022_101019', 'JCP2022_101271', 'JCP2022_101483', 'JCP2022_101517', 'JCP2022_101862', 'JCP2022_101970', 'JCP2022_102075', 'JCP2022_102126', 'JCP2022_102405', 'JCP2022_102666', 'JCP2022_102751', 'JCP2022_102765', 'JCP2022_102845', 'JCP2022_102906', 'JCP2022_103009', 'JCP2022_103010', 'JCP2022_103027', 'JCP2022_103207', 'JCP2022_103256', 'JCP2022_103330', 'JCP2022_103413', 'JCP2022_103469', 'JCP2022_103594', 'JCP2022_103839', 'JCP2022_103938', 'JCP2022_103971', 'JCP2022_104022', 'JCP2022_104031', 'JCP2022_104057', 'JCP2022_104111', 'JCP2022_104296', 'JCP2022_104315', 'JCP2022_104470', 'JCP2022_104601', 'JCP2022_105109', 'JCP2022_105139', 'JCP2022_105221', 'JCP2022_105255', 'JCP2022_105346', 'JCP2022_105541', 'JCP2022_105649', 'JCP2022_105721', 'JCP2022_105752', 'JCP2022_105942', 'JCP2022_106138', 'JCP2022_106153', 'JCP2022_106215', 'JCP2022_106301', 'JCP2022_106440', 'JCP2022_106682', 'JCP2022_106709', 'JCP2022_106847', 'JCP2022_107127', 'JCP2022_107210', 'JCP2022_107330', 'JCP2022_107375', 'JCP2022_107456', 'JCP2022_107583', 'JCP2022_107605', 'JCP2022_107652', 'JCP2022_107693', 'JCP2022_107697', 'JCP2022_108320', 'JCP2022_108340', 'JCP2022_108350', 'JCP2022_108527', 'JCP2022_108660', 'JCP2022_108852', 'JCP2022_108892', 'JCP2022_108988', 'JCP2022_109089', 'JCP2022_109110', 'JCP2022_109285', 'JCP2022_109305', 'JCP2022_109461', 'JCP2022_109544', 'JCP2022_109640', 'JCP2022_109733', 
'JCP2022_109895', 'JCP2022_109936', 'JCP2022_110045', 'JCP2022_110257', 'JCP2022_110304', 'JCP2022_110429', 'JCP2022_110492', 'JCP2022_110513', 'JCP2022_110578', 'JCP2022_110682', 'JCP2022_110700', 'JCP2022_110724', 'JCP2022_111032', 'JCP2022_111046', 'JCP2022_111219', 'JCP2022_111251', 'JCP2022_111274', 'JCP2022_111308', 'JCP2022_111723', 'JCP2022_111981', 'JCP2022_112107', 'JCP2022_112164', 'JCP2022_112501', 'JCP2022_112561', 'JCP2022_112640', 'JCP2022_112697', 'JCP2022_112724', 'JCP2022_112742', 'JCP2022_112865', 'JCP2022_112945', 'JCP2022_112973', 'JCP2022_112978', 'JCP2022_113141', 'JCP2022_113162', 'JCP2022_113208', 'JCP2022_113324', 'JCP2022_113355', 'JCP2022_113509', 'JCP2022_113791', 'JCP2022_113798', 'JCP2022_114064', 'JCP2022_114073', 'JCP2022_114168', 'JCP2022_114253', 'JCP2022_114284', 'JCP2022_114359', 'JCP2022_114363', 'JCP2022_114576', 'JCP2022_114599', 'JCP2022_114633', 'JCP2022_114660', 'JCP2022_114900', 'JCP2022_114905', 'JCP2022_114925', 'JCP2022_114940', 'JCP2022_115043', 'JCP2022_115117', 'JCP2022_115124', 'JCP2022_115177', 'JCP2022_115310', 'JCP2022_115573', 'JCP2022_115763', 'JCP2022_115969', 'JCP2022_115974', 'JCP2022_116019', 'JCP2022_116131', 'JCP2022_116269', 'JCP2022_116332', 'JCP2022_116641', 'JCP2022_116740'}, 'crispr': set(), 'orf': {'JCP2022_915133', 'JCP2022_915134', 'JCP2022_915135', 'JCP2022_915136', 'JCP2022_915137', 'JCP2022_915138', 'JCP2022_915139', 'JCP2022_915140', 'JCP2022_915141', 'JCP2022_915142'}} ```

Yielding these numbers for compound, crispr and orf respectively:

[len(x) for x in d.values()]

[957, 0, 10]

LMK if you think we should put this somewhere.
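If the list does get documented somewhere, one option is a small CSV with one row per missing ID. The sketch below assumes the `d` dictionary from the script above (dataset name mapped to a set of IDs); the function names and the output column `dataset` are hypothetical.

```python
import csv


def missing_ids_to_rows(missing):
    """Flatten {dataset: set_of_ids} into sorted (dataset, id) rows."""
    return [
        (dataset, jcp_id)
        for dataset in sorted(missing)
        for jcp_id in sorted(missing[dataset])
    ]


def write_missing_ids(missing, path):
    """Write the flattened rows to a CSV file with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["dataset", "Metadata_JCP2022"])
        writer.writerows(missing_ids_to_rows(missing))
```

Sorting both the dataset names and the IDs keeps the file stable between runs, so any later change to the list shows up as a clean diff.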

afermg commented 3 months ago

Or should we remove them from their respective X.csv.gz?

ashah03 commented 3 months ago

FYI: this compound is present in the harmony cell painting features that @johnarevalo produced. How is that possible if it got dropped? Unless there are two JCP_IDs for the same SMILES (I matched with John's features based on SMILES):

CCC(=O)OC(CC(=O)[O-])C[N+](C)(C)C

niranjchandrasekaran commented 3 months ago

> FYI: this compound is present in the harmony cell painting features that @johnarevalo produced -- how is that possible if it got dropped? Unless there are two JCP_IDs for the same SMILES (I matched with john's features based on SMILES)

Must be the case. If you look at JCP2022_088779, it seems to be pretty much the same as JCP2022_088778 (based on their InChIKey) and that one is present in well.csv.gz
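A rough way to look for more such near-duplicates is to group compounds by the first block of their InChIKey, which encodes the molecular skeleton, and check whether any sibling ID is present in wells. This is a sketch only: it assumes the compound table has a `Metadata_InChIKey` column, takes rows as plain dictionaries, and the function names are invented for illustration.

```python
from collections import defaultdict


def group_by_connectivity(compound_rows):
    """Map the first InChIKey block to the set of JCP IDs that share it."""
    groups = defaultdict(set)
    for row in compound_rows:
        # An InChIKey has three hyphen-separated blocks; the first one
        # (14 characters) hashes the connectivity of the molecule.
        skeleton = row["Metadata_InChIKey"].split("-")[0]
        groups[skeleton].add(row["Metadata_JCP2022"])
    return groups


def profiled_twins(missing_ids, compound_rows, well_ids):
    """For each missing ID, list same-skeleton IDs that do appear in wells."""
    groups = group_by_connectivity(compound_rows)
    id_to_skeleton = {
        jcp_id: skeleton for skeleton, ids in groups.items() for jcp_id in ids
    }
    return {
        jcp_id: sorted((groups[id_to_skeleton[jcp_id]] - {jcp_id}) & well_ids)
        for jcp_id in missing_ids
        if jcp_id in id_to_skeleton
    }
```

Missing IDs that come back with a non-empty list are candidates for the "two IDs, one structure" explanation; the ones with an empty list would need a different one.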

niranjchandrasekaran commented 3 months ago

I am currently more interested in the missing ORF reagents than in the compounds. Perhaps they will help us figure out what's happening with the compounds. I couldn't find the missing ORFs in the metadata file (internal link) that I have been using for ORFs, but I can find them in another file (internal link). So, the answer must be in the source file that was used to create the metadata files in this repo. Maybe @shntnu already knows why those compounds and ORFs are missing, so I will wait for his thoughts before diving deeper into this.

afermg commented 3 months ago

jump_portrait uses the current versions of https://github.com/jump-cellpainting/datasets/tree/main/metadata.

If we want to be precise, the source code is here:

    METADATA_LOCATION = (
        "https://github.com/jump-cellpainting/datasets/raw/"
        "baacb8be98cfa4b5a03b627b8cd005de9f5c2e70/metadata/"
        "{}.csv.gz"
    )

IIRC the hash is the same as the current master. I use permalinks for reproducibility.

shntnu commented 3 months ago

> Must be the case. If you look at JCP2022_088779, it seems to be pretty much the same as JCP2022_088778 (based on their InChIKey) and that one is present in well.csv.gz

Yep, it is certainly possible that a few more of the JCP2022 IDs listed as unique in compound.csv.gz are duplicates of compounds that are present in well.csv.gz. I think it is wise to remove these 957. I'll note that all 957 are present in the original (internal) source, so we can look up more details there: https://github.com/jump-cellpainting/jump-cellpainting/blob/master/3.standardize/standardize_ksiling_jumpmoa_jumptarget2/data/05_release/2022_10_18_JUMP-CP_compound_library_aggregated.csv

> So, the answer must be in the source file that was used to create the metadata files in this repo.

Not all compounds that were planned were actually profiled (n=957 apparently, although some might be explained by SMILES inconsistency)

Source files: https://github.com/jump-cellpainting/jump-cellpainting/tree/master/3.standardize/standardize_ksiling_jumpmoa_jumptarget2
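Dropping the compounds that were never profiled could look roughly like the sketch below. It is not the script used in this repo: it assumes rows are dictionaries keyed by column name, and `split_profiled` is a made-up helper. Returning the dropped rows as well makes it easy to keep a record of what was removed.

```python
def split_profiled(compound_rows, well_ids, key="Metadata_JCP2022"):
    """Split rows into (kept, dropped) by whether their ID appears in wells."""
    kept, dropped = [], []
    for row in compound_rows:
        # Rows whose ID has at least one well are kept; the rest are set aside.
        (kept if row[key] in well_ids else dropped).append(row)
    return kept, dropped
```

With the current tables, `dropped` would be expected to hold the 957 compound rows reported earlier in this thread.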

shntnu commented 3 months ago

> I couldn't find the missing ORFs in the metadata file (internal link) that I have been using for ORFs, but I can find them in another file (internal link).

These are in Target2 plates cpg0000-jump-pilot[orf] but not in cpg0016-jump[orf]

https://github.com/jump-cellpainting/JUMP-Target/blob/master/JUMP-Target-1_orf_metadata.tsv

These should not be removed

@afermg – it will indeed be great to document these somewhere

afermg commented 3 months ago

How do you think we should approach the missing ORF entries? I don't love the idea of pipelines breaking due to these, but adding them as exceptions in my tools seems like an anti-pattern. Any thoughts @shntnu @niranjchandrasekaran?

afermg commented 3 months ago

In a related topic, could you let me know when the entries have been removed? I will need to update my tools to point to the upgraded versions. Thanks!

shntnu commented 3 months ago

> How do you think we should approach the missing ORF entries? I don't love the idea of pipelines breaking due to these, but adding them as exceptions in my tools seems like an anti-pattern. Any thoughts @shntnu @niranjchandrasekaran?

They are missing in cpg0016 but present in cpg0000. JUMP comprises 4 datasets: cpg000{0,1,2} and cpg0016, so they are not missing per se.

Can you explain why pipelines would break?

shntnu commented 3 months ago

> In a related topic, could you let me know when the entries have been removed? I will need to update my tools to point to the upgraded versions. Thanks!

We are versioning this repo; I suspect pegging versions would be the way to go (instead of having a process to report changes). What do you think?
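On the consumer side, pinning could be as simple as making the git reference in the metadata URL a parameter. The sketch below reuses the URL pattern and commit hash quoted from jump_portrait earlier in this thread; the `{ref}`/`{table}` placeholders and the `metadata_url` helper are assumptions for illustration, not the library's actual code.

```python
METADATA_LOCATION = (
    "https://github.com/jump-cellpainting/datasets/raw/"
    "{ref}/metadata/{table}.csv.gz"
)

# Commit hash quoted earlier in this thread; replace it when upgrading.
PINNED_REF = "baacb8be98cfa4b5a03b627b8cd005de9f5c2e70"


def metadata_url(table, ref=PINNED_REF):
    """Build the download URL of a metadata table at a fixed git reference."""
    return METADATA_LOCATION.format(ref=ref, table=table)
```

Moving to a newer release would then be a one-line change of `PINNED_REF` to another commit hash or a release tag, with no need for a separate change-notification process.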

niranjchandrasekaran commented 3 months ago

> These are in Target2 plates cpg0000-jump-pilot[orf] but not in cpg0016-jump[orf]
>
> They are missing in cpg0016 but present in cpg0000. JUMP comprises 4 datasets: cpg000{0,1,2} and cpg0016 so it is not missing per se.

Ah, I thought I recognized the gene names from somewhere. This makes sense.