jump-cellpainting / datasets

Images and other data from the JUMP Cell Painting Consortium
BSD 3-Clause "New" or "Revised" License

Merging on compounds InChIKey #98

cuongqn closed this issue 3 months ago

cuongqn commented 4 months ago

Hello,

We're trying to match compound InChIKeys between three JUMP metadata tables (the JUMP-MOA compound metadata, the JUMP-Target-2 compound metadata, and the full JUMP compound metadata in this repo) and observed the following overlaps (the counts are noted in the comments of the code below):

Is this behavior expected when merging between the above metadata tables?

Reproducing the behavior


```py
import pandas as pd

# Load JUMP metadata files
well = pd.read_csv("https://github.com/jump-cellpainting/datasets/raw/main/metadata/well.csv.gz")
compound = pd.read_csv("https://github.com/jump-cellpainting/datasets/raw/main/metadata/compound.csv.gz")
plate = pd.read_csv("https://github.com/jump-cellpainting/datasets/raw/main/metadata/plate.csv.gz")

# Load JUMP-MOA and JUMP-Target-2 compound metadata files
compound_moa = pd.read_csv("https://raw.githubusercontent.com/jump-cellpainting/JUMP-MOA/master/JUMP-MOA_compound_metadata.tsv", sep="\t")
compound_target2 = pd.read_csv("https://raw.githubusercontent.com/jump-cellpainting/JUMP-Target/master/JUMP-Target-2_compound_metadata.tsv", sep="\t")

# Merge metadata and keep only the TARGET2 plates
metadata = well.merge(compound, on="Metadata_JCP2022", how="left")
metadata = metadata.merge(plate, on=["Metadata_Source", "Metadata_Plate"])
metadata = metadata[metadata.Metadata_PlateType == "TARGET2"]

# Get intersection of unique InChIKey between JUMP-Target-2 and JUMP compounds after selecting TARGET2 plates
print(len(set(compound_target2.InChIKey.unique()).intersection(metadata.Metadata_InChIKey.unique())))  # = 182

# Get intersection of unique InChIKey between JUMP-Target-2 and JUMP-MOA
print(len(set(compound_target2.InChIKey.unique()).intersection(compound_moa.InChIKey.unique())))  # = 18
```

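The same overlap check can also be expressed as an outer merge with pandas' `indicator=True`, which reports matched and unmatched keys on both sides in one pass. This is only a minimal sketch: the helper name `overlap_counts` is hypothetical, and the two toy tables use placeholder keys, not real InChIKeys or real JUMP data.

```python
import pandas as pd

# Toy stand-ins for the real tables. The keys are placeholders invented for
# illustration; they are NOT real InChIKeys.
target2 = pd.DataFrame({"InChIKey": ["KEY-A", "KEY-B", "KEY-C", "KEY-C"]})
jump = pd.DataFrame({"Metadata_InChIKey": ["KEY-B", "KEY-C", "KEY-D"]})


def overlap_counts(left_keys: pd.Series, right_keys: pd.Series) -> dict:
    """Count unique keys present in both series, only in the left, only in the right."""
    # Deduplicate and drop missing values so each key is counted once.
    left = pd.DataFrame({"key": left_keys.dropna().unique()})
    right = pd.DataFrame({"key": right_keys.dropna().unique()})
    # The outer merge keeps every key; the "_merge" column records its origin.
    merged = left.merge(right, on="key", how="outer", indicator=True)
    return {
        "both": int((merged["_merge"] == "both").sum()),
        "left_only": int((merged["_merge"] == "left_only").sum()),
        "right_only": int((merged["_merge"] == "right_only").sum()),
    }


counts = overlap_counts(target2["InChIKey"], jump["Metadata_InChIKey"])
print(counts)
```

With the real tables, the call would presumably be `overlap_counts(compound_target2.InChIKey, metadata.Metadata_InChIKey)`, where `both` corresponds to the intersection size printed above.
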
shntnu commented 3 months ago

Thanks for reporting

This will be fixed once we have released the updated map for Target2 via https://github.com/jump-cellpainting/datasets/issues/80 and https://github.com/jump-cellpainting/datasets/issues/86

shntnu commented 3 months ago

> This will be fixed once we have released the updated map for Target2 via #80 and #86

I believe this should now be fixed, but please report back if not, @cuongqn

```py
import pandas as pd

# Load JUMP metadata files
well = pd.read_csv(
    "https://github.com/jump-cellpainting/datasets/raw/main/metadata/well.csv.gz"
)
compound = pd.read_csv(
    "https://github.com/jump-cellpainting/datasets/raw/main/metadata/compound.csv.gz"
)
plate = pd.read_csv(
    "https://github.com/jump-cellpainting/datasets/raw/main/metadata/plate.csv.gz"
)

# Load JUMP-MOA and JUMP-Target-2 compound metadata files
compound_moa = pd.read_csv(
    "https://raw.githubusercontent.com/jump-cellpainting/JUMP-MOA/master/JUMP-MOA_compound_metadata.tsv",
    sep="\t",
)
compound_target2 = pd.read_csv(
    "https://raw.githubusercontent.com/jump-cellpainting/JUMP-Target/master/JUMP-Target-2_compound_metadata.tsv",
    sep="\t",
)

# Merge metadata
metadata = well.merge(compound, on="Metadata_JCP2022", how="left")
metadata = metadata.merge(plate, on=["Metadata_Source", "Metadata_Plate"])
metadata_target2 = metadata[metadata.Metadata_PlateType == "TARGET2"]

# Get intersection of unique InChIKey between JUMP-Target-2 and JUMP compounds after selecting TARGET2 plates
print(
    len(
        set(compound_target2.InChIKey.unique()).intersection(
            metadata_target2.Metadata_InChIKey.unique()
        )
    )
)  # = 302

# Get intersection of unique InChIKey between JUMP-Target-2 and JUMP-MOA
print(
    len(
        set(compound_target2.InChIKey.unique()).intersection(
            compound_moa.InChIKey.unique()
        )
    )
)  # = 18

# Get intersection of unique InChIKey between JUMP-MOA and JUMP compounds
print(
    len(
        set(compound_moa.InChIKey.unique()).intersection(
            metadata.Metadata_InChIKey.unique()
        )
    )
)  # = 76

# Get set diff of unique InChIKey between JUMP-MOA and JUMP compounds
print(
    set(compound_moa.InChIKey.unique()).difference(
        metadata.Metadata_InChIKey.unique()
    )
)
# {'GCWIQUVXWZWCLE-UHFFFAOYSA-N', 'XSIOKTWDEOJMGG-UHFFFAOYSA-O',
#  'AOJQBABIGYNZOY-UHFFFAOYSA-N', 'ODADKLYLWWCHNB-UHFFFAOYSA-N',
#  'XEVJUIZOZCFECP-UHFFFAOYSA-N', 'UHAXDAKQGVISBZ-UHFFFAOYSA-N',
#  'XAPVAKKLQGLNOY-UHFFFAOYSA-N', 'XIXXNJFWPAVKFR-UHFFFAOYSA-N'}

# Get size of the set diff of unique InChIKey between JUMP-MOA and JUMP compounds
print(
    len(
        set(compound_moa.InChIKey.unique()).difference(
            metadata.Metadata_InChIKey.unique()
        )
    )
)  # 8

# Get size of the set diff of unique InChIKey between JUMP-Target-2 and JUMP compounds
print(
    len(
        set(compound_target2.InChIKey.unique()).difference(
            metadata.Metadata_InChIKey.unique()
        )
    )
)  # 0
```
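One possible way to look into the JUMP-MOA keys that still have no exact match is to compare only the first block of each InChIKey. The first 14 characters hash the connectivity, while the second block covers layers such as stereochemistry and the final character indicates protonation. Keys that agree on the first block but differ afterwards may therefore refer to closely related forms of the same structure. Whether this explains the unmatched keys above is not established in this thread. The sketch below is an assumption-laden illustration: the helper names `first_block` and `match_by_first_block` are hypothetical, and all keys are made-up placeholders, not real compounds.

```python
def first_block(inchikey: str) -> str:
    """Return the first (14-character, connectivity) block of an InChIKey."""
    return inchikey.split("-")[0]


def match_by_first_block(unmatched, reference):
    """Map each unmatched key to the reference keys that share its first block."""
    by_block = {}
    for key in reference:
        by_block.setdefault(first_block(key), []).append(key)
    return {key: sorted(by_block.get(first_block(key), [])) for key in unmatched}


# Placeholder keys invented for illustration; they are NOT real InChIKeys.
unmatched = ["AAAAAAAAAAAAAA-UHFFFAOYSA-N", "CCCCCCCCCCCCCC-UHFFFAOYSA-N"]
reference = ["AAAAAAAAAAAAAA-BBBBBBBBSA-N", "DDDDDDDDDDDDDD-UHFFFAOYSA-N"]

candidates = match_by_first_block(unmatched, reference)
for key, hits in candidates.items():
    print(key, "->", hits)
```

With the real tables, `unmatched` would be the set difference printed above and `reference` would be the unique values of `metadata.Metadata_InChIKey` with missing values dropped. An empty list means that no key in the reference shares the same connectivity block.
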