jump-cellpainting / datasets

Images and other data from the JUMP Cell Painting Consortium
BSD 3-Clause "New" or "Revised" License
149 stars 13 forks source link

Specify which compounds were nominated by which source #106

Closed shntnu closed 2 months ago

shntnu commented 3 months ago

https://github.com/jump-cellpainting/datasets-private/pull/89

afermg commented 2 months ago

In these cases, am I supposed to review that the proper columns are present in the new dataset or is this related to https://github.com/jump-cellpainting/datasets-private/issues/87? Thanks!

shntnu commented 2 months ago

Sorry for not clarifying the scope of the requested review.

afermg commented 2 months ago

I ran some basic data analysis @shntnu , LMK if anything seems off. Source 4 has very few JCP ids, but maybe that is normal.

#!/usr/bin/env jupyter
import polars as pl

path = "https://github.com/jump-cellpainting/datasets/raw/86c910c55d389c135109b80a6fe36dd4a4140d95/metadata/compound_source.csv.gz"
jcp_sources = pl.read_csv(path)

"""
shape: (127_060, 2)
┌──────────────────────────┬──────────────────┐
│ Metadata_Compound_Source ┆ Metadata_JCP2022 │
│ ---                      ┆ ---              │
│ str                      ┆ str              │
╞══════════════════════════╪══════════════════╡
│ source_1                 ┆ JCP2022_000007   │
│ source_1                 ┆ JCP2022_000009   │
│ …                        ┆ …                │
│ source_9                 ┆ JCP2022_116726   │
│ source_9                 ┆ JCP2022_116750   │
└──────────────────────────┴──────────────────┘
"""
# %% [markdown]
# We can now perform count the sources for each.
# %%
col = "Metadata_Compound_Source"
with pl.Config(tbl_rows=12):
    print(
        jcp_sources.group_by(col)
        .agg(pl.count())
        .with_columns(
            pl.concat_str(
                pl.lit("source_"), pl.col(col).str.replace("source_", "").str.zfill(2)
            ).alias(col)
        )
        .sort(by=col)
    )

"""
┌──────────────────────────┬───────┐
│ Metadata_Compound_Source ┆ count │
│ ---                      ┆ ---   │
│ str                      ┆ u32   │
╞══════════════════════════╪═══════╡
│ source_01                ┆ 11999 │
│ source_02                ┆ 11996 │
│ source_03                ┆ 11060 │
│ source_04                ┆ 367   │
│ source_05                ┆ 12000 │
│ source_06                ┆ 12443 │
│ source_07                ┆ 6061  │
│ source_08                ┆ 12212 │
│ source_09                ┆ 12273 │
│ source_10                ┆ 11979 │
│ source_11                ┆ 12000 │
│ source_15                ┆ 12670 │
└──────────────────────────┴───────┘
 """
afermg commented 2 months ago

I just read your code and see that it does match the original source by source.

niranjchandrasekaran commented 2 months ago

Source 4 has very few JCP ids, but maybe that is normal.

Can confirm that this is expected.

shntnu commented 2 months ago
  • Checked the data, it matches the input in dimension and size. Looks good to me.

Thanks

  • Checked the code, looks okay to me. My one concern is that it reads a local csv instead of pointing to a remote location. I understand it comes from a private repo but I would still suggest to add the original location as a comment before loading it. This ensures that it can be reproduced once the files disappears from the system (and also removes the need of changing the path).

Wise words Now fixed: https://github.com/jump-cellpainting/datasets-private/commit/91cd061a7083d655b50e75d8d91e8316f4d08fb1

  • Checked the schema. Shouldn't it be connected to the Wells file too? As in, the JCP link both tables directly, and we can go from a source to its associated wells.

Good point – this schema is not very nicely normalized so I'd avoid adding more relationships to avoid making it more confusing. I did fix a relationship though 25733fd

I'll approve this PR but if you get the time it'd be nice to link the data source on the code that generates this table.

Done