fix #104: drop samples not in wells.csv.gz

afermg commented 2 months ago

After our discussions related to #104 and this private issue, we came to the conclusion that we should drop the samples in {crispr, orf, compound}.csv.gz that are missing from the well table. These were breaking automated workflows. I also include the crispr dataset, even though no entries are dropped from it, in case there is a problem with compression down the line (it's also only 64 KB, so it doesn't make much of a difference).

On a related note, it's worth thinking about migrating the data to Parquet at some point, which would enable lazy-loading metadata fields without downloading entire files. Let me know if that's a possibility to consider so I can open a related issue.

The code to generate these new files is shown below, and its dependencies are specified here (alongside a poetry.lock file), but the important ones are:

Dependencies

python = ">=3.10, <3.11"
s3path = "^0.5.0"
boto3 = ">=1.33.1"
polars = "^0.19.19"
pooch = "^1.7.0"
pyarrow = "^14.0.1"

Code

"""
Find JCP IDs that are in the {crispr, orf, compound} datasets but not in the well dataset.
Update the crispr, orf and compound dataframes to remove those entries and save them into the folder.
"""

import gzip

import polars as pl
import pooch

def get_table(table_name: str) -> pl.DataFrame:
    # Obtained from broad_portrait
    METADATA_LOCATION = (
        "https://github.com/jump-cellpainting/datasets/raw/"
        "baacb8be98cfa4b5a03b627b8cd005de9f5c2e70/metadata/"
        "{}.csv.gz"
    )
    METAFILE_HASH = {
        "compound": "a6e18f8728ab018bd03fe83e845b6c623027c3baf211e7b27fc0287400a33052",
        "well": "677d3c1386d967f10395e86117927b430dca33e4e35d9607efe3c5c47c186008",
        "crispr": "979f3c4e863662569cc36c46eaff679aece2c4466a3e6ba0fb45752b40d2bd43",
        "orf": "fbd644d8ccae4b02f623467b2bf8d9762cf8a224c169afa0561fedb61a697c18",
        "plate": "745391d930627474ec6e3083df8b5c108db30408c0d670cdabb3b79f66eaff48",
    }

    return pl.read_csv(
        pooch.retrieve(
            url=METADATA_LOCATION.format(table_name),
            known_hash=METAFILE_HASH[table_name],
        ),
        use_pyarrow=True,
    )

well_jcp = set(get_table("well")["Metadata_JCP2022"])
datasets = ("compound", "crispr", "orf")
d = {}
for dataset in datasets:
    dataset_jcp = get_table(dataset)["Metadata_JCP2022"]
    n_original = len(dataset_jcp)
    d[dataset] = set(dataset_jcp).intersection(well_jcp)
    print(f"Dataset {dataset} contains {n_original-len(d[dataset])} fewer entries")
"""
Dataset compound contains 957 fewer entries
Dataset crispr contains 0 fewer entries
Dataset orf contains 10 fewer entries
"""
# %% Save dataset
for name, dset in d.items():
    with gzip.open(f"{name}.csv.gz", "wb") as f:
        get_table(name).filter(pl.col("Metadata_JCP2022").is_in(dset)).write_csv(f)
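A small sanity check for the filtering step above, sketched with the standard library only and toy in-memory data. The helpers `jcp_ids` and `to_gz_csv` and the sample IDs are hypothetical and not part of this PR; the point is to list which IDs would be dropped and to confirm that everything kept is present in the well table:

```python
import csv
import gzip
import io


def jcp_ids(gz_bytes: bytes, column: str = "Metadata_JCP2022") -> set[str]:
    """Return the set of values found in `column` of a gzip-compressed CSV."""
    with gzip.open(io.BytesIO(gz_bytes), "rt", newline="") as f:
        return {row[column] for row in csv.DictReader(f)}


def to_gz_csv(ids: list[str]) -> bytes:
    """Build a one-column gzip-compressed CSV (toy data only)."""
    text = "Metadata_JCP2022\n" + "\n".join(ids) + "\n"
    return gzip.compress(text.encode())


# Toy tables: the last ORF entry has no matching well.
well = to_gz_csv(["JCP2022_000001", "JCP2022_000002"])
orf = to_gz_csv(["JCP2022_000001", "JCP2022_000002", "JCP2022_999999"])

dropped = jcp_ids(orf) - jcp_ids(well)  # IDs that the filter would remove
kept = jcp_ids(orf) & jcp_ids(well)     # IDs that survive the filter

print(sorted(dropped))
assert kept <= jcp_ids(well)  # every kept ID exists in the well table
```

Printing the dropped IDs rather than only their count makes it easier to review which entries are affected before the files are overwritten.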

shntnu commented 1 month ago

@afermg

You have this:

Dataset orf contains 10 fewer entries

See

https://github.com/jump-cellpainting/datasets/issues/104#issuecomment-2035695119

This means that we are dropping JCP IDs that are present in the pilot datasets, right? Dropping compounds seems fine; I'm only worried about ORFs (because all 10 ORFs are in pilots)

afermg commented 1 month ago

> @afermg
>
> You have this:
>
> Dataset orf contains 10 fewer entries
>
> See #104 (comment)
>
> This means that we are dropping JCP IDs that are present in the pilot datasets, right? Dropping compounds seems fine; I'm only worried about ORFs (because all 10 ORFs are in pilots)

I'm pretty sure we had this discussion with both yourself and @niranjchandrasekaran and the final conclusion was to drop them, as they break most automated means of processing JUMP data. Please do correct me if I am mistaken though.

shntnu commented 1 month ago

You are right. Please proceed.

-Shantanu

On Thu, May 16, 2024 at 2:50 PM Alán F. Muñoz wrote:

I'm pretty sure we had this discussion with both yourself and @niranjchandrasekaran https://github.com/niranjchandrasekaran and the final conclusion was to drop them, as they break most automated means of processing JUMP data. Please do correct me if I am mistaken though.


jump-cellpainting / datasets

fix #104: drop samples not in wells.csv.gz #109
