jump-cellpainting / datasets

Images and other data from the JUMP Cell Painting Consortium
BSD 3-Clause "New" or "Revised" License

Only .parquet files in profiles directory #71

Open lewismervin1 opened 1 year ago

lewismervin1 commented 1 year ago

We noticed that the expected workspace folder structure for profiles (https://github.com/broadinstitute/cellpainting-gallery/blob/main/folder_structure.md#profiles-folder-structure), i.e.:

└── profiles
    └── 2021_04_26_Batch1
        ├── BR00117035
        │   ├── BR00117035.csv.gz
        │   ├── BR00117035_augmented.csv.gz
        │   ├── BR00117035_normalized.csv.gz
        │   ├── BR00117035_normalized_feature_select_negcon_plate.csv.gz
        │   ├── BR00117035_normalized_feature_select_plate.csv.gz
        │   └── BR00117035_normalized_negcon.csv.gz
        └── BR00117036

does not match what we find: each plate folder actually contains a single parquet file (similar to the ones expected in workspace_dl https://github.com/broadinstitute/cellpainting-gallery/blob/main/folder_structure.md#profiles-folder-structure-1). Is this expected, or does folder_structure.md need updating?

Many thanks for any help!
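
For anyone wanting to check this against their own copy, the comparison above can be sketched as a small audit script. This is only a minimal sketch, assuming the profiles have already been downloaded to a local directory; the function names `audit_plate_dir` and `audit_batch` are made up for illustration, and the suffix list is copied from the tree quoted above rather than from any official schema.

```python
from pathlib import Path

# Per-plate file suffixes as listed in the documented `profiles` tree above.
EXPECTED_SUFFIXES = (
    ".csv.gz",
    "_augmented.csv.gz",
    "_normalized.csv.gz",
    "_normalized_feature_select_negcon_plate.csv.gz",
    "_normalized_feature_select_plate.csv.gz",
    "_normalized_negcon.csv.gz",
)


def audit_plate_dir(plate_dir: Path) -> dict:
    """Compare one plate folder with the documented file list (hypothetical helper)."""
    plate = plate_dir.name
    # Names of the regular files actually present in the plate folder.
    present = {p.name for p in plate_dir.iterdir() if p.is_file()}
    # Names the documented structure would lead us to expect.
    expected = {plate + suffix for suffix in EXPECTED_SUFFIXES}
    return {
        "plate": plate,
        "missing": sorted(expected - present),
        "unexpected": sorted(present - expected),
    }


def audit_batch(batch_dir: Path) -> list:
    """Audit every plate folder inside one local batch folder (hypothetical helper)."""
    return [audit_plate_dir(d) for d in sorted(batch_dir.iterdir()) if d.is_dir()]
```

Calling `audit_batch(Path("profiles/2021_04_26_Batch1"))` on a local copy returns one report per plate; a plate folder holding only a parquet file would show all six documented names under `missing` and the parquet file under `unexpected`.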

niranjchandrasekaran commented 1 year ago

Hi Lewis, I am tagging @shntnu who should be able to tell you what our current plans are.

lewismervin1 commented 1 year ago

Thanks @niranjchandrasekaran and @shntnu. We ask because we can currently access only one (full plate) parquet file per plate, and are missing the _feature_select_negcon_plate.csv.gz, _normalized_feature_select_plate.csv.gz, etc. files.

niranjchandrasekaran commented 1 year ago

Hi Lewis, thanks for the additional context. Generating those additional files will require data alignment and normalization across all the sources, which we are still working on. Once we settle on the approach that we would take, we will either have per-plate parquet versions of those files or a single parquet file with all the plates (to be decided).

lewismervin1 commented 8 months ago

Hi @niranjchandrasekaran, we were wondering whether a decision has been made on how these files should look, and whether this issue should be closed. Many thanks for your help.

shntnu commented 8 months ago

@lewismervin1 thanks for checking in. We're still working on a data processing pipeline for getting all the JUMP data aligned.

> we will either have per-plate parquet versions of those files or a single parquet file with all the plates (to be decided).

We will eventually provide per-plate parquet, but the first few versions of the aligned data will be a single PyArrow Dataset.

Once we've finished implementing our new data validation system + schema (in progress here https://github.com/broadinstitute/cpg), we will distribute the data as per-plate parquet files (very likely using the same folder structure).

shntnu commented 4 months ago

> We will eventually provide per-plate parquet, but the first few versions of the aligned data will be a single PyArrow Dataset.

@lewismervin1 This is now available (the PR is still open, but you can peek in already)

https://github.com/jump-cellpainting/datasets/pull/99