Closed afermg closed 1 month ago
@afermg
You have this:
Dataset orf contains 10 fewer entries
See
https://github.com/jump-cellpainting/datasets/issues/104#issuecomment-2035695119
This means that we are dropping JCP IDs that are present in the pilot datasets, right? Dropping compounds seems fine; I'm only worried about the ORFs, because all 10 of them are in the pilots.
I'm pretty sure we had this discussion with both you and @niranjchandrasekaran, and the final conclusion was to drop them, as they break most automated means of processing JUMP data. Please do correct me if I am mistaken, though.
You are right. Please proceed.
-Shantanu
On Thu, May 16, 2024 at 2:50 PM Alán F. Muñoz @.***> wrote:
After our discussions related to #104 and this private issue, we came to the conclusion that we should drop the samples that are missing in {crispr, orf, compounds}.csv.gz. These were breaking automated workflows. I have also included the crispr dataset in case there is a problem with compression down the line (it's only 64 KB, so it doesn't make much of a difference).
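To illustrate the kind of filtering described above, here is a minimal sketch. It is not the script used for this PR. It assumes the tables are gzipped CSVs keyed by a `Metadata_JCP2022` column, and that "dropping" means keeping only rows whose ID appears in a reference set of IDs; which table is filtered against which is also an assumption. The function names are hypothetical and only the Python standard library is used.

```python
import csv
import gzip
import io


def drop_missing_ids(rows, valid_ids, id_column="Metadata_JCP2022"):
    """Split rows into (kept, dropped) by membership of their ID in valid_ids.

    The default id_column is an assumption about the table layout.
    """
    kept, dropped = [], []
    for row in rows:
        (kept if row[id_column] in valid_ids else dropped).append(row)
    return kept, dropped


def read_gzipped_csv(data: bytes):
    """Parse gzipped CSV bytes into a list of dicts, one per row."""
    with gzip.open(io.BytesIO(data), mode="rt", newline="") as handle:
        return list(csv.DictReader(handle))


def write_gzipped_csv(rows, fieldnames) -> bytes:
    """Serialize rows back to gzipped CSV bytes, header included."""
    buffer = io.BytesIO()
    with gzip.open(buffer, mode="wt", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up IDs for illustration only; these are not real JCP IDs.
    table = [
        {"Metadata_JCP2022": "ID_A", "Metadata_Symbol": "g1"},
        {"Metadata_JCP2022": "ID_B", "Metadata_Symbol": "g2"},
    ]
    kept, dropped = drop_missing_ids(table, valid_ids={"ID_A"})
    print(f"Dataset contains {len(dropped)} fewer entries")
    compressed = write_gzipped_csv(kept, ["Metadata_JCP2022", "Metadata_Symbol"])
    print(f"{len(compressed)} bytes after compression")
```

Returning the dropped rows alongside the kept ones makes it easy to report exactly which IDs were removed, which is the question raised in this review.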
On a related note, it's worth thinking about migrating the data to Parquet at some point, which would enable lazy-loading of metadata fields without downloading entire files. Let me know if that's a possibility worth considering so I can open a related issue.
The code to generate these new files is shown below, and its dependencies are specified here (alongside a poetry.lock file), but the important ones are:
Dependencies
Code