jump-cellpainting / datasets

Images and other data from the JUMP Cell Painting Consortium
BSD 3-Clause "New" or "Revised" License

Add SMILES to compounds.csv.gz #103

Status: Closed (srijitseal closed 6 months ago)

srijitseal commented 6 months ago

This file adds a standardized SMILES column and the first 14 characters of the InChIKey (representing the connectivity) to compounds.csv.gz. We show that six compounds have dual entries, often in different ionization states.
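As a rough illustration of the idea described above (not the code from this PR, which is not shown in the thread), the sketch below takes the first 14 characters of each InChIKey and groups compounds that share that connectivity block. The function names and compound IDs are hypothetical; the three InChIKeys are those of acetic acid, acetate, and caffeine, chosen only because acetic acid and acetate differ in ionization state.

```python
from collections import defaultdict


def connectivity_block(inchikey: str) -> str:
    """Return the first 14 characters of a standard InChIKey (the connectivity hash)."""
    block = inchikey.split("-")[0]
    if len(block) != 14:
        raise ValueError(f"not a standard InChIKey: {inchikey!r}")
    return block


def shared_connectivity(rows):
    """Group (compound_id, inchikey) pairs by connectivity block.

    Only blocks that appear in more than one entry are returned.
    """
    groups = defaultdict(list)
    for compound_id, inchikey in rows:
        groups[connectivity_block(inchikey)].append((compound_id, inchikey))
    return {block: members for block, members in groups.items() if len(members) > 1}


# Hypothetical compound IDs, used only for illustration.
rows = [
    ("CPD_0001", "QTBSBXVTEAMEQO-UHFFFAOYSA-N"),  # acetic acid
    ("CPD_0002", "QTBSBXVTEAMEQO-UHFFFAOYSA-M"),  # acetate (deprotonated form)
    ("CPD_0003", "RYYVLZVUVIJVGH-UHFFFAOYSA-N"),  # caffeine
]

duplicates = shared_connectivity(rows)
for block, members in duplicates.items():
    print(block, [compound_id for compound_id, _ in members])
```

Here the two acetic acid entries land in the same group because only the final protonation character of their keys differs, which is the kind of dual entry mentioned above.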

shntnu commented 6 months ago

Over to @afermg to review

@srijitseal I noticed a few things that would need fixing, but let's wait for @afermg's full review

afermg commented 6 months ago

@srijitseal Could you please point me to the code? I think it can go to monorepo, but we must first package it as a small library for reproducibility. The most important thing at the moment is pinning the dependencies. Let me know if I can be of help with that.

srijitseal commented 6 months ago

You can find the file here: https://github.com/jump-cellpainting/jump-cellpainting/pull/156 It's almost the same, but I removed the loop to save time after consulting with Andreas. I think standardization is now about 6 times faster with little loss of information. Tautomers will always remain a problem, no matter which package we use or how many loops we run to find the best tautomer.

(Sent by email in reply to Alán F. Muñoz's comment of Mon, Mar 18, 2024 at 11:56 AM, quoted above.)

afermg commented 6 months ago

The comment on pinning dependencies is only for reproducibility; I am not knowledgeable enough about chemoinformatics to comment on the usage of those libraries. I do need the dependency versions and one test to ensure that packaging still works.

Sorry if it seems like I'm asking for a lot, I just want to ensure that the code that we put in the monorepo runs correctly so it can be reliably referred to in the future. Also, because it is to be a tiny tool, we need to have it as a script/module, not a notebook. I can do the transformation though, as long as I can reproduce the environment in which you produced the data.
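To make the request above concrete, here is a minimal sketch of the kind of check a small packaged script could run: it parses `name==version` pins and compares them with what is installed. This is only an assumption about what such a test might look like; the helper names `parse_pins` and `check_pins` are hypothetical, and the thread does not specify which packages or versions would be pinned.

```python
from importlib import metadata


def parse_pins(lines):
    """Parse 'name==version' lines, skipping blank lines and '#' comments."""
    pins = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, version = line.partition("==")
        if not sep or not version.strip():
            raise ValueError(f"unpinned requirement: {line!r}")
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins, get_version=metadata.version):
    """Return {package: (pinned, installed_or_None)} for every mismatch.

    `get_version` defaults to importlib.metadata.version and can be
    replaced with any callable for testing.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = get_version(name)
        except (metadata.PackageNotFoundError, KeyError):
            installed = None  # package is not installed (or unknown to the lookup)
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

A test could then read the project's pin file, call `check_pins(parse_pins(lines))`, and assert that the result is empty before the data-producing code runs.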

shntnu commented 6 months ago

I overrode this PR for now by using the SMILES generated when the JCP IDs were created. Details: https://github.com/jump-cellpainting/datasets-private/pull/88