Closed EasternCaveMan closed 1 year ago
Hi @atabaigi,
I see why this is not working. In line 26 of ecfp.py I check whether the dataset has been labeled as a molecular dataset, which is not the case if you create it the way you do.
I'd suggest using
from datasail.reader.read_molecules import read_molecule_data
from datasail.cluster.ecfp import run_ecfp
mydata = read_molecule_data('output2500.tsv')
names, cluster_map, sim_matrix = run_ecfp(mydata)
Please let me know if it works for you or if you encountered any other issues.
Best, Roman
I followed the structure you mentioned :
mydata = read_molecule_data('output2500.tsv')
names, cluster_map, sim_matrix = run_ecfp(mydata)
I got this error : output
TypeError: read_molecule_data() missing 8 required positional arguments: 'weights', 'sim', 'dist', 'max_sim', 'max_dist', 'id_map', 'inter', and 'index'
which doesn't make sense to me, because on the package website (https://datasail.readthedocs.io/en/latest/workflow/input.html) it is said that for SMILES: A TSV file with the molecule's ID in the first column and a SMILES string in the second column. Further columns will be ignored.
I don't have the following arguments: 'weights', 'sim', 'dist'. By 'weights', do you mean the molecular weights? And for 'sim' and 'dist', do I have to write functions myself to calculate them? If yes, what should the format of the 'sim' and 'dist' files be? I also didn't understand this part: 'dist: Distance file or metric'. Instead of a distance file I can use a metric; what do you mean by metric? Does it mean I can use a key function to calculate the distance file? I really appreciate your help. I am a master's student at Saarland University, and your package was introduced to me by my supervisor; it has all the functions I need for my master's seminar.
Hi @atabaigi,
First of all, I'm happy to help you with any questions you have. Second, the package is still actively developed, therefore, it might be a little buggy/confusing sometimes, and the documentation is far from complete. Third, I'm a Ph.D. student in Saarbruecken as well and know Michael ;-)
Answering your question: You can set every argument to None, which should solve the issues. The reason for this "bug" is that DataSAIL was first implemented as a command-line tool and it was not intended to be "hacked" the way you do. But that's fine. I'm currently working on making DataSAIL conda-installable and therefore will address these things.
In general, maybe this docu-page might be helpful for you. With
dist: Distance file or metric
I mean that you either provide a CSV file storing a similarity matrix or distance matrix of your data points, or the name of a metric to be used for this. For molecular data, only "ECFP" is available.
Best, Roman
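To make the difference between a similarity matrix and a distance matrix concrete, here is a minimal sketch in plain Python. It assumes fingerprints are given as sets of "on" bit indices and uses the Tanimoto coefficient (the logs later in this thread mention Tanimoto coefficients). The toy fingerprints and the helper names are made up for illustration, and the exact CSV layout DataSAIL expects is not shown here.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two sets of 'on' fingerprint bits."""
    union = len(a | b)
    # Convention chosen for this sketch: two empty fingerprints count as identical.
    return len(a & b) / union if union else 1.0


def similarity_matrix(fingerprints):
    """Symmetric matrix of pairwise Tanimoto similarities (1.0 on the diagonal)."""
    n = len(fingerprints)
    sim = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        sim[i][j] = sim[j][i] = tanimoto(fingerprints[i], fingerprints[j])
    return sim


# Toy fingerprints (hypothetical bit sets, not real ECFP output).
fingerprints = [{1, 2, 3, 4}, {3, 4, 5, 6}, {7, 8}]
sim = similarity_matrix(fingerprints)

# A distance matrix is the complement of the similarity matrix.
dist = [[1.0 - s for s in row] for row in sim]
```

A similarity of 1 means identical fingerprints and 0 means no shared bits; the distance matrix simply inverts that scale.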
Hi @Old-Shatterhand , I hope that you are doing well, Thank you so much for your kind words. I have followed the structure which you mentioned: Input : input2500.tsv
molecule_chembl_id canonical_smiles IC50
CHEMBL68920 Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)c32)[nH]1 41.0
CHEMBL68920 Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)c32)[nH]1 300.0
CHEMBL68920 Cc1cc(C)c(/C=C2\C(=O)Nc3ncnc(Nc4ccc(F)c(Cl)c4)c32)[nH]1 7820.0
code
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold
from datasail.reader.read_molecules import read_molecule_data
from datasail.cluster.ecfp import run_ecfp
mydata = read_molecule_data('input2500.tsv', weights=None, sim=None, dist=None, max_sim=None, max_dist=None, id_map=None, inter=None,index=None)
names, cluster_map, sim_matrix = run_ecfp(mydata)
here is the following error:
AttributeError: 'tuple' object has no attribute 'type'
Hi @atabaigi,
I'm sorry to hear that you still encounter issues. Could you please provide the full stack trace of the output so that I can see in which line the code broke? Furthermore, there has been a lot going on in the code of DataSAIL. Maybe you can update your version to the latest state of the repo or install it from conda: https://datasail.readthedocs.io/en/latest/index.html
Best, Roman
Hi @Old-Shatterhand , Thanks for your quick response. Yes, actually, I have realized that there have been a lot of changes since last week.
here is the full stack trace of the output:
Traceback (most recent call last):
File "/Users/vahidatabaigi/anaconda3/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3460, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-2-09fea9123334>", line 1, in <module>
runfile('/Users/vahidatabaigi/PycharmProjects/test.py', wdir='/Users/vahidatabaigi/PycharmProjects')
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_umd.py", line 198, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/Users/vahidatabaigi/PycharmProjects/test.py", line 7, in <module>
names, cluster_map, sim_matrix = run_ecfp(mydata)
File "/Users/vahidatabaigi/anaconda3/lib/python3.10/site-packages/datasail/cluster/ecfp.py", line 26, in run_ecfp
if dataset.type != "M":
AttributeError: 'tuple' object has no attribute 'type'
The issue is that read_molecule_data returns a tuple of the dataset and a list of interactions (which is not interesting in your case). If you change the code to
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold
from datasail.reader.read_molecules import read_molecule_data
from datasail.cluster.ecfp import run_ecfp
mydata, _ = read_molecule_data('input2500.tsv', weights=None, sim=None, dist=None, max_sim=None, max_dist=None, id_map=None, inter=None,index=None)
names, cluster_map, sim_matrix = run_ecfp(mydata)
it should work.
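The error pattern above can be reproduced without DataSAIL. The sketch below uses a hypothetical stand-in reader (read_data_stub) and a hypothetical Dataset class; neither is part of the package. It only shows why attribute access fails on the returned tuple and why unpacking fixes it.

```python
class Dataset:
    """Hypothetical stand-in for a dataset object; it only carries a type tag."""

    def __init__(self, type_):
        self.type = type_


def read_data_stub(path):
    """Hypothetical reader that returns (dataset, interactions), like the reader discussed above."""
    return Dataset("M"), []


# Without unpacking, the result is a tuple, and a tuple has no 'type' attribute.
result = read_data_stub("input2500.tsv")
has_type = hasattr(result, "type")  # False: result.type would raise AttributeError

# Unpacking yields the dataset itself; the interactions are discarded with '_'.
dataset, _ = read_data_stub("input2500.tsv")
```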
Thank you so much ☺️, it worked.
Hi @Old-Shatterhand, I hope you are doing well, I had a problem with the installation of the package. I tried to install it through CLI, but every time I got this error:
(base) vahidatabaigi@vahids-MacBook-Pro ~ % conda install -c kalininalab -c conda-forge -c mosek datasail
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
PackagesNotFoundError: The following packages are not available from current channels:
- datasail
Current channels:
- https://conda.anaconda.org/kalininalab/osx-arm64
- https://conda.anaconda.org/kalininalab/noarch
- https://conda.anaconda.org/conda-forge/osx-arm64
- https://conda.anaconda.org/conda-forge/noarch
- https://conda.anaconda.org/mosek/osx-arm64
- https://conda.anaconda.org/mosek/noarch
- https://repo.anaconda.com/pkgs/main/osx-arm64
- https://repo.anaconda.com/pkgs/main/noarch
- https://repo.anaconda.com/pkgs/r/osx-arm64
- https://repo.anaconda.com/pkgs/r/noarch
To search for alternate channels that may provide the conda package you're
looking for, navigate to
https://anaconda.org
and use the search bar at the top of the page.
To solve this problem, I downloaded the package, went to the directory in which the file setup.py exists, and ran the following command:
pip install .
so that I was able to install and import the package in Pycharm. if you remember last time I was able to run the following code:
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold
from datasail.reader.read_molecules import read_molecule_data
from datasail.cluster.ecfp import run_ecfp
mydata, _ = read_molecule_data('input2500.tsv', weights=None, sim=None, dist=None, max_sim=None, max_dist=None, id_map=None, inter=None,index=None)
names, cluster_map, sim_matrix = run_ecfp(mydata)
In the next step, I want to split the data based on clusters into training and test sets. I tried to import a function from the package to do this, but I was not able to find one. However, the package's documentation describes how to split the data through the CLI, so I used the following command:
python sail --e-type M --e-data input2500.tsv --e-sim ecfp --output PycharmProjects/CADDSeminar_2023/notebooks/T001_Scaffold-based-data-split --technique CCS --splits 0.8 0.2
but I got this error:
(base) vahidatabaigi@vahids-MacBook-Pro ~ % python sail --e-type M --e-data input2500.tsv --e-sim ecfp --output PycharmProjects/CADDSeminar_2023/notebooks/T001_Scaffold-based-data-split --technique CCS --splits 0.8 0.2
python: can't open file '/Users/vahidatabaigi/sail': [Errno 2] No such file or directory
I am sorry if the message is too long, but I thought it would be better to explain everything to give all information. have a good day
Hi @atabaigi,
I appreciate your patience with DataSAIL. Thank you for giving me that much information about the problem. That helps a lot.
Regarding the installation: Please try
conda install -c kalininalab -c conda-forge -c mosek -c bioconda datasail
Regarding the execution: When running the CLI of datasail, you don't need to prepend python. Please try
datasail -h
Furthermore, I advise installing datasail into a separate conda environment.
Regarding importing stuff from datasail: I've tested the import statements in a conda environment with datasail installed and it worked fine. Here's what I tried:
import datasail
from datasail import reader
from datasail.reader.read_molecules import read_molecule_data
I hope I could help to fix your problems. If not, please let me know.
Best, Roman
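A quick way to check whether a command-line tool is reachable from the active environment is to look it up on PATH before calling it. The sketch below uses only the Python standard library; the helper names find_cli and show_help are made up for this example, and the only thing taken from the thread is the command name datasail and its -h flag.

```python
import shutil
import subprocess


def find_cli(name):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


def show_help(name):
    """Run '<name> -h' if the tool is on PATH; otherwise return a hint."""
    path = find_cli(name)
    if path is None:
        return f"'{name}' is not on PATH; is the right environment active?"
    done = subprocess.run([path, "-h"], capture_output=True, text=True)
    return done.stdout or done.stderr


print(show_help("datasail"))
```

If the hint is printed, the tool was not installed as a command in the currently active environment, which matches the "No such file" and "command not found" messages in this thread.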
Hi @Old-Shatterhand, sorry for bothering you again, I have tried to install the dataSAIL as you said. In order to minimize the potential for conflicts with other packages, I uninstalled Anaconda and installed Miniconda instead. Additionally, I created a separate conda environment specifically for dataSAIL. Despite these efforts, I encountered the same error as before. I even tried installing dataSAIL on different computers with various operating systems, but unfortunately, the issue persisted. The error I received on Windows was identical to the one I encountered on macOS.
(base) C:\Users\Ali>conda install -c kalininalab -c conda-forge -c mosek -c bioconda datasail
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
PackagesNotFoundError: The following packages are not available from current channels:
- datasail
Current channels:
- https://conda.anaconda.org/kalininalab/win-64
- https://conda.anaconda.org/kalininalab/noarch
- https://conda.anaconda.org/conda-forge/win-64
- https://conda.anaconda.org/conda-forge/noarch
- https://conda.anaconda.org/mosek/win-64
- https://conda.anaconda.org/mosek/noarch
- https://conda.anaconda.org/bioconda/win-64
- https://conda.anaconda.org/bioconda/noarch
- https://repo.anaconda.com/pkgs/main/win-64
- https://repo.anaconda.com/pkgs/main/noarch
- https://repo.anaconda.com/pkgs/r/win-64
- https://repo.anaconda.com/pkgs/r/noarch
- https://repo.anaconda.com/pkgs/msys2/win-64
- https://repo.anaconda.com/pkgs/msys2/noarch
To search for alternate channels that may provide the conda package you're
looking for, navigate to
https://anaconda.org/
and use the search bar at the top of the page.
I believe there may have been a misunderstanding in our previous conversation. As I mentioned before, I am able to install your package locally by executing the setup.py file using the following code:
(base) vahidatabaigi@vahids-MBP DataSAIL-main % ls
LICENSE build.sh datasail environment.yml pytest.ini tests
README.md conftest.py docs meta.yaml setup.py
(base) vahidatabaigi@vahids-MBP DataSAIL-main % pip install .
Processing /Users/vahidatabaigi/Downloads/DataSAIL-main
Preparing metadata (setup.py) ... done
Building wheels for collected packages: DataSAIL
Building wheel for DataSAIL (setup.py) ... done
Created wheel for DataSAIL: filename=DataSAIL-0.0.10-py3-none-any.whl size=71719 sha256=db087a38e28034f85139e6b8bd1c51ad8d6e804403c526ec690c5ec62eab68e8
Stored in directory: /Users/vahidatabaigi/Library/Caches/pip/wheels/4b/d8/f8/9f443bb8564b7e70617c7eb219ad154fe56845435678f3b890
Successfully built DataSAIL
Installing collected packages: DataSAIL
Attempting uninstall: DataSAIL
Found existing installation: DataSAIL 0.0.10
Uninstalling DataSAIL-0.0.10:
Successfully uninstalled DataSAIL-0.0.10
Successfully installed DataSAIL-0.0.10
However, I encountered an error when attempting to split my data. The specific error message I received is:
(base) vahidatabaigi@vahids-MBP DataSAIL-main % sail --e-type M --e-data input2500.tsv --e-sim ecfp --output /Users/vahidatabaigi/PycharmProjects/CADDSeminar_2023/notebooks --technique CCS --splits 0.8 0.2
zsh: command not found: sail
Although I am able to import dataSAIL successfully in PyCharm, I am unsure which specific function or module I should import in order to correctly split the data after clustering. Additionally, I would like to be able to control the threshold for clustering, but I don't know how to do that. Here is the code I have been using for clustering my data:
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold
from datasail.reader.read_molecules import read_molecule_data
from datasail.cluster.ecfp import run_ecfp
mydata, _ = read_molecule_data('input2500.tsv', weights=None, sim=None, dist=None, max_sim=None, max_dist=None, id_map=None, inter=None,index=None)
names, cluster_map, sim_matrix = run_ecfp(mydata)
kind regards, Vahid
Hi @atabaigi,
no worries, I'm happy to fix any bug/problem you find. Let's go through this step by step:
Apparently, the conda install does not work for me either. I guess the conda dependency solver is not powerful enough to build the environment. Here, I'd suggest using mamba instead. Even though they say not to install it in conda's base environment, I have had no issues with this so far. Then, you can simply run
mamba install -c kalininalab -c conda-forge -c mosek -c bioconda datasail
pip install grakel
Installing via pip install . is not a supported way of installing DataSAIL so far. It technically runs without errors, but no dependencies are installed (and, as you've encountered, the command line tool is also not available). I'll maybe add this functionality later, but so far, the setup.py is just an artifact left over from the very early stages of the project.
There is no real clustering happening in the run_ecfp method. If you check it out (code and docu), it comprises three steps.
The aggregation of data points occurs for molecules with the same Murcko scaffold. DataSAIL then just computes a similarity matrix for all scaffolds. There is no threshold to be controlled.
In order to split the data, you have to run run_solver (code). But this will not work! As you might have 2500 clusters (judging from your input file's name), the constraint optimization problem becomes way too big (I tested that and it exceeds 1TB of CPU memory). The easiest way is to use DataSAIL in the way it has been designed for. Just call
from datasail.sail import datasail
_, splits, _ = datasail(techniques=["CCSe"], e_type="M", e_data="input2500.tsv")
The reason why you cannot just run the splitting after the ECFP clustering is that ECFP returns 2500 clusters in your case (judging from the input file's name). Internally, DataSAIL runs affinity propagation (sklearn-clustering) on the ECFP clusters to actually introduce clusters. This method cannot be easily extracted from DataSAIL as you did with reading the molecules and clustering them by yourself.
I hope, I could help you with this. If not, or if you encounter any other issues, please let me know.
Best, Roman
Hi @Old-Shatterhand, I installed mamba and then ran your command. Thank you for your patience with me. Here is the error I encountered. I think it is the same as the previous error.
(py310_caddseminar2023) vahidatabaigi@vahids-MBP CADDSeminar_2023 % conda install mamba
Collecting package metadata (current_repodata.json): done
Solving environment: done
==> WARNING: A newer version of conda exists. <==
current version: 23.3.1
latest version: 23.5.0
Please update conda by running
$ conda update -n base -c defaults conda
Or to minimize the number of packages updated during conda update use
conda install conda=23.5.0
## Package Plan ##
environment location: /Users/vahidatabaigi/miniconda3/envs/py310_caddseminar2023
added / updated specs:
- mamba
The following packages will be downloaded:
package | build
---------------------------|-----------------
c-ares-1.19.0 | h80987f9_0 104 KB
ca-certificates-2023.05.30 | hca03da5_0 121 KB
certifi-2023.5.7 | py310hca03da5_0 153 KB
conda-22.11.1 | py310hca03da5_5 957 KB
conda-package-handling-2.1.0| py310hca03da5_0 270 KB
conda-package-streaming-0.8.0| py310hca03da5_0 29 KB
cryptography-38.0.4 | py310hfc83b78_0 1.0 MB conda-forge
fmt-9.1.0 | h48ca7d4_0 179 KB
krb5-1.20.1 | h48293ea_0 1.2 MB
libarchive-3.6.2 | h82b9b87_1 781 KB conda-forge
libcurl-8.1.2 | h912dcd9_0 338 KB conda-forge
libedit-3.1.20221030 | h80987f9_0 154 KB
libev-4.33 | h1a28f6b_1 104 KB
libmamba-1.4.2 | h7d1d596_0 1.1 MB conda-forge
libmambapy-1.4.2 | py310h34b6e76_0 216 KB conda-forge
libnghttp2-1.52.0 | hae82a92_0 551 KB conda-forge
libsolv-0.7.24 | hb5ab8b9_0 377 KB conda-forge
libssh2-1.11.0 | h7a5bd25_0 250 KB conda-forge
lz4-c-1.9.4 | h313beb8_0 155 KB
lzo-2.10 | h1a28f6b_2 129 KB
mamba-1.4.2 | py310ha5d4528_0 50 KB conda-forge
pybind11-abi-4 | hd3eb1b0_1 14 KB
reproc-14.2.4 | hc377ac9_1 27 KB
reproc-cpp-14.2.4 | hc377ac9_1 20 KB
ruamel.yaml.clib-0.2.7 | py310h8e9501a_1 107 KB conda-forge
yaml-cpp-0.7.0 | hc377ac9_1 427 KB
------------------------------------------------------------
Total: 8.7 MB
The following NEW packages will be INSTALLED:
c-ares pkgs/main/osx-arm64::c-ares-1.19.0-h80987f9_0
conda pkgs/main/osx-arm64::conda-22.11.1-py310hca03da5_5
conda-package-han~ pkgs/main/osx-arm64::conda-package-handling-2.1.0-py310hca03da5_0
conda-package-str~ pkgs/main/osx-arm64::conda-package-streaming-0.8.0-py310hca03da5_0
cryptography conda-forge/osx-arm64::cryptography-38.0.4-py310hfc83b78_0
fmt pkgs/main/osx-arm64::fmt-9.1.0-h48ca7d4_0
krb5 pkgs/main/osx-arm64::krb5-1.20.1-h48293ea_0
libarchive conda-forge/osx-arm64::libarchive-3.6.2-h82b9b87_1
libcurl conda-forge/osx-arm64::libcurl-8.1.2-h912dcd9_0
libedit pkgs/main/osx-arm64::libedit-3.1.20221030-h80987f9_0
libev pkgs/main/osx-arm64::libev-4.33-h1a28f6b_1
libmamba conda-forge/osx-arm64::libmamba-1.4.2-h7d1d596_0
libmambapy conda-forge/osx-arm64::libmambapy-1.4.2-py310h34b6e76_0
libnghttp2 conda-forge/osx-arm64::libnghttp2-1.52.0-hae82a92_0
libsolv conda-forge/osx-arm64::libsolv-0.7.24-hb5ab8b9_0
libssh2 conda-forge/osx-arm64::libssh2-1.11.0-h7a5bd25_0
lz4-c pkgs/main/osx-arm64::lz4-c-1.9.4-h313beb8_0
lzo pkgs/main/osx-arm64::lzo-2.10-h1a28f6b_2
mamba conda-forge/osx-arm64::mamba-1.4.2-py310ha5d4528_0
pluggy pkgs/main/osx-arm64::pluggy-1.0.0-py310hca03da5_1
pybind11-abi pkgs/main/noarch::pybind11-abi-4-hd3eb1b0_1
pycosat pkgs/main/osx-arm64::pycosat-0.6.4-py310h1a28f6b_0
pyopenssl pkgs/main/osx-arm64::pyopenssl-23.0.0-py310hca03da5_0
reproc pkgs/main/osx-arm64::reproc-14.2.4-hc377ac9_1
reproc-cpp pkgs/main/osx-arm64::reproc-cpp-14.2.4-hc377ac9_1
ruamel.yaml pkgs/main/osx-arm64::ruamel.yaml-0.17.21-py310h1a28f6b_0
ruamel.yaml.clib conda-forge/osx-arm64::ruamel.yaml.clib-0.2.7-py310h8e9501a_1
toolz pkgs/main/osx-arm64::toolz-0.12.0-py310hca03da5_0
tqdm pkgs/main/osx-arm64::tqdm-4.65.0-py310h33ce5c2_0
yaml-cpp pkgs/main/osx-arm64::yaml-cpp-0.7.0-hc377ac9_1
zstandard pkgs/main/osx-arm64::zstandard-0.19.0-py310h80987f9_0
The following packages will be UPDATED:
ca-certificates conda-forge::ca-certificates-2023.5.7~ --> pkgs/main::ca-certificates-2023.05.30-hca03da5_0
The following packages will be SUPERSEDED by a higher-priority channel:
certifi conda-forge/noarch::certifi-2023.5.7-~ --> pkgs/main/osx-arm64::certifi-2023.5.7-py310hca03da5_0
Proceed ([y]/n)? y
Downloading and Extracting Packages
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(py310_caddseminar2023) vahidatabaigi@vahids-MBP CADDSeminar_2023 % which mamba
/Users/vahidatabaigi/miniconda3/envs/py310_caddseminar2023/bin/mamba
(py310_caddseminar2023) vahidatabaigi@vahids-MBP CADDSeminar_2023 % mamba install -c kalininalab -c conda-forge -c mosek -c bioconda datasail
mamba (1.4.2) supported by @QuantStack
GitHub: https://github.com/mamba-org/mamba
Twitter: https://twitter.com/QuantStack
Looking for: ['datasail']
mosek/osx-arm64 4.2kB @ 8.7kB/s 0.5s
pkgs/r/osx-arm64 118.0 B @ 167.0 B/s 0.2s
kalininalab/noarch 135.0 B @ 173.0 B/s 0.8s
kalininalab/osx-arm64 136.0 B @ 142.0 B/s 1.0s
mosek/noarch 135.0 B @ 116.0 B/s 0.2s
pkgs/main/noarch 837.8kB @ 148.5kB/s 4.5s
pkgs/r/noarch 1.3MB @ 189.3kB/s 6.1s
bioconda/osx-arm64 129.0 B @ 17.0 B/s 0.6s
pkgs/main/osx-arm64 1.7MB @ 140.7kB/s 6.2s
conda-forge/osx-arm64 6.5MB @ 429.4kB/s 15.3s
bioconda/noarch 4.3MB @ 279.4kB/s 14.7s
conda-forge/noarch 12.5MB @ 667.8kB/s 19.1s
Pinned packages:
- python 3.10.*
Could not solve for environment specs
The following package could not be installed
└─ datasail does not exist (perhaps a typo or a missing channel).
(py310_caddseminar2023) vahidatabaigi@vahids-MBP CADDSeminar_2023 % conda search -c conda-forge datasail
Loading channels: done
No match found for: datasail. Search: *datasail*
PackagesNotFoundError: The following packages are not available from current channels:
- datasail
Current channels:
- https://conda.anaconda.org/conda-forge/osx-arm64
- https://conda.anaconda.org/conda-forge/noarch
- https://repo.anaconda.com/pkgs/main/osx-arm64
- https://repo.anaconda.com/pkgs/main/noarch
- https://repo.anaconda.com/pkgs/r/osx-arm64
- https://repo.anaconda.com/pkgs/r/noarch
To search for alternate channels that may provide the conda package you're
looking for, navigate to
https://anaconda.org
and use the search bar at the top of the page.
best regards Vahid
Hi @atabaigi,
which OS are you using? Due to its dependencies, it's not available for Windows. I just tested
mamba install -c kalininalab -c conda-forge -c mosek -c bioconda datasail
on my Linux computer and it works fine. TBH, I haven't tested the installation on OSX but will include this in the GitHub CI/CD.
Best, Roman
Hi @Old-Shatterhand, I am using macOS ventura 13.4, chip M1
Ok, I didn't expect that. To keep things easier to track, I opened another issue, as the discussion is now drifting away from the initial topic. As I assume the original question regarding dataset formatting is solved, I am closing this issue and will work on the OSX installability in issue #3. If you think your initial question(s) regarding the dataset format have not been answered sufficiently, please reopen this issue and I will help you with this. Otherwise, see you in issue #3.
I tried to split the data based on CCSe, but it gives an empty dict named splits:
from datasail.sail import datasail
_, splits, _ = datasail(techniques=["CCSe"], e_type="M", e_data="input2500.tsv")
error
import sys; print('Python %s on %s' % (sys.version, sys.platform))
sys.path.extend(['/Users/vahidatabaigi/PycharmProjects'])
PyDev console: starting.
Python 3.10.11 | packaged by conda-forge | (main, May 10 2023, 19:01:19) [Clang 14.0.6 ] on darwin
runfile('/Users/vahidatabaigi/PycharmProjects/datasail_package.py', wdir='/Users/vahidatabaigi/PycharmProjects')
/Users/vahidatabaigi/miniconda3/lib/python3.10/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
No duplicates found.
1201 / 1290
Converged after 62 iterations.
Converged after 26 iterations.
===============================================================================
CVXPY
v1.3.1
===============================================================================
(CVXPY) Jun 09 02:45:13 PM: Your problem has 87 variables, 10 constraints, and 0 parameters.
(CVXPY) Jun 09 02:45:13 PM: It is compliant with the following grammars: DCP, DQCP
(CVXPY) Jun 09 02:45:13 PM: (If you need to solve this problem multiple times, but with different data, consider using parameters.)
(CVXPY) Jun 09 02:45:13 PM: CVXPY will first compile your problem; then, it will invoke a numerical solver to obtain a solution.
2023-06-09 14:45:13,010 Splitting failed for CCS, try to increase the timelimit or the epsilon value.
No duplicates found.
1201 / 1290
I added the parameter epsilon = 0.05 and increased it by 0.2, but I got the same error. I did not find a timelimit parameter.
Hi @atabaigi,
that is likely an issue with the data (they might be too similar). Is it possible to upload them here (if they are not proprietary)?
Anyway, can you please paste the output you get when running the same command with verbose="I" to get more logging messages?
BTW: The field for the timeout is max_sec, but that wouldn't help you here, as CVXPY is not even able to compile the program.
Best, Roman
Here is my data
with parameter verbose="I"
PyDev console: starting.
Python 3.10.11 | packaged by conda-forge | (main, May 10 2023, 19:01:19) [Clang 14.0.6 ] on darwin
runfile('/Users/vahidatabaigi/PycharmProjects/datasail_package.py', wdir='/Users/vahidatabaigi/PycharmProjects')
/Users/vahidatabaigi/miniconda3/lib/python3.10/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
2023-06-09 15:08:56,062 Validating arguments
2023-06-09 15:08:56,062 Read data
No duplicates found.
2023-06-09 15:08:56,314 Cluster first set of entities.
2023-06-09 15:08:56,314 Start ECFP clustering
2023-06-09 15:08:58,119 Reduced 1764 molecules to 1290
2023-06-09 15:08:58,119 Compute Tanimoto Coefficients
1201 / 1290
2023-06-09 15:08:58,273 Cluster 1290 items based on similarities
Converged after 58 iterations.
2023-06-09 15:08:59,958 Reduced number of clusters to 158.
2023-06-09 15:08:59,958 Cluster 158 items based on similarities
Converged after 26 iterations.
2023-06-09 15:08:59,980 Reduced number of clusters to 29.
2023-06-09 15:08:59,980 Split data
2023-06-09 15:08:59,980 Define optimization problem
2023-06-09 15:08:59,980 CCSe
2023-06-09 15:08:59,980 Clustering 29 clusters into 3 splits.
2023-06-09 15:08:59,982 Start solving with MOSEK
2023-06-09 15:08:59,982 The problem has 87 variables and 2558 constraints.
===============================================================================
CVXPY
v1.3.1
===============================================================================
(CVXPY) Jun 09 03:08:59 PM: Your problem has 87 variables, 10 constraints, and 0 parameters.
(CVXPY) Jun 09 03:08:59 PM: It is compliant with the following grammars: DCP, DQCP
(CVXPY) Jun 09 03:08:59 PM: (If you need to solve this problem multiple times, but with different data, consider using parameters.)
(CVXPY) Jun 09 03:08:59 PM: CVXPY will first compile your problem; then, it will invoke a numerical solver to obtain a solution.
2023-06-09 15:08:59,983 Splitting failed for CCS, try to increase the timelimit or the epsilon value.
2023-06-09 15:08:59,984 Store results
2023-06-09 15:08:59,984 BQP splitting finished and results stored.
2023-06-09 15:08:59,984 Total runtime: 3.92160s
but the result is still an empty dict
Hi @atabaigi,
sorry for this problem. It's actually not data-related. It's because of the choice of the solving algorithm. There are two options to fix this:
Use SCIP:
from datasail.sail import datasail
splits, _, _ = datasail(techniques=["CCSe"], e_type="M", e_data="input2500.tsv", solver="SCIP")
Please note that in this code, I changed the position of splits!
I hope this finally solves your problem. If not, please ask further questions.
Please note that the output only contains split assignment for one CHEMBL-ID per SMILES string. As your dataset contains duplicate SMILES, DataSAIL removes them first.
I thought they would appear in the output, but it seems like this is only true for the CLI version. I will implement this for package usage ASAP.
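Until duplicates show up in the package output, the assignments can be propagated back to every row by hand. The sketch below is one possible way to do this in plain Python. It assumes the split result is a dict mapping one representative ID per SMILES to a split label; that shape, the helper name expand_splits, and the toy IDs are assumptions for illustration, not DataSAIL's documented output format.

```python
import csv
import io
from collections import defaultdict


def expand_splits(tsv_text, splits):
    """Give every row the split label of the representative ID sharing its SMILES."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))[1:]  # skip header
    ids_by_smiles = defaultdict(list)
    for row in rows:
        if len(row) >= 2:  # first column: ID, second column: SMILES
            ids_by_smiles[row[1]].append(row[0])

    expanded = {}
    for ids in ids_by_smiles.values():
        # Find the label of whichever ID in this group was kept as representative.
        label = next((splits[i] for i in ids if i in splits), None)
        if label is not None:
            for mol_id in ids:
                expanded[mol_id] = label
    return expanded


# Toy input: A and B share a SMILES string, so only A received an assignment.
tsv_text = "id\tsmiles\nA\tCCO\nB\tCCO\nC\tc1ccccc1\n"
assigned = {"A": "train", "C": "test"}  # assumed shape of the split result
expanded = expand_splits(tsv_text, assigned)
```

Rows whose SMILES never received an assignment are left out of the result rather than guessed.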
Hi @Old-Shatterhand,
I apologize for the delayed response. I initially thought that the issue might be related to my operating system, so I opted to install Ubuntu Linux arm64. However, despite following your suggestion to split the data, I am still encountering the same error. Additionally, I attempted to resolve the issue by changing the solver to Mosek, but unfortunately, it did not make any difference.
As for my endeavor to install the package on Ubuntu, I am still facing the same error.
(base) vahidata@ubuntu:~$ conda env list
# conda environments:
#
base * /home/vahidata/miniconda3
(base) vahidata@ubuntu:~$ conda create -n sail -c conda-forge -c kalininalab -c mosek -c bioconda datasail
Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed
PackagesNotFoundError: The following packages are not available from current channels:
- datasail
Current channels:
- https://conda.anaconda.org/conda-forge/linux-aarch64
- https://conda.anaconda.org/conda-forge/noarch
- https://conda.anaconda.org/kalininalab/linux-aarch64
- https://conda.anaconda.org/kalininalab/noarch
- https://conda.anaconda.org/mosek/linux-aarch64
- https://conda.anaconda.org/mosek/noarch
- https://conda.anaconda.org/bioconda/linux-aarch64
- https://conda.anaconda.org/bioconda/noarch
- https://repo.anaconda.com/pkgs/main/linux-aarch64
- https://repo.anaconda.com/pkgs/main/noarch
- https://repo.anaconda.com/pkgs/r/linux-aarch64
- https://repo.anaconda.com/pkgs/r/noarch
To search for alternate channels that may provide the conda package you're
looking for, navigate to
https://anaconda.org
and use the search bar at the top of the page.
(base) vahidata@ubuntu:~$ conda install -c conda-forge -c kalininalab -c mosek -c bioconda datasail
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
PackagesNotFoundError: The following packages are not available from current channels:
- datasail
Current channels:
- https://conda.anaconda.org/conda-forge/linux-aarch64
- https://conda.anaconda.org/conda-forge/noarch
- https://conda.anaconda.org/kalininalab/linux-aarch64
- https://conda.anaconda.org/kalininalab/noarch
- https://conda.anaconda.org/mosek/linux-aarch64
- https://conda.anaconda.org/mosek/noarch
- https://conda.anaconda.org/bioconda/linux-aarch64
- https://conda.anaconda.org/bioconda/noarch
- https://repo.anaconda.com/pkgs/main/linux-aarch64
- https://repo.anaconda.com/pkgs/main/noarch
- https://repo.anaconda.com/pkgs/r/linux-aarch64
- https://repo.anaconda.com/pkgs/r/noarch
To search for alternate channels that may provide the conda package you're
looking for, navigate to
https://anaconda.org
and use the search bar at the top of the page.
(base) vahidata@ubuntu:~$ conda install -c kalininalab -c conda-forge -c mosek datasail
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
PackagesNotFoundError: The following packages are not available from current channels:
- datasail
Current channels:
- https://conda.anaconda.org/kalininalab/linux-aarch64
- https://conda.anaconda.org/kalininalab/noarch
- https://conda.anaconda.org/conda-forge/linux-aarch64
- https://conda.anaconda.org/conda-forge/noarch
- https://conda.anaconda.org/mosek/linux-aarch64
- https://conda.anaconda.org/mosek/noarch
- https://repo.anaconda.com/pkgs/main/linux-aarch64
- https://repo.anaconda.com/pkgs/main/noarch
- https://repo.anaconda.com/pkgs/r/linux-aarch64
- https://repo.anaconda.com/pkgs/r/noarch
To search for alternate channels that may provide the conda package you're
looking for, navigate to
https://anaconda.org
and use the search bar at the top of the page.
I also attempted to resolve the issue by using Mamba (as you recommended), but unfortunately, I encountered the same error as on the Mac and Windows systems.
Hi @atabaigi,
I'm sorry you still have issues installing DataSAIL from the kalininalab conda-channel. To be honest, I don't quite understand why you do the steps above.
As sorted out here, installing DataSAIL using conda directly does not work for me either. I suggested using mamba.
As sorted out throughout this issue and issue #3, DataSAIL cannot be installed and executed on Windows OS, but works perfectly on MacOS and Linux. You just have to use mamba for this. I'll copy the instructions I gave in #3 here and add a bit more explanation:
Start in the base environment of a Linux OS or MacOS machine!
Install mamba to the base environment! Therefore, run conda install -c conda-forge mamba in your base environment. Further information can be found in the mamba documentation.
conda create -n <env_name> python=3.10
conda activate <env_name>
mamba install -c mosek -c conda-forge -c bioconda -y numpy pandas networkx matplotlib pytest setuptools pyscipopt foldseek mmseqs2 cd-hit mash tmalign cvxpy pytest-cov rdkit pytest-cases scikit-learn
pip install grakel
It is important to use python=3.10, as DataSAIL has no Python 3.11 builds yet. This is actually the issue with your last try to install DataSAIL using mamba.
I'm pretty sure that if you follow these steps rigorously, you will have installed DataSAIL v0.0.10 successfully and can use it as a Python package and as a CLI tool.
If you're still struggling with the installation, Michael or I can help you in person (he can give you some contact details of me).
Best, Roman
Hi @Old-Shatterhand, thanks for your message. I followed the steps, but it didn't work on Linux and macOS on ARM architecture. However, it works on Linux and macOS on AMD64 (x86-64) architecture. For completeness, I added one more step to successfully install dataSAIL:
conda create -n <env_name> python=3.10
conda activate <env_name>
conda install -c conda-forge mamba
mamba install -c mosek -c conda-forge -c bioconda -y numpy pandas networkx matplotlib pytest setuptools pyscipopt foldseek mmseqs2 cd-hit mash tmalign cvxpy pytest-cov rdkit pytest-cases scikit-learn
pip install grakel
mamba install -c kalininalab -c conda-forge -c mosek -c bioconda datasail
Best regards Vahid
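The "Current channels" lists in the logs above show which platform directory conda searched (osx-arm64, win-64, linux-aarch64). As a rough diagnostic, the sketch below maps the local machine to the corresponding conda platform name using only the standard library. The function name conda_subdir is made up for this example, the mapping covers only common cases, and it cannot tell whether a given package actually has a build for that platform.

```python
import platform


def conda_subdir(system, machine):
    """Map an OS name and CPU architecture to the usual conda platform directory."""
    system, machine = system.lower(), machine.lower()
    x86 = machine in {"x86_64", "amd64"}
    arm = machine in {"arm64", "aarch64"}
    if system == "darwin":
        return "osx-arm64" if arm else "osx-64" if x86 else "unknown"
    if system == "linux":
        return "linux-aarch64" if arm else "linux-64" if x86 else "unknown"
    if system == "windows":
        return "win-64" if x86 else "unknown"
    return "unknown"


# Report the platform directory conda would most likely search on this machine.
print(conda_subdir(platform.system(), platform.machine()))
```

If the printed name does not appear among the builds a channel offers, conda and mamba will report the package as not available, as in the logs above.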
Something goes wrong using the data and code below. According to the docs the format should be right for a molecular dataset?
Code
Data
Output