kalininalab / DataSAIL

DataSAIL is a tool to split datasets while reducing information leakage.
https://datasail.readthedocs.io
MIT License

What to do about samples that are "not selected" to a C2 split in inter.tsv? #29

Open imempress opened 2 weeks ago

imempress commented 2 weeks ago

First of all, thank you for developing DataSAIL. It is very important work and I appreciate your effort in developing and maintaining this tool.

What is your input?

I would like to use DataSAIL to perform a 2D cluster-based split (C2) of proteins (amino acid sequences) and their associated ligands (SMILES). I was able to install all of the relevant dependencies after a bit of troubleshooting (see Additional context below). Based on the docs and the help message, I constructed the following command:

datasail \
    -o 2d_splits \
    -t C2 \
    -s 0.7 0.2 0.1 \
    -n train val test \
    -i interactions.csv \
    --e-data proteins.csv \
    --e-type P \
    --f-data mols.csv \
    --f-type M

What do you observe?

The datasail program runs and generates output. However, it emits several warnings, including one about sub-optimal clustering results:

2024-11-18 20:02:38,370 mmseqs cannot optimally cluster the data. The minimal number of clusters is 1512.
/fsx/home/imiller/miniforge3/envs/datasail/lib/python3.10/site-packages/cvxpy/problems/problem.py:158: UserWarning: Objective contains too many subexpressions. Consider vectorizing your CVXPY code to speed up compilation.
  warnings.warn("Objective contains too many subexpressions. "
/fsx/home/imiller/miniforge3/envs/datasail/lib/python3.10/site-packages/cvxpy/problems/problem.py:1407: UserWarning: Solution may be inaccurate. Try another solver, adjusting the solver settings, or solve with verbose=True for more information.
  warnings.warn(

I'm not sure whether these warnings are cause for concern, but when the program finishes, these are the split counts in inter.tsv:

Split
not selected    11383
train            4409
val              1319
test              848

Most of the interactions (~63%) are "not selected". Is that expected? Could it be related to the clustering warning above, or to the "Solution may be inaccurate" UserWarning? And do you have any suggestions for increasing the fraction of data assigned to splits?
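For reference, the counts and the ~63% figure can be reproduced along these lines. This is a minimal sketch that assumes inter.tsv is tab-separated with a header column named Split (as in the output above); the other column names (E_ID, F_ID) and the helper name split_fractions are placeholders I made up for the demo, not necessarily what DataSAIL writes.

```python
import csv
import io
from collections import Counter


def split_fractions(handle, column="Split"):
    """Count rows per split label and return {label: (count, fraction)}."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter(row[column] for row in reader)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


# Demo input rebuilt from the counts reported above. The E_ID and F_ID
# columns are hypothetical stand-ins for the real identifier columns.
reported = {"not selected": 11383, "train": 4409, "val": 1319, "test": 848}
demo = io.StringIO(
    "E_ID\tF_ID\tSplit\n"
    + "".join(f"p\tm\t{label}\n" * n for label, n in reported.items())
)

result = split_fractions(demo)
for label, (n, frac) in result.items():
    print(f"{label:<14}{n:>7}  {frac:.1%}")
```

On a real run, the same function can be applied with `open("2d_splits/.../inter.tsv", newline="")` in place of the in-memory demo, adjusting the path to wherever the file was written.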

In the associated *_proteins_split.tsv and *_mols_splits.tsv, each entry is assigned to a split (there are no "not selected" values there).
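One possible reading of this, which is my assumption and not something confirmed in this thread: in a two-dimensional split, an interaction can only be placed in a split when its protein and its ligand were both assigned to that same split, and every mixed pair is dropped. The sketch below illustrates that rule with made-up identifiers and a hypothetical helper (label_interactions), and computes how much data would survive if interactions were spread evenly across perfectly sized 0.7/0.2/0.1 splits on both axes.

```python
def label_interactions(interactions, protein_split, mol_split):
    """Label each (protein, ligand) pair under the assumed 2D rule:
    keep the pair only if both entities share the same split."""
    labels = []
    for protein, mol in interactions:
        p_split = protein_split[protein]
        m_split = mol_split[mol]
        labels.append(p_split if p_split == m_split else "not selected")
    return labels


# Hypothetical toy data: every protein and ligand has a split,
# yet some interactions still end up unassigned.
protein_split = {"P1": "train", "P2": "val", "P3": "test"}
mol_split = {"M1": "train", "M2": "val", "M3": "train"}
interactions = [("P1", "M1"), ("P1", "M2"), ("P2", "M2"), ("P3", "M3")]

labels = label_interactions(interactions, protein_split, mol_split)
print(labels)

# Under the idealised assumption of evenly spread interactions, the kept
# fraction is the sum of squared split sizes.
sizes = [0.7, 0.2, 0.1]
expected_kept = sum(s * s for s in sizes)
print(f"expected kept fraction: {expected_kept:.2f}")
```

If that assumption holds, a sizeable share of "not selected" interactions would be inherent to the C2 setting rather than a symptom of the solver warnings, although the roughly 37% retained here is still below the idealised figure.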

For what it's worth, I get similar warning messages and results if I specify cdhit as the protein clustering algorithm.

What do you expect?

I would expect the majority (ideally all) of the data to end up in the train, val, or test splits, rather than being mostly left out of them. That is a lot of data to leave behind when training and evaluating a model.

Additional context

I don't think it's related, but I did have to downgrade numpy based on an error message I got when installing datasail dependencies (it appeared to trace back to a call from the grakel library):

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.3 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.
...
...
...
  File "/fsx/home/imiller/miniforge3/envs/datasail/lib/python3.10/site-packages/grakel/kernels/kernel.py", line 17, in <module>
    from grakel.kernels._c_functions import k_to_ij_triangular
  File "grakel/kernels/_c_functions/functions.pyx", line 1, in init grakel.kernels._c_functions
ImportError: numpy.core.multiarray failed to import (auto-generated because you didn't call 'numpy.import_array()' after cimporting numpy; use '<void>numpy._import_array' to disable if you are certain you don't need it).

Running this command resolved the above issue:

mamba install 'numpy<2'
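To confirm that the downgrade took effect in the active environment, a check along these lines can be used. It is a minimal sketch based only on the standard library's importlib.metadata; the helper names major_version and numpy_is_1x are made up for illustration.

```python
from importlib import metadata


def major_version(version_string):
    """Return the leading integer of a version string such as '1.26.4'."""
    return int(version_string.split(".")[0])


def numpy_is_1x():
    """True if the installed NumPy is 1.x, False if 2.x or newer,
    None if NumPy is not installed in this environment."""
    try:
        return major_version(metadata.version("numpy")) < 2
    except metadata.PackageNotFoundError:
        return None


print("NumPy 1.x installed:", numpy_is_1x())
```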

Please let me know if you need any further information here.