skinniderlab / CLM

numpy.linalg.LinAlgError in calculate_outcomes #238

Closed skinnider closed 2 months ago

skinnider commented 3 months ago

Getting this error on argo, which I haven't seen before:

[Mon Jul 29 07:45:35 2024]
rule calculate_outcomes:
    input: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/samples/structures_SMILES_3_unique_masses.csv.gz, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/samples/known_structures_SMILES_3_unique_masses.csv.gz, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/samples/invalid_structures_SMILES_3_unique_masses.csv.gz, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/inputs/train0_structures_SMILES_3.smi
    output: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/model_evaluation/10/structures_SMILES_3_calculate_outcomes.csv.gz
    jobid: 0
    reason: Forced execution
    wildcards: output_dir=/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05, enum_factor=10, dataset=structures, repr=SMILES, fold=3
    resources: mem_mb=256000, mem_mib=244141, disk_mb=1000, disk_mib=954, tmpdir=/tmp, slurm_partition=main,hoppertest,skinniderlab, runtime=1000

reading NP model ...
model in
(INFO) (__main__.py) (29-Jul-24 07:45:41) CLM vsrc
(WARNING) (functions.py) (29-Jul-24 07:45:42) Not enough molecules for frequency bin '1-1'. Using 74722 molecules.
(WARNING) (functions.py) (29-Jul-24 07:45:42) Not enough molecules for frequency bin '2-2'. Using 9397 molecules.
(WARNING) (functions.py) (29-Jul-24 07:45:42) Not enough molecules for frequency bin '3-10'. Using 12666 molecules.
(WARNING) (functions.py) (29-Jul-24 07:45:42) Not enough molecules for frequency bin '11-30'. Using 4194 molecules.
(WARNING) (functions.py) (29-Jul-24 07:45:42) Not enough molecules for frequency bin '31-100'. Using 1509 molecules.
(WARNING) (functions.py) (29-Jul-24 07:45:42) Not enough molecules for frequency bin '101-'. Using 6534 molecules.
(WARNING) (functions.py) (29-Jul-24 07:45:42) Not enough molecules for 500000, using 109022 instead.
(INFO) (calculate_outcomes.py) (29-Jul-24 07:45:42) Reading training smiles from /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/inputs/train0_structures_SMILES_3.smi
(INFO) (calculate_outcomes.py) (29-Jul-24 07:46:04) Reading sample smiles from                                                    smiles  size  ...  is_novel   bin
0       CCCCCC=CCC=CCC=CCC=CCCCCCC(=O)OC(COC(=O)CCCCC=...  2123  ...     False  101-
1       CCCCCC=CCC=CCC=CCC=CCCCCCC(=O)OC(COC(=O)CCCCC=...  1751  ...     False  101-
2       CCCCCC=CCC=CCC=CCC=CCCCCCC(=O)OC(COC(=O)CCCCC=...  1743  ...     False  101-
3       CCCCCC=CCC=CCC=CCC=CCCCCCC(=O)OC(COC(=O)CCCCC=...  2405  ...     False  101-
4       CCCCCC=CCC=CCC=CCC=CCCCCCC(=O)OC(COC(=O)CCCCC=...  1877  ...     False  101-
...                                                   ...   ...  ...       ...   ...
218039  CCCCCC=CCC=CCC=CCCCCCCCC(=O)OCC(COC(=O)CCCCCCC...     3  ...      True   all
218040  CCCCCCCCCCCCC(=O)OCC(CCC)CCCCCCCCCCCCCCC(=O)OC...     1  ...      True   all
218041  CCCCCC=CCC=CCC=CCCCCCC(=O)OCC(COC(=O)CCCCCCCCC...    11  ...      True   all
218042  CCCCCCCCCCCCCCC(=O)OC(COC(=O)CCCCCCCCCCCCC)COC...     3  ...      True   all
218043  CCCCCC=CCC=CCCCCCCCCCC(=O)OCC(COC(=O)CCCCCCCCC...     2  ...      True   all

[218044 rows x 5 columns]
100%|██████████| 168426/168426 [31:57<00:00, 87.84it/s]
(INFO) (calculate_outcomes.py) (29-Jul-24 08:19:31) 168426 valid SMILES out of 218044
(INFO) (calculate_outcomes.py) (29-Jul-24 08:19:31) 213806 novel SMILES out of 218044
(INFO) (calculate_outcomes.py) (29-Jul-24 08:19:36) Calculating outcomes
(INFO) (calculate_outcomes.py) (29-Jul-24 08:19:36) Calculating outcomes for bin 1-1
Traceback (most recent call last):
  File "/Genomics/skinniderlab/PED-generation/env-clm/lib/python3.10/site-packages/scipy/stats/_kde.py", line 223, in __init__
    self.set_bandwidth(bw_method=bw_method)
  File "/Genomics/skinniderlab/PED-generation/env-clm/lib/python3.10/site-packages/scipy/stats/_kde.py", line 571, in set_bandwidth
    self._compute_covariance()
  File "/Genomics/skinniderlab/PED-generation/env-clm/lib/python3.10/site-packages/scipy/stats/_kde.py", line 583, in _compute_covariance
    self._data_cho_cov = linalg.cholesky(self._data_covariance,
  File "/Genomics/skinniderlab/PED-generation/env-clm/lib/python3.10/site-packages/scipy/linalg/_decomp_cholesky.py", line 89, in cholesky
    c, lower = _cholesky(a, lower=lower, overwrite_a=overwrite_a, clean=True,
  File "/Genomics/skinniderlab/PED-generation/env-clm/lib/python3.10/site-packages/scipy/linalg/_decomp_cholesky.py", line 37, in _cholesky
    raise LinAlgError("%d-th leading minor of the array is not positive "
numpy.linalg.LinAlgError: 1-th leading minor of the array is not positive definite

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Genomics/skinniderlab/PED-generation/env-clm/bin/clm", line 8, in <module>
    sys.exit(main())
  File "/Genomics/argo/users/ms0270/git/CLM/src/clm/__main__.py", line 76, in main
    args.func(args)
  File "/Genomics/argo/users/ms0270/git/CLM/src/clm/commands/calculate_outcomes.py", line 352, in main
    calculate_outcomes(
  File "/Genomics/argo/users/ms0270/git/CLM/src/clm/commands/calculate_outcomes.py", line 341, in calculate_outcomes
    out = calculate_outcomes_dataframe(sample_df, train_df)
  File "/Genomics/argo/users/ms0270/git/CLM/src/clm/commands/calculate_outcomes.py", line 297, in calculate_outcomes_dataframe
    if _out := handle_bin(
  File "/Genomics/argo/users/ms0270/git/CLM/src/clm/commands/calculate_outcomes.py", line 235, in handle_bin
    "Jensen-Shannon distance, TPSA": continuous_JSD(
  File "/Genomics/argo/users/ms0270/git/CLM/src/clm/functions.py", line 302, in continuous_JSD
    org_kde = gaussian_kde(original_dist)
  File "/Genomics/skinniderlab/PED-generation/env-clm/lib/python3.10/site-packages/scipy/stats/_kde.py", line 232, in __init__
    raise linalg.LinAlgError(msg) from e
numpy.linalg.LinAlgError: The data appears to lie in a lower-dimensional subspace of the space in which it is expressed. This has resulted in a singular data covariance matrix, which cannot be treated using the algorithms implemented in `gaussian_kde`. Consider performing principle component analysis / dimensionality reduction and using `gaussian_kde` with the transformed data.
[Mon Jul 29 08:19:43 2024]
Error in rule calculate_outcomes:
    jobid: 0
    input: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/samples/structures_SMILES_3_unique_masses.csv.gz, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/samples/known_structures_SMILES_3_unique_masses.csv.gz, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/samples/invalid_structures_SMILES_3_unique_masses.csv.gz, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/inputs/train0_structures_SMILES_3.smi
    output: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/model_evaluation/10/structures_SMILES_3_calculate_outcomes.csv.gz
    conda-env: clm
    shell:
        clm calculate_outcomes --train_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/inputs/train0_structures_SMILES_3.smi --sampled_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/samples/structures_SMILES_3_unique_masses.csv.gz --known_smiles_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/samples/known_structures_SMILES_3_unique_masses.csv.gz --invalid_smiles_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/samples/invalid_structures_SMILES_3_unique_masses.csv.gz --max_molecules 500000 --seed 12 --output_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/model_evaluation/10/structures_SMILES_3_calculate_outcomes.csv.gz
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
srun: error: argo-29: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=5722433.0
[Mon Jul 29 08:19:44 2024]
Error in rule calculate_outcomes:
    jobid: 0
    input: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/samples/structures_SMILES_3_unique_masses.csv.gz, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/samples/known_structures_SMILES_3_unique_masses.csv.gz, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/samples/invalid_structures_SMILES_3_unique_masses.csv.gz, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/inputs/train0_structures_SMILES_3.smi
    output: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/model_evaluation/10/structures_SMILES_3_calculate_outcomes.csv.gz
    conda-env: clm
    shell:
        clm calculate_outcomes --train_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/inputs/train0_structures_SMILES_3.smi --sampled_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/samples/structures_SMILES_3_unique_masses.csv.gz --known_smiles_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/samples/known_structures_SMILES_3_unique_masses.csv.gz --invalid_smiles_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/10/prior/samples/invalid_structures_SMILES_3_unique_masses.csv.gz --max_molecules 500000 --seed 12 --output_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.05/model_evaluation/10/structures_SMILES_3_calculate_outcomes.csv.gz
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

vineetbansal commented 2 months ago

@skinnider - I ran the calculate_outcomes step locally on these files, and the error seems to arise because all 2119 training SMILES in train0_structures_SMILES_3.smi (all of which are unique and valid) have the same TPSA value (rdkit.Chem.MolSurf.TPSA) of 78.9. With zero variance in that distribution, the data covariance computed inside gaussian_kde is singular, which is what the LinAlgError in the traceback reports. In fact, this is the value reported for all SMILES, even if I use train0_structures_SMILES_4.smi or any other fold.

It seems like this might be related to https://github.com/rdkit/rdkit/discussions/5925. I'm investigating a bit more.
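
For reference, here is a minimal sketch of how a constant column leads to this failure, and one possible way to guard against it. The helper name `safe_gaussian_kde` is hypothetical and not part of CLM, and the TPSA arrays are made-up stand-ins for the real data. It only assumes that `continuous_JSD` passes a 1-D array of descriptor values to `scipy.stats.gaussian_kde`, as the traceback suggests.

```python
import numpy as np
from scipy.stats import gaussian_kde


def safe_gaussian_kde(values):
    """Return a gaussian_kde for `values`, or None if the data is degenerate.

    Hypothetical helper for illustration only; not part of CLM.
    """
    values = np.asarray(values, dtype=float)
    # With fewer than two points, or with zero spread, the data covariance
    # is zero and a Gaussian KDE cannot be fitted.
    if values.size < 2 or np.ptp(values) == 0:
        return None
    try:
        return gaussian_kde(values)
    except np.linalg.LinAlgError:
        # Recent SciPy versions raise this for a singular covariance matrix.
        return None


# Stand-in for the training set described above: every TPSA is 78.9.
constant_tpsa = np.full(2119, 78.9)
# Stand-in for a distribution with some spread (values are made up).
varied_tpsa = np.array([20.2, 37.3, 78.9, 95.1, 110.4])

print(safe_gaussian_kde(constant_tpsa))        # None
print(safe_gaussian_kde(varied_tpsa) is None)  # False
```

Returning `None` pushes the decision to the caller, which could then report the metric as NaN for that bin instead of aborting the whole rule. Whether that is the right behaviour for CLM is a separate question, and it does not address why every molecule gets the same TPSA in the first place.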