KeyError: "['input_smiles'] not found in axis"

skinnider commented 3 months ago

write_structural_prior_CV is failing seemingly because the "input_smiles" column does not exist in the AddCarbon file:

[Thu Jul 18 23:23:27 2024]
rule write_structural_prior_CV:
    input: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/inputs/train0_structures_SMILES_3.smi, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/inputs/test0_structures_SMILES_3.smi, /Genomics/skinniderlab/food-clm/PubChem.tsv, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/samples/structures_SMILES_3_unique_masses.csv.gz, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/inputs/train0_structures_SMILES_3_carbon.csv.gz
    output: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_3_CV_ranks_structure.csv.gz, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_3_CV_tc.csv.gz
    jobid: 0
    reason: Forced execution
    wildcards: output_dir=/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30, enum_factor=30, dataset=structures, repr=SMILES, fold=3
    resources: mem_mb=64000, mem_mib=61036, disk_mb=12342, disk_mib=11771, tmpdir=/tmp, slurm_partition=main,hoppertest,skinniderlab, runtime=1015

reading NP model ...
model in
(INFO) (__main__.py) (18-Jul-24 23:23:30) CLM vsrc
  0%|          | 0/1 [00:00<?, ?it/s]/Genomics/argo/users/ms0270/git/CLM/src/clm/functions.py:466: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  df = pd.concat([df, pd.DataFrame(chunk_data)], ignore_index=True)
100%|██████████| 1/1 [00:19<00:00, 19.31s/it]
  0%|          | 0/1 [00:00<?, ?it/s]/Genomics/argo/users/ms0270/git/CLM/src/clm/functions.py:466: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  df = pd.concat([df, pd.DataFrame(chunk_data)], ignore_index=True)
100%|██████████| 1/1 [00:04<00:00,  4.67s/it]
(INFO) (write_structural_prior_CV.py) (18-Jul-24 23:24:02) Reading PubChem file
(INFO) (write_structural_prior_CV.py) (18-Jul-24 23:25:43) Reading sample file from generative model
/Genomics/argo/users/ms0270/git/CLM/src/clm/functions.py:524: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  return pd.read_csv(filename, compression=compression, **kwargs)
Traceback (most recent call last):
  File "/Genomics/skinniderlab/PED-generation/env-clm/bin/clm", line 8, in <module>
    sys.exit(main())
  File "/Genomics/argo/users/ms0270/git/CLM/src/clm/__main__.py", line 76, in main
    args.func(args)
  File "/Genomics/argo/users/ms0270/git/CLM/src/clm/commands/write_structural_prior_CV.py", line 297, in main
    write_structural_prior_CV(
  File "/Genomics/argo/users/ms0270/git/CLM/src/clm/commands/write_structural_prior_CV.py", line 253, in write_structural_prior_CV
    addcarbon.drop(columns="input_smiles", inplace=True)
  File "/Genomics/skinniderlab/PED-generation/env-clm/lib/python3.10/site-packages/pandas/core/frame.py", line 5581, in drop
    return super().drop(
  File "/Genomics/skinniderlab/PED-generation/env-clm/lib/python3.10/site-packages/pandas/core/generic.py", line 4788, in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
  File "/Genomics/skinniderlab/PED-generation/env-clm/lib/python3.10/site-packages/pandas/core/generic.py", line 4830, in _drop_axis
    new_axis = axis.drop(labels, errors=errors)
  File "/Genomics/skinniderlab/PED-generation/env-clm/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 7070, in drop
    raise KeyError(f"{labels[mask].tolist()} not found in axis")
KeyError: "['input_smiles'] not found in axis"
[Thu Jul 18 23:25:54 2024]
Error in rule write_structural_prior_CV:
    jobid: 0
    input: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/inputs/train0_structures_SMILES_3.smi, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/inputs/test0_structures_SMILES_3.smi, /Genomics/skinniderlab/food-clm/PubChem.tsv, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/samples/structures_SMILES_3_unique_masses.csv.gz, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/inputs/train0_structures_SMILES_3_carbon.csv.gz
    output: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_3_CV_ranks_structure.csv.gz, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_3_CV_tc.csv.gz
    conda-env: clm
    shell:
        clm write_structural_prior_CV --ranks_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_3_CV_ranks_structure.csv.gz --tc_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_3_CV_tc.csv.gz --train_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/inputs/train0_structures_SMILES_3.smi --test_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/inputs/test0_structures_SMILES_3.smi --pubchem_file /Genomics/skinniderlab/food-clm/PubChem.tsv --sample_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/samples/structures_SMILES_3_unique_masses.csv.gz --err_ppm 10 --seed 42 --carbon_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/inputs/train0_structures_SMILES_3_carbon.csv.gz --top_n 30
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
srun: error: argo-07: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=5633068.0

Indeed, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/inputs/train0_structures_SMILES_3_carbon.csv.gz has no header (which I think write_structural_prior_CV assumes?) and contains only 920 rows, vs. 42236 in the input SMILES file (so AddCarbon should have substantially more than this).
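
For future debugging, a check like the one described above can be scripted. This is only a rough sketch using the Python standard library: the helper name `inspect_carbon_file` is hypothetical (it is not part of CLM), and it assumes the carbon file is a gzipped, comma-delimited file whose first line should be a header containing `input_smiles`. Adjust `delimiter` if the real file uses something else.

```python
import csv
import gzip


def inspect_carbon_file(path, required_column="input_smiles", delimiter=","):
    """Return (has_required_column, n_data_rows) for a gzipped delimited file.

    Assumption: the first line is meant to be a header row.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        first_row = next(reader, None)
        if first_row is None:
            return False, 0  # the file is empty
        has_column = required_column in first_row
        n_rows = sum(1 for _ in reader)
    # If the header is missing, the first line was actually a data row.
    if not has_column:
        n_rows += 1
    return has_column, n_rows
```

Comparing the returned row count against the number of lines in the input `.smi` file would flag a truncated AddCarbon output before `write_structural_prior_CV` reaches the `drop(columns="input_smiles")` call.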

I'm wondering if something went wrong with the AddCarbon step but somehow a file was written anyway?

skinnider commented 3 months ago

Looking at the slurm logs, it seems like this job ran out of time, but because the train0_structures_SMILES_3_carbon.csv.gz file was not empty, the next rule was executed. Is there a way to also require that the -unique.smi file written at the end of rule add_carbon also exists?

skinnider commented 3 months ago

More generally, is it worth building in some robustness checks to ensure that the preceding job actually completed, and not just that it wrote some output to a file? For example, looking at sample_molecules_RNN: once a single row has been written to the CSV file, are the next jobs free to execute, regardless of whether something goes wrong halfway through the job (e.g., the job is killed by slurm)?
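
One generic way to get this kind of guarantee, independent of the workflow manager, is to write to a temporary file and rename it to the final name only after the job has finished. The sketch below is a general pattern, not what CLM currently does; `write_rows_atomically` is a hypothetical helper that takes an iterable of already-formatted lines.

```python
import os
import tempfile


def write_rows_atomically(rows, final_path):
    """Write lines to a temp file, then rename it into place only on success."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # Create the temp file in the same directory so the rename stays on one
    # filesystem, which is what makes os.replace atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".partial")
    try:
        with os.fdopen(fd, "w") as handle:
            for row in rows:
                handle.write(row + "\n")
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, final_path)
    except BaseException:
        # Remove the partial file; final_path is never created half-written.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process is killed outright (e.g. SIGKILL after a slurm timeout), the cleanup branch does not run and a stray `.partial` file may be left behind, but the final output path still does not exist, so a downstream step that checks for it would not start on truncated data.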

skinnider commented 3 months ago

I'm going to delete these files and resubmit the jobs with more wall time, but attaching the files and some representative logs here for future reference. AddCarbon files and logs.zip

vineetbansal commented 3 months ago

@skinnider - The input/output files are used to generate a dependency graph before the workflow starts executing. snakemake will never execute a downstream rule unless every upstream rule it depends on (directly or indirectly) has completed with an exit code of 0. So a rule is free to stream to the output file it's supposed to generate. If it runs out of time, otherwise errors out, or fails to produce the output files that were part of the dependency graph, the rule is marked as failed (and downstream rules remain pending), and any output files it generated that were part of the dependency graph are deleted by snakemake.

A rule can generate other files that are not part of the dependency graph - these files are not touched by snakemake.

So I suspect there's something else going on here other than the slurm timeout. I'm investigating and will keep you posted.
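
To make the behaviour described in this comment concrete, here is a toy model in plain Python. It is only an illustration of the semantics as stated above and is not Snakemake's actual implementation; `run_rule` and its arguments are hypothetical names.

```python
import os
import subprocess


def run_rule(command, declared_outputs):
    """Toy model: run a command; on failure, delete its declared outputs.

    A rule counts as successful only if the command exits with code 0 and
    every declared output file exists afterwards.
    """
    result = subprocess.run(command)
    succeeded = result.returncode == 0 and all(
        os.path.exists(path) for path in declared_outputs
    )
    if not succeeded:
        # Partial outputs that were declared up front are removed, so a
        # downstream rule can never pick them up.
        for path in declared_outputs:
            if os.path.exists(path):
                os.remove(path)
    return succeeded
```

In this model a downstream step would run only when `run_rule` returns `True`; files the command writes that are not listed in `declared_outputs` are left untouched.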

vineetbansal commented 3 months ago

@skinnider - in the log files you attached here, I see:

(INFO) (__main__.py) (18-Jul-24 23:22:44) CLM vsrc

This tells me that you're using clm without pip installing it first. This is fine, but it makes it tricky to find out exactly which version you used to run the workflow. Would you mind uploading your version of add_carbon.py here? In the current version on master, I'm not seeing how any *carbon.gz file could possibly be generated without a header line (like the ones I see here).

skinnider commented 3 months ago

I do have the CLM package pip install'd, but was running from within ~/git/CLM. Does it run from source by default in that case? Regardless, here's my copy of add_carbon.py: add_carbon.py.zip

skinnider commented 3 months ago

As far as I can tell, the only commit between the error (Jul 19 am) and successful re-submission of the same jobs/DAG (Jul 20 or 21) was to increase runtime and memory for rule add_carbon.

skinnider commented 2 months ago

Per discussion on Zoom:
https://github.com/skinniderlab/CLM/blob/master/src/clm/commands/sample_molecules_RNN.py#L109C13-L109C30
https://github.com/skinniderlab/CLM/blob/master/src/clm/commands/tabulate_molecules.py#L81

vineetbansal commented 2 months ago

So it does look like the add_carbon step is the only place in our codebase that takes the else branch in write_to_csv_file (i.e. the only place where the input is not a DataFrame). I'll open a PR on this soon.
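
The body of `write_to_csv_file` is not shown in this thread, so the following is only a guess at the kind of fix involved. Assuming the non-DataFrame path appends raw rows without ever writing a header, one way to guarantee a header is to write it whenever the target file is new or empty. `append_rows` is a hypothetical stand-in, not the real function, and it writes plain (uncompressed) CSV for simplicity.

```python
import csv
import os


def append_rows(path, rows, columns):
    """Append rows (lists of values) to a CSV, writing the header only once."""
    # A header is needed only if the file does not exist yet or is empty.
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle)
        if needs_header:
            writer.writerow(columns)
        writer.writerows(rows)
```

Calling this repeatedly for successive chunks yields a single header line followed by all rows, so a later `drop(columns="input_smiles")` would find the column it expects.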

skinnider commented 2 months ago

btw, still getting this same error with newly-submitted jobs. Providing the filepaths in the log below in case they are useful to test the PR:

[Tue Aug 13 15:54:50 2024]
rule write_structural_prior_CV:
    input: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/inputs/train0_structures_SMILES_0.smi, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/inputs/test0_structures_SMILES_0.smi, /Genomics/skinniderlab/food-clm/PubChem.tsv, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/samples/structures_SMILES_0_unique_masses.csv.gz, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/inputs/train0_structures_SMILES_0_carbon.csv.gz
    output: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/structural_prior/structures_SMILES_0_CV_ranks_structure.csv.gz, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/structural_prior/structures_SMILES_0_CV_tc.csv.gz
    jobid: 0
    reason: Forced execution
    wildcards: output_dir=/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3, enum_factor=100, dataset=structures, repr=SMILES, fold=0
    resources: mem_mb=64000, mem_mib=61036, disk_mb=12214, disk_mib=11649, tmpdir=/tmp, slurm_partition=main,hoppertest,skinniderlab, runtime=1015

reading NP model ...
model in
(INFO) (__main__.py) (13-Aug-24 15:54:53) CLM vsrc
  0%|          | 0/1 [00:00<?, ?it/s]/Genomics/argo/users/ms0270/git/CLM/src/clm/functions.py:466: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  df = pd.concat([df, pd.DataFrame(chunk_data)], ignore_index=True)
100%|██████████| 1/1 [00:06<00:00,  6.71s/it]
  0%|          | 0/1 [00:00<?, ?it/s]/Genomics/argo/users/ms0270/git/CLM/src/clm/functions.py:466: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  df = pd.concat([df, pd.DataFrame(chunk_data)], ignore_index=True)
100%|██████████| 1/1 [00:01<00:00,  1.58s/it]
(INFO) (write_structural_prior_CV.py) (13-Aug-24 15:55:04) Reading PubChem file
(INFO) (write_structural_prior_CV.py) (13-Aug-24 15:57:04) Reading sample file from generative model
/Genomics/argo/users/ms0270/git/CLM/src/clm/functions.py:524: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  return pd.read_csv(filename, compression=compression, **kwargs)
Traceback (most recent call last):
  File "/Genomics/skinniderlab/PED-generation/env-clm/bin/clm", line 8, in <module>
    sys.exit(main())
  File "/Genomics/argo/users/ms0270/git/CLM/src/clm/__main__.py", line 76, in main
    args.func(args)
  File "/Genomics/argo/users/ms0270/git/CLM/src/clm/commands/write_structural_prior_CV.py", line 297, in main
    write_structural_prior_CV(
  File "/Genomics/argo/users/ms0270/git/CLM/src/clm/commands/write_structural_prior_CV.py", line 253, in write_structural_prior_CV
    addcarbon.drop(columns="input_smiles", inplace=True)
  File "/Genomics/skinniderlab/PED-generation/env-clm/lib/python3.10/site-packages/pandas/core/frame.py", line 5581, in drop
    return super().drop(
  File "/Genomics/skinniderlab/PED-generation/env-clm/lib/python3.10/site-packages/pandas/core/generic.py", line 4788, in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
  File "/Genomics/skinniderlab/PED-generation/env-clm/lib/python3.10/site-packages/pandas/core/generic.py", line 4830, in _drop_axis
    new_axis = axis.drop(labels, errors=errors)
  File "/Genomics/skinniderlab/PED-generation/env-clm/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 7070, in drop
    raise KeyError(f"{labels[mask].tolist()} not found in axis")
KeyError: "['input_smiles'] not found in axis"
[Tue Aug 13 15:57:12 2024]
Error in rule write_structural_prior_CV:
    jobid: 0
    input: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/inputs/train0_structures_SMILES_0.smi, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/inputs/test0_structures_SMILES_0.smi, /Genomics/skinniderlab/food-clm/PubChem.tsv, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/samples/structures_SMILES_0_unique_masses.csv.gz, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/inputs/train0_structures_SMILES_0_carbon.csv.gz
    output: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/structural_prior/structures_SMILES_0_CV_ranks_structure.csv.gz, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/structural_prior/structures_SMILES_0_CV_tc.csv.gz
    conda-env: clm
    shell:
        clm write_structural_prior_CV --ranks_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/structural_prior/structures_SMILES_0_CV_ranks_structure.csv.gz --tc_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/structural_prior/structures_SMILES_0_CV_tc.csv.gz --train_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/inputs/train0_structures_SMILES_0.smi --test_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/inputs/test0_structures_SMILES_0.smi --pubchem_file /Genomics/skinniderlab/food-clm/PubChem.tsv --sample_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/samples/structures_SMILES_0_unique_masses.csv.gz --err_ppm 10 --seed 42 --carbon_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/inputs/train0_structures_SMILES_0_carbon.csv.gz --top_n 30
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
srun: error: argo-28: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=5898275.0
[Tue Aug 13 15:57:15 2024]
Error in rule write_structural_prior_CV:
    jobid: 0
    input: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/inputs/train0_structures_SMILES_0.smi, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/inputs/test0_structures_SMILES_0.smi, /Genomics/skinniderlab/food-clm/PubChem.tsv, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/samples/structures_SMILES_0_unique_masses.csv.gz, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/inputs/train0_structures_SMILES_0_carbon.csv.gz
    output: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/structural_prior/structures_SMILES_0_CV_ranks_structure.csv.gz, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/structural_prior/structures_SMILES_0_CV_tc.csv.gz
    conda-env: clm
    shell:
        clm write_structural_prior_CV --ranks_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/structural_prior/structures_SMILES_0_CV_ranks_structure.csv.gz --tc_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/structural_prior/structures_SMILES_0_CV_tc.csv.gz --train_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/inputs/train0_structures_SMILES_0.smi --test_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/inputs/test0_structures_SMILES_0.smi --pubchem_file /Genomics/skinniderlab/food-clm/PubChem.tsv --sample_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/samples/structures_SMILES_0_unique_masses.csv.gz --err_ppm 10 --seed 42 --carbon_file /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=ecfp4-metric=tc-optimizer=feature_based-rarefaction=0.3/100/prior/inputs/train0_structures_SMILES_0_carbon.csv.gz --top_n 30
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

skinnider commented 1 month ago

I've now re-run 4 sets of Snakemake runs (5 enum factors each) that were giving these errors before with no issues so far, so I'm going to close this and #238. Thanks for your help with both @vineetbansal!

KeyError: "['input_smiles'] not found in axis" #234