skinniderlab / CLM

MIT License
MissingInputException #228

Closed: skinnider closed this issue 1 month ago

skinnider commented 1 month ago

I'm starting to test the workflow by submitting some real jobs, and I'm getting the following error:

(/Genomics/skinniderlab/PED-generation/env-clm) [ms0270@argo-beta CLM]$ snakemake --configfile /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/config.yaml --jobs 10 &
[1] 532561
(/Genomics/skinniderlab/PED-generation/env-clm) [ms0270@argo-beta CLM]$ Building DAG of jobs...
MissingInputException in rule plot_topk in file /Genomics/argo/users/ms0270/git/CLM/workflow/Snakefile, line 331:
Missing input files for rule plot_topk:
    output: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/model_evaluation/plot/30/topk
    wildcards: enum_factor=30
    affected files:
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_freq-avg_CV_tc.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_freq-avg_CV_ranks_structure.csv.gz

My config.yaml: config.yaml.zip

I don't see the 'affected files' in the YAML. Are these potentially hard-coded somewhere else? Or is there a field missing from the default config.yaml?

skinnider commented 1 month ago

Looking at workflow/Snakefile_data more closely, is this because there are filepaths set in this file (under rule data:) rather than the config?

vineetbansal commented 1 month ago

Ah - yes I can recreate the problem locally. @skinnider - let me get back to you on this shortly.

vineetbansal commented 1 month ago

@skinnider - this seems to be happening because in your config.yaml, the specification of representations has gone from being a list of strings to a string. This is not the only place this has happened, but this seems to be the culprit (see screenshot). In general, I think it's safest to keep each field's original datatype, even if you need just a single value rather than several.

For example, you might want to use:

representations:
  - SMILES
...
enum_factors:
  - 30
...

[Image: Screenshot from 2024-07-16 17-00-22]
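To illustrate why the datatype matters: in Python, iterating over a string yields its characters, so any code that loops over a config value expecting a list behaves differently when handed a bare string. The sketch below is a generic illustration and is not taken from the CLM workflow; `as_list` is a hypothetical helper, and the `config` dict stands in for a parsed config.yaml.

```python
def as_list(value):
    """Wrap a scalar config value in a list; leave lists and tuples as lists."""
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


# Stand-in for a parsed config where single values were written as scalars.
config = {"representations": "SMILES", "enum_factors": 30}

# Looping over the bare string walks its characters, not one representation.
chars = [r for r in config["representations"]]

# Normalizing first restores the expected one-element lists.
representations = as_list(config["representations"])
enum_factors = as_list(config["enum_factors"])

print(chars)            # ['S', 'M', 'I', 'L', 'E', 'S']
print(representations)  # ['SMILES']
print(enum_factors)     # [30]
```

A check like this could run right after the config is loaded, so a scalar written by an external tool is caught before any rule is evaluated.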

vineetbansal commented 1 month ago

@skinnider - to your point - yes, Snakefile does have hardcoded paths instead of looking at config.yaml, unlike Snakefile_data. This needs to be fixed going forward, and this may still be an issue in your tailored config.yaml file. But I doubt this is the source of the error you're seeing here.

Perhaps you could run the data generation part of the pipeline first:

snakemake --snakefile Snakefile_data --configfile ..

and see if that part runs to completion?

skinnider commented 1 month ago

My mistake - I'm trying to edit these config files programmatically in R and missed that the data types had changed.

I fixed the config.yaml file such that the first few lines look like this:

representations:
- SMILES
folds: 5
train_seeds:
- 0
sample_seeds:
- 0
enum_factors:
- 30
[...]

However, running only the Snakefile_data part of the pipeline is still giving a MissingInputException:

(/Genomics/skinniderlab/PED-generation/env-clm) [ms0270@argo-beta CLM]$ snakemake --snakefile workflow/Snakefile_data --configfile /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/config.yaml --jobs 10 &
[1] 851724
(/Genomics/skinniderlab/PED-generation/env-clm) [ms0270@argo-beta CLM]$ Building DAG of jobs...
MissingInputException in rule data in file /Genomics/argo/users/ms0270/git/CLM/workflow/Snakefile_data, line 34:
Missing input files for rule data:
    affected files:
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_2_CV_ranks_structure.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_0_CV_ranks_structure.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_freq-avg_CV_tc.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_freq-sum_CV_ranks_structure.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min3_all_fp10k_CV_ranks_structure.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min4_all_freq-avg_CV_ranks_structure.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min2_all_fp10k_CV_ranks_structure.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_2_CV_ranks_formula.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_fp10k_CV_tc.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min2_all_freq-sum_CV_ranks_structure.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_freq-avg_CV_ranks_structure.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min2_all_freq-avg_CV_tc.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_freq-sum_CV_tc.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_fp10k_CV_ranks_structure.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min3_all_fp10k_CV_tc.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min3_all_freq-sum_CV_tc.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_4_CV_ranks_structure.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_4_CV_tc.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min3_all_freq-avg_CV_ranks_structure.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min4_all_freq-avg_CV_tc.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_3_CV_ranks_structure.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min2_all_fp10k_CV_tc.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min4_all_fp10k_CV_ranks_structure.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_1_CV_ranks_formula.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min3_all_freq-sum_CV_ranks_structure.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min2_all_freq-avg_CV_ranks_structure.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_3_CV_ranks_formula.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min2_all_freq-sum_CV_tc.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min3_all_freq-avg_CV_tc.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_1_CV_ranks_structure.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min4_all_fp10k_CV_tc.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_1_CV_tc.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_0_CV_ranks_formula.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_2_CV_tc.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min4_all_freq-sum_CV_ranks_structure.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_4_CV_ranks_formula.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min4_all_freq-sum_CV_tc.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_3_CV_tc.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_0_CV_tc.csv.gz

skinnider commented 1 month ago

I think the issue is the hardcoded paths in the Snakefile. If I replace some of these with the corresponding entries from the config file as captured in PATHS (cf. https://github.com/skinniderlab/CLM/blob/mas/config-filepaths/workflow/Snakefile_data) and run the exact same command, I see that the "missing input files" errors disappear:

(/Genomics/skinniderlab/PED-generation/env-clm) [ms0270@argo-beta CLM]$ snakemake --configfile /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/config.yaml --jobs 10 &
[1] 866904
(/Genomics/skinniderlab/PED-generation/env-clm) [ms0270@argo-beta CLM]$ Building DAG of jobs...
MissingInputException in rule plot_topk in file /Genomics/argo/users/ms0270/git/CLM/workflow/Snakefile, line 331:
Missing input files for rule plot_topk:
    output: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/model_evaluation/plot/30/topk
    wildcards: enum_factor=30
    affected files:
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_freq-avg_CV_ranks_structure.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_freq-avg_CV_tc.csv.gz

Is there any reason not to replace the hardcoded paths with the values in the config file?

vineetbansal commented 1 month ago

You're right - there's no reason we shouldn't be getting it from the config. PR #230 fixes this, and can be merged if the CI passes.

skinnider commented 1 month ago

Oops @vineetbansal - I just reran the workflow and maybe shouldn't have merged the PR so quickly. I think there are a few more hardcoded paths:

in Snakefile_data:

in Snakefile:

(/Genomics/skinniderlab/PED-generation/env-clm) [ms0270@argo-beta CLM]$ snakemake --configfile /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/config.yaml --jobs 10 &
[1] 1188570
(/Genomics/skinniderlab/PED-generation/env-clm) [ms0270@argo-beta CLM]$ Building DAG of jobs...
MissingInputException in rule plot_topk in file /Genomics/argo/users/ms0270/git/CLM/workflow/Snakefile, line 331:
Missing input files for rule plot_topk:
    output: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/model_evaluation/plot/30/topk
    wildcards: enum_factor=30
    affected files:
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_freq-avg_CV_ranks_structure.csv.gz
        /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_freq-avg_CV_tc.csv.gz

skinnider commented 1 month ago

Maybe this is on me for trying to change the directory structure and filenames - is the idea that these should be fixed?

vineetbansal commented 1 month ago

Yes - the ideal thing to do here would be to have all input/output file paths come from the config, so you could change them as you are doing in your case (as long as you preserve the same wildcards in the path patterns). Though we didn't anticipate users modifying that section at all, it should be fair game to do so.

I tried your config.yaml and it was giving me a clean --dry-run, which is why I thought I'd taken care of the rules that might be affected. Can you post your config.yaml again here (with the fixed dtypes)? I'll go through it again and make sure I don't miss any.
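The "preserve the same wildcards" condition can be checked mechanically. The sketch below compares the two `preprocess_output` patterns that appear in this thread (the original one with `{dataset}` and the edited one without it). It assumes plain `{name}` wildcards with no regex constraints; `wildcards` and `missing_wildcards` are hypothetical helpers, not part of the CLM workflow or of Snakemake.

```python
from string import Formatter


def wildcards(pattern):
    """Return the set of {name} fields that appear in a path pattern."""
    return {name for _, name, _, _ in Formatter().parse(pattern) if name}


def missing_wildcards(default_pattern, custom_pattern):
    """Wildcards present in the default pattern but absent from the custom one."""
    return wildcards(default_pattern) - wildcards(custom_pattern)


# Both patterns are quoted in this thread for preprocess_output.
default = "{output_dir}/prior/raw/{dataset}.txt"
custom = "{output_dir}/raw/dataset.txt"

dropped = missing_wildcards(default, custom)
print(dropped)  # {'dataset'}
```

Changing the directory layout keeps `dropped` empty; it only becomes non-empty when a wildcard is removed from the pattern.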

skinnider commented 1 month ago

representations:
- SMILES
folds: 5
train_seeds:
- 0
sample_seeds:
- 0
enum_factors:
- 30
max_input_smiles: 0
model_params:
  rnn_type: LSTM
  embedding_size: 128
  hidden_size: 1024
  n_layers: 3
  dropout: 0
  batch_size: 64
  learning_rate: 0.001
  max_epochs: 999999
  patience: 50000
  log_every_steps: 100
  log_every_epochs: 1
  sample_mols: 10000000
metrics:
- freq-sum
- freq-avg
- fp10k
min_tc: 0
top_k: 30
err_ppm: 10
structural_prior_min_freq:
- 1
- 2
- 3
- 4
random_seed: 42
paths:
  output_dir: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30
  dataset: /Genomics/skinniderlab/food-clm/inputs/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1/structures.smi
  pubchem_tsv_file: /Genomics/skinniderlab/food-clm/PubChem.tsv
  preprocess_output: '{output_dir}/raw/dataset.txt'
  train_file: '{output_dir}/inputs/train_{fold}.smi'
  vocab_file: '{output_dir}/inputs/train_{fold}.vocabulary'
  model_file: '{output_dir}/models/{fold}_{train_seed}_model.pt'
  input_file: '{output_dir}/samples/{fold}_{train_seed}_{sample_seed}_samples.csv.gz'
  train0_file: '{output_dir}/inputs/train0_{fold}.smi'
  test0_file: '{output_dir}/inputs/test0_{fold}.smi'
  sample_file: '{output_dir}/samples/{fold}_unique_masses.csv.gz'
  carbon_file: '{output_dir}/inputs/train0_{fold}_carbon.csv.gz'
  train_all_file: '{output_dir}/inputs/train_all.smi'
  test_all_file: '{output_dir}/inputs/test_all.smi'
  carbon_all_file: '{output_dir}/inputs/train_carbon_all.csv.gz'
  cv_ranks_file: '{output_dir}/structural_prior/{fold}_CV_ranks_structure.csv.gz'
  cv_tc_file: '{output_dir}/structural_prior/{fold}_CV_tc.csv.gz'
  formula_ranks_file: '{output_dir}/structural_prior/{fold}_CV_ranks_formula.csv.gz'
  process_tabulated_output: '{output_dir}/samples/processed_min{min_freq}_{metric}.csv.gz'
  loss_file: '{output_dir}/models/{fold}_{train_seed}_loss.csv.gz'
  tabulate_molecules_output: '{output_dir}/samples/{fold}_{train_seed}_{sample_seed}_samples_masses.csv.gz'
  collect_tabulated_output: '{output_dir}/samples/{fold}_unique_masses.csv.gz'
  overall_ranks_file: '{output_dir}/structural_prior/min{min_freq}_all_{metric}_CV_ranks_structure.csv.gz'
  overall_tc_file: '{output_dir}/structural_prior/min{min_freq}_all_{metric}_CV_tc.csv.gz'
  known_smiles_file: '{output_dir}/samples/known_{fold}_{train_seed}_{sample_seed}_samples_masses.csv.gz'
  invalid_smiles_file: '{output_dir}/samples/invalid_{fold}_{train_seed}_{sample_seed}_samples_masses.csv.gz'
  collect_known_smiles: '{output_dir}/samples/known_{fold}_unique_masses.csv.gz'
  collect_invalid_smiles: '{output_dir}/samples/invalid_{fold}_unique_masses.csv.gz'

vineetbansal commented 1 month ago

@skinnider - @anushka255 and I looked at this workflow and our conclusion here is that you can't really remove the wildcards (anything in {}) in the file paths, even though you can mess around with the rest of the path. This is because the rules expect to be able to determine the value of these wildcards, and removing them confuses the workflow (because it tries to use regular expressions under the hood to determine the values of the rest of the wildcards).

The wildcards that you've removed here are enum_factor, dataset, and repr. While we can remove support for dataset and repr if they're getting in the way (and we haven't tested the workflow with repr=SELFIES at all), I think we do need enum_factor in there.
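As a rough illustration of the regular-expression matching described above, the sketch below turns a path pattern into a regex with one named group per wildcard and uses it to recover wildcard values from a concrete path. It is a simplified stand-in, not Snakemake's actual implementation: it assumes each wildcard name appears only once per pattern and has no constraints. The patterns and the `out/` prefix are hypothetical; the file name follows the ones in the error messages above.

```python
import re


def pattern_to_regex(pattern):
    """Turn '{name}' wildcards into named groups and escape the literal text."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        # re.split with one capture group alternates literal text and names.
        regex += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
    return re.compile(regex + "$")


def infer_wildcards(pattern, path):
    """Return the wildcard values that make `pattern` equal `path`, or None."""
    match = pattern_to_regex(pattern).match(path)
    return match.groupdict() if match else None


requested = "out/30/prior/structural_prior/structures_SMILES_0_CV_tc.csv.gz"

# Hypothetical pattern that keeps all wildcards: the values can be recovered.
full = "out/{enum_factor}/prior/structural_prior/{dataset}_{repr}_{fold}_CV_tc.csv.gz"
values = infer_wildcards(full, requested)

# Hypothetical pattern with enum_factor removed: the requested path no longer matches.
reduced = "out/prior/structural_prior/{dataset}_{repr}_{fold}_CV_tc.csv.gz"
no_match = infer_wildcards(reduced, requested)
```

Here `values` maps `enum_factor`, `dataset`, `repr`, and `fold` to `30`, `structures`, `SMILES`, and `0`, while `no_match` is `None`: no rule output with that pattern can supply the requested file, which is how a missing-input error can arise.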

vineetbansal commented 1 month ago

@skinnider - your point about tweaking Snakefile and Snakefile_data still stands, in addition to the observation above. A PR on that is coming soon.

skinnider commented 1 month ago

Got it. I am still not really sure I understand what's going on here... e.g., why does the preprocess_output path require just one wildcard, dataset (preprocess_output: '{output_dir}/prior/raw/{dataset}.txt')? But in any case I restored all of the original filepaths and was able to get a test job running via slurm without errors.

vineetbansal commented 1 month ago

That's because the preprocess step only involves cleanup of the input dataset (no augmentation just yet), and is independent of the enum_factor, and all subsequent steps can work with the identical input.

skinnider commented 1 month ago

I guess to be more clear, my question was: when I rewrite that line as something like this...

preprocess_output: '{output_dir}/prior/raw/my-dataset-name.txt'

... I still get an error related to the missing {dataset} wildcard. Where in the code is it specified that this rule needs access to {dataset}?

vineetbansal commented 1 month ago

You're probably talking about the presence of (or the downstream effects of):

rule create_training_sets:
    ...
    input:
        "{output_dir}/prior/raw/{dataset}.txt"
    ...

where the input needs to match the preprocess_output value in the config, so the latter cannot be changed independently. This should be fixed now that PR #231 has been merged, so it should work on the master branch, even with your modifications.
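The mismatch can be sketched in a few lines of Python. The producing rule writes to whatever `preprocess_output` says in the config, while a consumer with a hardcoded input asks for a different file. One way to avoid this is for both sides to read the same config entry; whether PR #231 does exactly that is not shown in this thread. `resolve` is a hypothetical helper, and the wildcard values are made up for illustration.

```python
def resolve(pattern, **wildcards):
    """Fill a path pattern; wildcard values the pattern does not use are ignored."""
    return pattern.format(**wildcards)


# Config entry as edited in the comment above (no {dataset} wildcard).
config_paths = {"preprocess_output": "{output_dir}/prior/raw/my-dataset-name.txt"}

# Input pattern hardcoded in the consuming rule, as quoted above.
hardcoded_input = "{output_dir}/prior/raw/{dataset}.txt"

# Hypothetical wildcard values.
values = {"output_dir": "out", "dataset": "structures"}

produced = resolve(config_paths["preprocess_output"], **values)
requested_hardcoded = resolve(hardcoded_input, **values)
requested_from_config = resolve(config_paths["preprocess_output"], **values)

print(produced)             # out/prior/raw/my-dataset-name.txt
print(requested_hardcoded)  # out/prior/raw/structures.txt
```

With the hardcoded input, the requested file is never produced; when the consumer reads the same config entry, `requested_from_config` equals `produced`.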

skinnider commented 1 month ago

got it. Thanks for the clarification.