skinnider closed this 1 month ago
Looking at workflow/Snakefile_data more closely, is this because there are filepaths set in this file (under rule data:) rather than the config?
Ah - yes I can recreate the problem locally. @skinnider - let me get back to you on this shortly.
@skinnider - this seems to be happening because in your config.yaml, the specification of representations has gone from being a list of strings to a string. This is not the only place this has happened, but this seems to be the culprit (see screenshot). In general, I think it's safest to keep the same datatype, even if you just need a single value, not multiple.
For example, you might want to use:
representations:
- SMILES
...
enum_factors:
- 30
...
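To illustrate why the datatype matters: in Python, iterating over a string yields its characters, so code that loops over a config value expecting a list behaves very differently when handed a bare string. This is only a sketch. `as_list` is a hypothetical helper, not something in the workflow, and whether this is the exact failure path inside the Snakefile is an assumption.

```python
def as_list(value):
    """Wrap a scalar config value in a list; leave lists and tuples alone."""
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


# Config values accidentally written as scalars instead of lists.
config = {"representations": "SMILES", "enum_factors": 30}

# Looping over the raw string visits single characters, not one representation.
chars = [r for r in config["representations"]]

# Coercing first gives the one-element lists that list-based code expects.
representations = as_list(config["representations"])
enum_factors = as_list(config["enum_factors"])

print(chars)            # ['S', 'M', 'I', 'L', 'E', 'S']
print(representations)  # ['SMILES']
print(enum_factors)     # [30]
```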
@skinnider - to your point - yes, Snakefile does have hardcoded paths instead of looking at config.yaml, unlike Snakefile_data. This needs to be fixed going forward, and this may still be an issue in your tailored config.yaml file. But I doubt this is the source of the error you're seeing here.
Perhaps you'd consider running the data generation part of the pipeline first:
snakemake --snakefile Snakefile_data --configfile ..
and see if that part runs to completion?
My mistake - I'm trying to edit these config files programmatically in R and missed that the data types had changed.
I fixed the config.yaml file such that the first few lines look like this:
representations:
- SMILES
folds: 5
train_seeds:
- 0
sample_seeds:
- 0
enum_factors:
- 30
[...]
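A small check along these lines could catch this kind of type change after a programmatic edit. It is a sketch under assumptions: `LIST_KEYS` and `non_list_keys` are hypothetical names, the key list is taken only from the snippet above, and the config is assumed to have already been loaded into a plain dict.

```python
# Keys that are written as lists in the config snippet above (assumed list).
LIST_KEYS = ("representations", "train_seeds", "sample_seeds", "enum_factors")


def non_list_keys(config, list_keys=LIST_KEYS):
    """Return the keys from `list_keys` that are present but are not lists."""
    return [k for k in list_keys if k in config and not isinstance(config[k], list)]


# The same settings with two values collapsed to scalars, then with lists restored.
broken = {"representations": "SMILES", "folds": 5, "train_seeds": [0],
          "sample_seeds": [0], "enum_factors": 30}
fixed = {"representations": ["SMILES"], "folds": 5, "train_seeds": [0],
         "sample_seeds": [0], "enum_factors": [30]}

print(non_list_keys(broken))  # ['representations', 'enum_factors']
print(non_list_keys(fixed))   # []
```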
However, running only the Snakefile_data part of the pipeline is still giving a MissingInputException:
(/Genomics/skinniderlab/PED-generation/env-clm) [ms0270@argo-beta CLM]$ snakemake --snakefile workflow/Snakefile_data --configfile /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/config.yaml --jobs 10 &
[1] 851724
(/Genomics/skinniderlab/PED-generation/env-clm) [ms0270@argo-beta CLM]$ Building DAG of jobs...
MissingInputException in rule data in file /Genomics/argo/users/ms0270/git/CLM/workflow/Snakefile_data, line 34:
Missing input files for rule data:
affected files:
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_2_CV_ranks_structure.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_0_CV_ranks_structure.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_freq-avg_CV_tc.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_freq-sum_CV_ranks_structure.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min3_all_fp10k_CV_ranks_structure.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min4_all_freq-avg_CV_ranks_structure.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min2_all_fp10k_CV_ranks_structure.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_2_CV_ranks_formula.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_fp10k_CV_tc.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min2_all_freq-sum_CV_ranks_structure.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_freq-avg_CV_ranks_structure.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min2_all_freq-avg_CV_tc.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_freq-sum_CV_tc.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_fp10k_CV_ranks_structure.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min3_all_fp10k_CV_tc.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min3_all_freq-sum_CV_tc.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_4_CV_ranks_structure.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_4_CV_tc.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min3_all_freq-avg_CV_ranks_structure.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min4_all_freq-avg_CV_tc.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_3_CV_ranks_structure.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min2_all_fp10k_CV_tc.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min4_all_fp10k_CV_ranks_structure.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_1_CV_ranks_formula.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min3_all_freq-sum_CV_ranks_structure.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min2_all_freq-avg_CV_ranks_structure.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_3_CV_ranks_formula.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min2_all_freq-sum_CV_tc.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min3_all_freq-avg_CV_tc.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_1_CV_ranks_structure.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min4_all_fp10k_CV_tc.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_1_CV_tc.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_0_CV_ranks_formula.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_2_CV_tc.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min4_all_freq-sum_CV_ranks_structure.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_4_CV_ranks_formula.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min4_all_freq-sum_CV_tc.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_3_CV_tc.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_0_CV_tc.csv.gz
I think the issue is the hardcoded paths in the Snakefile - if I replace some of these with the corresponding entries from the config file as captured in PATHS (cf. https://github.com/skinniderlab/CLM/blob/mas/config-filepaths/workflow/Snakefile_data) and run the exact same command, I see that the "missing input files" errors disappear:
(/Genomics/skinniderlab/PED-generation/env-clm) [ms0270@argo-beta CLM]$ snakemake --configfile /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/config.yaml --jobs 10 &
[1] 866904
(/Genomics/skinniderlab/PED-generation/env-clm) [ms0270@argo-beta CLM]$ Building DAG of jobs...
MissingInputException in rule plot_topk in file /Genomics/argo/users/ms0270/git/CLM/workflow/Snakefile, line 331:
Missing input files for rule plot_topk:
output: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/model_evaluation/plot/30/topk
wildcards: enum_factor=30
affected files:
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_freq-avg_CV_ranks_structure.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_freq-avg_CV_tc.csv.gz
Is there any reason not to replace the hardcoded paths with the values in the config file?
You're right - there's no reason we shouldn't be getting it from the config. PR #230 fixes this, and can be merged if the CI passes.
Oops @vineetbansal - I just reran the workflow and maybe shouldn't have merged the PR so quickly - I think there are a few more:
in Snakefile_data:
in Snakefile:
(/Genomics/skinniderlab/PED-generation/env-clm) [ms0270@argo-beta CLM]$ snakemake --configfile /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/config.yaml --jobs 10 &
[1] 1188570
(/Genomics/skinniderlab/PED-generation/env-clm) [ms0270@argo-beta CLM]$ Building DAG of jobs...
MissingInputException in rule plot_topk in file /Genomics/argo/users/ms0270/git/CLM/workflow/Snakefile, line 331:
Missing input files for rule plot_topk:
output: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/model_evaluation/plot/30/topk
wildcards: enum_factor=30
affected files:
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_freq-avg_CV_ranks_structure.csv.gz
/Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/structural_prior/structures_SMILES_min1_all_freq-avg_CV_tc.csv.gz
Maybe this is on me for trying to change the directory structure and filenames - is the idea that these should be fixed?
Yes - the ideal thing to do here would be to have all input/output file paths come from the config, so you could change them as you are doing in your case (as long as you preserve the same wildcards in the path patterns). Though we didn't anticipate users modifying that section at all, it should be fair game to do so.
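One way to check the "preserve the same wildcards" condition is to compare the {name} fields of the default and customized patterns. The sketch below uses Python's `string.Formatter` for the parsing; `wildcards` and `dropped_wildcards` are hypothetical helpers, and the approach assumes plain {name} wildcards without inline regex constraints.

```python
from string import Formatter


def wildcards(pattern):
    """Return the set of {name} fields that appear in a path pattern."""
    return {name for _, name, _, _ in Formatter().parse(pattern) if name}


def dropped_wildcards(default, custom):
    """Wildcards present in the default pattern but missing from the custom one."""
    return wildcards(default) - wildcards(custom)


# Default and customized values of preprocess_output, both quoted in this thread.
default = "{output_dir}/prior/raw/{dataset}.txt"
custom = "{output_dir}/raw/dataset.txt"

print(dropped_wildcards(default, custom))   # {'dataset'}
print(dropped_wildcards(default, default))  # set()
```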
I tried your config.yaml and it was giving me a clean --dry-run, which is why I thought I'd taken care of the rules that might be affected. Can you post your config.yaml again here (with the fixed dtypes)? I'll go through it again and make sure I don't miss any.
representations:
- SMILES
folds: 5
train_seeds:
- 0
sample_seeds:
- 0
enum_factors:
- 30
max_input_smiles: 0
model_params:
rnn_type: LSTM
embedding_size: 128
hidden_size: 1024
n_layers: 3
dropout: 0
batch_size: 64
learning_rate: 0.001
max_epochs: 999999
patience: 50000
log_every_steps: 100
log_every_epochs: 1
sample_mols: 10000000
metrics:
- freq-sum
- freq-avg
- fp10k
min_tc: 0
top_k: 30
err_ppm: 10
structural_prior_min_freq:
- 1
- 2
- 3
- 4
random_seed: 42
paths:
output_dir: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30
dataset: /Genomics/skinniderlab/food-clm/inputs/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1/structures.smi
pubchem_tsv_file: /Genomics/skinniderlab/food-clm/PubChem.tsv
preprocess_output: '{output_dir}/raw/dataset.txt'
train_file: '{output_dir}/inputs/train_{fold}.smi'
vocab_file: '{output_dir}/inputs/train_{fold}.vocabulary'
model_file: '{output_dir}/models/{fold}_{train_seed}_model.pt'
input_file: '{output_dir}/samples/{fold}_{train_seed}_{sample_seed}_samples.csv.gz'
train0_file: '{output_dir}/inputs/train0_{fold}.smi'
test0_file: '{output_dir}/inputs/test0_{fold}.smi'
sample_file: '{output_dir}/samples/{fold}_unique_masses.csv.gz'
carbon_file: '{output_dir}/inputs/train0_{fold}_carbon.csv.gz'
train_all_file: '{output_dir}/inputs/train_all.smi'
test_all_file: '{output_dir}/inputs/test_all.smi'
carbon_all_file: '{output_dir}/inputs/train_carbon_all.csv.gz'
cv_ranks_file: '{output_dir}/structural_prior/{fold}_CV_ranks_structure.csv.gz'
cv_tc_file: '{output_dir}/structural_prior/{fold}_CV_tc.csv.gz'
formula_ranks_file: '{output_dir}/structural_prior/{fold}_CV_ranks_formula.csv.gz'
process_tabulated_output: '{output_dir}/samples/processed_min{min_freq}_{metric}.csv.gz'
loss_file: '{output_dir}/models/{fold}_{train_seed}_loss.csv.gz'
tabulate_molecules_output: '{output_dir}/samples/{fold}_{train_seed}_{sample_seed}_samples_masses.csv.gz'
collect_tabulated_output: '{output_dir}/samples/{fold}_unique_masses.csv.gz'
overall_ranks_file: '{output_dir}/structural_prior/min{min_freq}_all_{metric}_CV_ranks_structure.csv.gz'
overall_tc_file: '{output_dir}/structural_prior/min{min_freq}_all_{metric}_CV_tc.csv.gz'
known_smiles_file: '{output_dir}/samples/known_{fold}_{train_seed}_{sample_seed}_samples_masses.csv.gz'
invalid_smiles_file: '{output_dir}/samples/invalid_{fold}_{train_seed}_{sample_seed}_samples_masses.csv.gz'
collect_known_smiles: '{output_dir}/samples/known_{fold}_unique_masses.csv.gz'
collect_invalid_smiles: '{output_dir}/samples/invalid_{fold}_unique_masses.csv.gz'
@skinnider - @anushka255 and I looked at this workflow and our conclusion here is that you can't really remove the wildcards (anything in {}) in the file paths, even though you can mess around with the rest of the path. This is because the rules expect to be able to determine the value of these wildcards, and removing them confuses the workflow (because it tries to use regular expressions under the hood to determine the values of the rest of the wildcards).
The wildcards that you've removed here are enum_factor, dataset, and repr. While we can remove support for dataset and repr if they're getting in the way (and we haven't tested the workflow with repr=SELFIES at all), I think we do need enum_factor in there.
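As a rough picture of the regular-expression matching described above, the sketch below turns a path pattern into a regex with one named group per wildcard and reads the values back from a concrete path. It is a simplified stand-in, not Snakemake's actual code: `pattern_to_regex` and `infer_wildcards` are hypothetical names, the example paths are made up, each wildcard is assumed to appear only once, and every wildcard is matched with `.+` (Snakemake's default wildcard constraint).

```python
import re


def pattern_to_regex(pattern):
    """Compile a '{name}' path pattern into a regex with one named group per wildcard."""
    # re.split with a capture group alternates: literal, name, literal, name, ...
    pieces = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, piece in enumerate(pieces):
        if i % 2 == 0:
            regex += re.escape(piece)       # literal path text
        else:
            regex += f"(?P<{piece}>.+)"     # wildcard value
    return re.compile(f"^{regex}$")


def infer_wildcards(pattern, path):
    """Return the wildcard values implied by `path`, or None if it does not match."""
    match = pattern_to_regex(pattern).match(path)
    return match.groupdict() if match else None


# With the wildcards in place, their values can be recovered from the path.
kept = infer_wildcards("out/{enum_factor}/prior/raw/{dataset}.txt",
                       "out/30/prior/raw/FooDB.txt")
print(kept)  # {'enum_factor': '30', 'dataset': 'FooDB'}

# With the wildcards removed, the path still matches but carries no values,
# so a rule that needs enum_factor has nothing to read it from.
removed = infer_wildcards("out/prior/raw/my-dataset-name.txt",
                          "out/prior/raw/my-dataset-name.txt")
print(removed)  # {}
```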
@skinnider - your point about tweaking Snakefile and Snakefile_data still stands, in addition to the observation above. A PR on that is coming soon.
Got it. I am still not really sure I understand what's going on here... e.g., why does the preprocess_output path require just one wildcard, dataset (preprocess_output: '{output_dir}/prior/raw/{dataset}.txt')? But in any case I restored all of the original filepaths and was able to get a test job running via slurm without errors.
That's because the preprocess step only involves cleanup of the input dataset (no augmentation just yet), and is independent of the enum_factor, and all subsequent steps can work with the identical input.
I guess to be more clear, my question was: when I rewrite that line as something like this...
preprocess_output: '{output_dir}/prior/raw/my-dataset-name.txt'
... I still get an error related to the missing {dataset} wildcard. Where in the code is it specified that this rule needs access to {dataset}?
You're probably talking about the presence of (or the downstream effects of):
rule create_training_sets:
...
input:
"{output_dir}/prior/raw/{dataset}.txt"
...
where the input needs to match the preprocess_output value in the config, so the latter cannot be changed independently. This should be fixed now that PR #231 has been merged, so it should work on the master branch, even with your modifications.
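The coupling can be pictured with a toy model in which a rule's input is only satisfiable if some rule declares exactly the same pattern as an output. This is a deliberately simplified sketch, not how Snakemake resolves its DAG: `missing_inputs` is a hypothetical helper, patterns are compared by plain string equality, and the rules are reduced to just the one input and output discussed here.

```python
def missing_inputs(rules):
    """Return {rule: [input patterns]} for inputs that no rule lists as an output."""
    produced = {p for rule in rules.values() for p in rule["output"]}
    missing = {}
    for name, rule in rules.items():
        unmatched = [p for p in rule["input"] if p not in produced]
        if unmatched:
            missing[name] = unmatched
    return missing


# Customized value of preprocess_output, as in the question above.
config_paths = {"preprocess_output": "{output_dir}/prior/raw/my-dataset-name.txt"}

# Input hardcoded in the rule: it no longer matches the customized output.
hardcoded = {
    "preprocess": {"input": [], "output": [config_paths["preprocess_output"]]},
    "create_training_sets": {"input": ["{output_dir}/prior/raw/{dataset}.txt"],
                             "output": []},
}

# Input read from the same config entry: producer and consumer always agree.
from_config = {
    "preprocess": {"input": [], "output": [config_paths["preprocess_output"]]},
    "create_training_sets": {"input": [config_paths["preprocess_output"]],
                             "output": []},
}

print(missing_inputs(hardcoded))
# {'create_training_sets': ['{output_dir}/prior/raw/{dataset}.txt']}
print(missing_inputs(from_config))  # {}
```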
got it. Thanks for the clarification.
Trying to start testing out the workflow by submitting some real jobs, and getting the following error:
My config.yaml: config.yaml.zip
I don't see the 'affected files' in the YAML. Are these potentially hard-coded somewhere else? Or is there a field missing from the default config.yaml?