skinniderlab / CLM

MIT License

nn_tc_ever_v_never runs right away #232

Closed (skinnider closed 3 months ago)

skinnider commented 3 months ago

Working through a test run of the entire pipeline on a new dataset, and I noticed that immediately after create_training_sets finishes (i.e., before the CLMs themselves are even trained), rule nn_tc_ever_v_never is executed:

[Thu Jul 18 08:48:03 2024]
rule nn_tc_ever_v_never:
    input: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/inputs/test0_structures_SMILES_2.smi, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/inputs/train0_structures_SMILES_2.smi
    output: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/model_evaluation/30/structures_SMILES_2_nn_tc_ever_v_never.csv.gz
    jobid: 74
    reason: Missing output files: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/model_evaluation/30/structures_SMILES_2_nn_tc_ever_v_never.csv.gz
    wildcards: enum_factor=30, dataset=structures, repr=SMILES, fold=2
    resources: mem_mb=64000, mem_mib=61036, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, runtime=1000

Job 74 has been submitted with SLURM jobid 5631035 (log: .snakemake/slurm_logs/rule_nn_tc_ever_v_never/5631035.log).

The output of the .log file is as follows:

cat .snakemake/slurm_logs/rule_nn_tc_ever_v_never/5631035.log
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=64000, mem_mib=61036, disk_mb=1000, disk_mib=954
Conda environments: ignored
Select jobs to execute...
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=64000, mem_mib=61036, disk_mb=1000, disk_mib=954
Conda environments: ignored
Select jobs to execute...

[Thu Jul 18 08:48:14 2024]
rule nn_tc_ever_v_never:
    input: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/inputs/test0_structures_SMILES_2.smi, /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/30/prior/inputs/train0_structures_SMILES_2.smi
    output: /Genomics/skinniderlab/food-clm/clm/database=FooDB-representation=NA-metric=NA-optimizer=NA-rarefaction=1-enum_factor=30/model_evaluation/30/structures_SMILES_2_nn_tc_ever_v_never.csv.gz
    jobid: 0
    reason: Forced execution
    wildcards: enum_factor=30, dataset=structures, repr=SMILES, fold=2
    resources: mem_mb=64000, mem_mib=61036, disk_mb=1000, disk_mib=954, tmpdir=/tmp, runtime=1000

reading NP model ...
model in
(INFO) (__main__.py) (18-Jul-24 08:49:22) CLM vsrc

I think that this must reflect a missing input in the rule, because whether or not a SMILES has ever been generated cannot be determined until CLM training, sampling, and post-processing have all occurred.

anushka255 commented 3 months ago

nn_tc_ever_v_never is divided into two rules, and the first one just computes nn-tc between the train and test sets in a particular fold. This preliminary rule doesn't depend on any of the generated SMILES, which is why it runs right after create_training_sets, if I'm not mistaken.

The second rule in this step, plot_nn_tc_ever_v_ever, is where the intersection with rank_file comes into play. I believe it runs only after all the rules in Snakemake_data have finished running.
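
To illustrate the scheduling behavior described above, here is a minimal Python sketch of input-driven job ordering. It is a toy model, not Snakemake's actual scheduler and not the real CLM workflow: the rule names create_training_sets, nn_tc_ever_v_never, and plot_nn_tc_ever_v_ever come from this thread, while train_models, sample_molecules, and every file name are hypothetical placeholders. The point is only that a rule becomes runnable as soon as all of its declared inputs exist, regardless of where it sits conceptually in the pipeline.

```python
# Toy model of input-driven scheduling. Rule names partly mirror the thread;
# "train_models", "sample_molecules", and all file names are hypothetical.
RULES = {
    "create_training_sets": {
        "inputs": {"dataset.txt"},
        "outputs": {"train.smi", "test.smi"},
    },
    # Depends only on the train/test split, not on anything a model generates.
    "nn_tc_ever_v_never": {
        "inputs": {"train.smi", "test.smi"},
        "outputs": {"nn_tc.csv.gz"},
    },
    "train_models": {
        "inputs": {"train.smi"},
        "outputs": {"model.pt"},
    },
    "sample_molecules": {
        "inputs": {"model.pt"},
        "outputs": {"rank_file.csv"},
    },
    # Needs the generated molecules, so it has to wait for sampling.
    "plot_nn_tc_ever_v_ever": {
        "inputs": {"nn_tc.csv.gz", "rank_file.csv"},
        "outputs": {"plot.png"},
    },
}


def ready_rules(existing_files, done):
    """Return the rules whose inputs all exist and that have not run yet."""
    return sorted(
        name
        for name, rule in RULES.items()
        if name not in done and rule["inputs"] <= existing_files
    )


def simulate(initial_files):
    """Run rules in waves; each wave holds every rule that is ready at once."""
    files = set(initial_files)
    done = set()
    waves = []
    while True:
        wave = ready_rules(files, done)
        if not wave:
            break
        waves.append(wave)
        for name in wave:
            files |= RULES[name]["outputs"]
            done.add(name)
    return waves


waves = simulate({"dataset.txt"})
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

In this sketch nn_tc_ever_v_never lands in the same wave as model training, directly after create_training_sets, while the plotting rule is held back until the hypothetical rank_file.csv exists.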

skinnider commented 3 months ago

Got it, thanks!