psteinb / sota_on_uncertainties

trying to obtain uncertainties from training accuracies using timm
BSD 3-Clause "New" or "Revised" License

data directory/DATAROOT unclear #3

Closed zyzzyxdonta closed 2 years ago

zyzzyxdonta commented 2 years ago

Here, a directory data is created and then never used. An explanation of how to extract the downloaded archive into this directory is missing. I guess you're just missing a cd data?

https://github.com/psteinb/sota_on_uncertainties/blob/e5c9fc7288fbdf011a099b5bfeba82a76f256b98/README.md?plain=1#L96

In the Snakefile, the DATAROOT is set to a directory with your username. This should be set to the directory where the archive was extracted, right?

https://github.com/psteinb/sota_on_uncertainties/blob/e5c9fc7288fbdf011a099b5bfeba82a76f256b98/workflow/Snakefile#L5
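One way to avoid a user-specific path is to resolve the data directory at startup. The following is only a minimal sketch, not the code from the Snakefile: the helper name resolve_dataroot and the use of an environment variable called DATAROOT as an override are assumptions made for illustration.

```python
import os
import sys
from pathlib import Path


def resolve_dataroot(default: str = "data") -> Path:
    """Return the data directory (hypothetical helper).

    Assumption: an environment variable named DATAROOT may override
    the repository-relative default directory.
    """
    return Path(os.environ.get("DATAROOT", default)).expanduser()


DATAROOT = resolve_dataroot()
# Diagnostics go to stderr so that stdout stays machine-readable.
print(f"Using {DATAROOT.absolute()} as DATAROOT", file=sys.stderr)
```

With a default like this, a fresh clone would work without editing the path, and a different location could still be chosen per machine.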

zyzzyxdonta commented 2 years ago

Also, why am I even doing the extraction by hand? There is a rule for that in the Snakefile.

psteinb commented 2 years ago

Should be fixed with decda67. Please check and close if needed.

zyzzyxdonta commented 2 years ago

Something still is not quite right 🤔

I successfully ran the default target now (only locally) by running snakemake -j1. This executes 544 steps and finishes with

[Fri Apr 29 14:48:35 2022]
Finished job 0.
544 of 544 steps (100%) done

However, right after that, Snakemake warns me about the following:

The code used to generate one or several output files has changed:
    To inspect which output files have changes, run 'snakemake --list-code-changes'.
    To trigger a re-run, use 'snakemake -R $(snakemake --list-code-changes)'.

snakemake --list-code-changes lists 180 or so files:

results/resnext50/seed42/fold-04/after80/best_metrics.csv
results/resnext50/seed42/fold-08/after80/best_metrics.csv
results/resnext50/seed42/fold-11/after80/best_metrics.csv
results/resnext50/seed42/fold-12/after80/best_metrics.csv
results/resnext50/seed42/fold-14/after80/best_metrics.csv
results/resnext50/seed42/fold-16/after80/best_metrics.csv
results/resnext50/seed42/fold-17/after80/best_metrics.csv
results/resnet50/seed42/fold-05/after80/best_metrics.csv
results/resnet50/seed42/fold-12/after80/best_metrics.csv
...

I'm not really sure what this means. Git doesn't detect any changes in the files. Is this based on the timestamps? I.e. the files were touched/rewritten and have the same contents as before?

If I do what snakemake tells me and run snakemake -R $(snakemake --list-code-changes) it complains

Error: you need to specify the maximum number of CPU cores to be used at the same time. If you want to use N cores, say --cores N or -cN. For all cores on your system (be sure that this is appropriate) use --cores all. For no parallelization use --cores 1 or -c1.

which is a bit unfortunate. So I added the -j1 I had before. Now I get this (weird) message:

MissingRuleException:
No rule to produce as (if you use input functions make sure that they don't raise unexpected exceptions).

Adding the --verbose flag shows that it has something to do with DATAROOT:

Full Traceback (most recent call last):
  File "/home/pape58/Code/sota_on_uncertainties/venv/lib64/python3.10/site-packages/snakemake/__init__.py", line 722, in snakemake
    success = workflow.execute(
  File "/home/pape58/Code/sota_on_uncertainties/venv/lib64/python3.10/site-packages/snakemake/workflow.py", line 795, in execute
    dag.init()
  File "/home/pape58/Code/sota_on_uncertainties/venv/lib64/python3.10/site-packages/snakemake/dag.py", line 184, in init
    self.file2jobs(file),
  File "/home/pape58/Code/sota_on_uncertainties/venv/lib64/python3.10/site-packages/snakemake/dag.py", line 1762, in file2jobs
    raise MissingRuleException(targetfile)
snakemake.exceptions.MissingRuleException: No rule to produce DATAROOT (if you use input functions make sure that they don't raise unexpected exceptions).

MissingRuleException:
No rule to produce DATAROOT (if you use input functions make sure that they don't raise unexpected exceptions).
zyzzyxdonta commented 2 years ago

I sent the comment and immediately realized what the problem with my last step is. You write to stdout:

Using /home/pape58/Code/sota_on_uncertainties/data as DATAROOT

This is also printed by snakemake --list-code-changes. If you remove the print or print to stderr instead, the processing continues.
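The effect can be reproduced without Snakemake. The sketch below uses a hypothetical child script that stands in for any tool whose stdout is consumed via $(...): when the diagnostic line goes to stdout, its words end up among the captured file names; when it goes to stderr, only the file names remain. The paths are made up for the demonstration.

```python
import subprocess
import sys

# Hypothetical stand-in for a tool that lists files but also prints a diagnostic.
CHILD = r'''
import sys
stream = sys.stderr if sys.argv[1] == "stderr" else sys.stdout
print("Using /tmp/data as DATAROOT", file=stream)
print("results/a.csv")
print("results/b.csv")
'''


def captured_tokens(diagnostic_stream: str) -> list:
    """Return the child's stdout split on whitespace, like a shell splits $(...)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD, diagnostic_stream],
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.split()


polluted = captured_tokens("stdout")  # diagnostic words are mixed into the list
clean = captured_tokens("stderr")     # only the file names are captured
print(polluted)
print(clean)
```

In the polluted case the words "as" and "DATAROOT" become separate tokens, which matches the targets named in the MissingRuleException messages above.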

However, then comes another problem:

[Fri Apr 29 15:27:05 2022]
rule create_tables:
    input: data/imagenette2-320-all/folder.ready
    output: data/imagenette2-320-splits/fold-00.table, data/imagenette2-320-splits/fold-01.table, data/imagenette2-320-splits/fold-02.table, data/imagenette2-320-splits/fold-03.table, data/imagenette2-320-splits/fold-04.table, data/imagenette2-320-splits/fold-05.table, data/imagenette2-320-splits/fold-06.table, data/imagenette2-320-splits/fold-07.table, data/imagenette2-320-splits/fold-08.table, data/imagenette2-320-splits/fold-09.table, data/imagenette2-320-splits/fold-10.table, data/imagenette2-320-splits/fold-11.table, data/imagenette2-320-splits/fold-12.table, data/imagenette2-320-splits/fold-13.table, data/imagenette2-320-splits/fold-14.table, data/imagenette2-320-splits/fold-15.table, data/imagenette2-320-splits/fold-16.table, data/imagenette2-320-splits/fold-17.table, data/imagenette2-320-splits/fold-18.table, data/imagenette2-320-splits/fold-19.table
    jobid: 8
    wildcards: dataset=imagenette2-320
    resources: tmpdir=/tmp

Traceback (most recent call last):
  File "/home/pape58/Code/sota_on_uncertainties/.snakemake/scripts/tmp51f0wypn.kfold.py", line 87, in <module>
    value = main(snakemake.input[0], opath.parent)
  File "/home/pape58/Code/sota_on_uncertainties/.snakemake/scripts/tmp51f0wypn.kfold.py", line 70, in main
    tablefiles = write_tables(inpath, outputdir, kfolds, seed)
  File "/home/pape58/Code/sota_on_uncertainties/.snakemake/scripts/tmp51f0wypn.kfold.py", line 37, in write_tables
    for train_index, test_index in skf.split(X, y):
  File "/home/pape58/Code/sota_on_uncertainties/venv/lib64/python3.10/site-packages/sklearn/model_selection/_split.py", line 747, in split
    y = check_array(y, ensure_2d=False, dtype=None)
  File "/home/pape58/Code/sota_on_uncertainties/venv/lib64/python3.10/site-packages/sklearn/utils/validation.py", line 805, in check_array
    raise ValueError(
ValueError: Found array with 0 sample(s) (shape=(0,)) while a minimum of 1 is required.
[Fri Apr 29 15:27:06 2022]
Error in rule create_tables:
    jobid: 8
    output: data/imagenette2-320-splits/fold-00.table, data/imagenette2-320-splits/fold-01.table, data/imagenette2-320-splits/fold-02.table, data/imagenette2-320-splits/fold-03.table, data/imagenette2-320-splits/fold-04.table, data/imagenette2-320-splits/fold-05.table, data/imagenette2-320-splits/fold-06.table, data/imagenette2-320-splits/fold-07.table, data/imagenette2-320-splits/fold-08.table, data/imagenette2-320-splits/fold-09.table, data/imagenette2-320-splits/fold-10.table, data/imagenette2-320-splits/fold-11.table, data/imagenette2-320-splits/fold-12.table, data/imagenette2-320-splits/fold-13.table, data/imagenette2-320-splits/fold-14.table, data/imagenette2-320-splits/fold-15.table, data/imagenette2-320-splits/fold-16.table, data/imagenette2-320-splits/fold-17.table, data/imagenette2-320-splits/fold-18.table, data/imagenette2-320-splits/fold-19.table

RuleException:
CalledProcessError in line 264 of /home/pape58/Code/sota_on_uncertainties/workflow/Snakefile:
Command 'set -euo pipefail;  /home/pape58/Code/sota_on_uncertainties/venv/bin/python /home/pape58/Code/sota_on_uncertainties/.snakemake/scripts/tmp51f0wypn.kfold.py' returned non-zero exit status 1.
  File "/home/pape58/Code/sota_on_uncertainties/workflow/Snakefile", line 264, in __rule_create_tables
  File "/usr/lib64/python3.10/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

(I'm using Python 3.10 and scikit-learn==1.0.2, if it matters. I also had to change the torch versions to torch==1.11.0+cpu torchvision==0.12.0+cpu torchaudio==0.11.0+cpu to make them work with Python 3.10.)
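The ValueError says that the label array handed to the splitter was empty. One plausible reading, which is an assumption and not confirmed in this thread, is that no image files were found below the input directory. A guard like the following sketch would fail earlier with a clearer message; collect_samples and the suffix list are hypothetical and not taken from kfold.py.

```python
from pathlib import Path


def collect_samples(root: Path, suffixes=(".JPEG", ".jpeg", ".jpg", ".png")):
    """Return (path, label) pairs, using the parent folder name as the label.

    Hypothetical helper: raises early if no image files are found, instead
    of passing an empty label array on to the k-fold splitter.
    """
    samples = [
        (p, p.parent.name)
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.suffix in suffixes
    ]
    if not samples:
        raise FileNotFoundError(
            f"no image files found below {root}; was the archive extracted?"
        )
    return samples
```

Called on an empty or missing directory, this raises a FileNotFoundError that names the directory, which is easier to act on than the scikit-learn message.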

psteinb commented 2 years ago

This (and some other things) has been fixed with f546219. If you find the time, please check again and close the issue if possible.

zyzzyxdonta commented 2 years ago

Great! The re-run as suggested by Snakemake works now! 😄 However, you still need to write the DATAROOT stuff to stderr, as it is also printed by snakemake --list-code-changes:

from sys import stderr
print(f"Using {DATAROOT.absolute()} as DATAROOT", file=stderr)

Here:

https://github.com/psteinb/sota_on_uncertainties/blob/f54621920e89bbe81cfb5814751c056580c87ab4/workflow/Snakefile#L6

And another thing: Would it be feasible to run the imagenette2_unpack job before the training starts? That way, users wouldn't need to unpack it manually. The job exists so it might as well be used 🤷🏻‍♂️

psteinb commented 2 years ago

Rewiring the print to stderr came with 2746449.

psteinb commented 2 years ago

Related to:

And another thing: Would it be feasible to run the imagenette2_unpack job before the training starts? That way, users wouldn't need to unpack it manually. The job exists so it might as well be used

I just pushed a small fix with fce3269. Could you please check?

zyzzyxdonta commented 2 years ago

Something is still not right 🙈

[Wed May  4 09:10:10 2022]
rule imagenette_unpack:
    input: /home/pape58/Code/sota_on_uncertainties/data/imagenette2-320.tgz
    output: data/imagenette2-320/train, data/imagenette2-320/val, data/imagenette2-320/train/n01440764, data/imagenette2-320/train/n02102040, data/imagenette2-320/train/n02979186, data/imagenette2-320/train/n03000684, data/imagenette2-320/train/n03028079, data/imagenette2-320/train/n03394916, data/imagenette2-320/train/n03417042, data/imagenette2-320/train/n03425413, data/imagenette2-320/train/n03445777, data/imagenette2-320/train/n03888257, data/imagenette2-320/val/n01440764, data/imagenette2-320/val/n02102040, data/imagenette2-320/val/n02979186, data/imagenette2-320/val/n03000684, data/imagenette2-320/val/n03028079, data/imagenette2-320/val/n03394916, data/imagenette2-320/val/n03417042, data/imagenette2-320/val/n03425413, data/imagenette2-320/val/n03445777, data/imagenette2-320/val/n03888257
    jobid: 0
    resources: mem_mb=30000, disk_mb=1000, tmpdir=/tmp, cpus=4, time_min=75, ngpu=1

cd data && tar xf /home/pape58/Code/sota_on_uncertainties/data/imagenette2-320.tgz
ImproperOutputException in line 241 of /home/pape58/Code/sota_on_uncertainties/workflow/Snakefile:
Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). for rule imagenette_unpack:
    output: data/imagenette2-320/train, data/imagenette2-320/val, data/imagenette2-320/train/n01440764, data/imagenette2-320/train/n02102040, data/imagenette2-320/train/n02979186, data/imagenette2-320/train/n03000684, data/imagenette2-320/train/n03028079, data/imagenette2-320/train/n03394916, data/imagenette2-320/train/n03417042, data/imagenette2-320/train/n03425413, data/imagenette2-320/train/n03445777, data/imagenette2-320/train/n03888257, data/imagenette2-320/val/n01440764, data/imagenette2-320/val/n02102040, data/imagenette2-320/val/n02979186, data/imagenette2-320/val/n03000684, data/imagenette2-320/val/n03028079, data/imagenette2-320/val/n03394916, data/imagenette2-320/val/n03417042, data/imagenette2-320/val/n03425413, data/imagenette2-320/val/n03445777, data/imagenette2-320/val/n03888257
    affected files:
        data/imagenette2-320/train/n01440764
Removing output files of failed job imagenette_unpack since they might be corrupted:
data/imagenette2-320/train, data/imagenette2-320/val, data/imagenette2-320/train/n01440764, data/imagenette2-320/train/n02102040, data/imagenette2-320/train/n02979186, data/imagenette2-320/train/n03000684, data/imagenette2-320/train/n03028079, data/imagenette2-320/train/n03394916, data/imagenette2-320/train/n03417042, data/imagenette2-320/train/n03425413, data/imagenette2-320/train/n03445777, data/imagenette2-320/train/n03888257, data/imagenette2-320/val/n01440764, data/imagenette2-320/val/n02102040, data/imagenette2-320/val/n02979186, data/imagenette2-320/val/n03000684, data/imagenette2-320/val/n03028079, data/imagenette2-320/val/n03394916, data/imagenette2-320/val/n03417042, data/imagenette2-320/val/n03425413, data/imagenette2-320/val/n03445777, data/imagenette2-320/val/n03888257
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
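The error message itself names one remedy (flagging directory outputs with directory()). Another pattern, which the create_tables input data/imagenette2-320-all/folder.ready shown earlier suggests, is to let a step produce a regular marker file once a directory is complete. The following is only a sketch of that idea in plain Python; unpack_with_marker is a hypothetical helper and not the rule from the Snakefile.

```python
import tarfile
from pathlib import Path


def unpack_with_marker(archive: Path, dest: Path, marker_name: str = "folder.ready") -> Path:
    """Extract archive into dest, then create a marker file (hypothetical helper).

    A downstream step can depend on the marker, which is a regular file,
    instead of on the extracted directories themselves.
    """
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions offer extraction filters; "data" is the safe one.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    marker = dest / marker_name
    marker.touch()  # written last, so it only exists after a complete extraction
    return marker
```

Because the marker is touched only after extraction has finished, an interrupted unpack would not leave a marker behind.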
psteinb commented 2 years ago

Just did this on a fresh pull from this repo after 544b64b:

snakemake -j10 --profile config/slurm/hemera imagenette_tables

This runs through without problems:

[Wed May  4 16:38:29 2022]
Finished job 0.
15 of 15 steps (100%) done
Complete log: .snakemake/log/2022-05-04T163707.716612.snakemake.log
zyzzyxdonta commented 2 years ago

With 544b64b4c7ef29d384f5c91e07e24fdf99cd902e, the target imagenette_tables works for me now, too! However, I'm now experiencing the next issue 🙈

[Thu May  5 09:47:55 2022]
rule imagenette2_resnet50_default:
    input: data/imagenette2-320-splits/fold-19
    output: outputs/resnet50/seed1331/fold-19, outputs/resnet50/seed1331/fold-19/after80/last.pth.tar, outputs/resnet50/seed1331/fold-19/after80/model_best.pth.tar
    log: outputs/resnet50/seed1331/fold-19.log
    jobid: 0
    wildcards: seedval=1331, foldstem=fold-19
    resources: mem_mb=30000, disk_mb=1000, tmpdir=/tmp, cpus=4, time_min=75, ngpu=1, partition=gpu

time python timm-0.5.4-train.py data/imagenette2-320-splits/fold-19 --seed 1331 --model resnet50 --num-classes=10 --output outputs/resnet50/seed1331/fold-19 --checkpoint-hist 2 --epochs 80 --experiment after80 > outputs/resnet50/seed1331/fold-19.log 2>&1
[Thu May  5 09:48:07 2022]
Error in rule imagenette2_resnet50_default:
    jobid: 0
    output: outputs/resnet50/seed1331/fold-19, outputs/resnet50/seed1331/fold-19/after80/last.pth.tar, outputs/resnet50/seed1331/fold-19/after80/model_best.pth.tar
    log: outputs/resnet50/seed1331/fold-19.log (check log file(s) for error message)
    shell:
        time python timm-0.5.4-train.py data/imagenette2-320-splits/fold-19 --seed 1331 --model resnet50 --num-classes=10 --output outputs/resnet50/seed1331/fold-19 --checkpoint-hist 2 --epochs 80 --experiment after80 > outputs/resnet50/seed1331/fold-19.log 2>&1
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

A lot more jobs report similar errors. Log attached: 2022-05-05T095800.214452.snakemake.log.txt

zyzzyxdonta commented 2 years ago

This seems to work now, after removing .snakemake.
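For anyone who wants to script that cleanup, here is a minimal sketch. The helper name clear_snakemake_state is hypothetical. Note that the .snakemake directory also holds the logs and temporary scripts seen in the paths above, so removing it discards them.

```python
import shutil
from pathlib import Path


def clear_snakemake_state(workdir: Path) -> bool:
    """Remove the .snakemake directory below workdir (hypothetical helper).

    Returns True if a directory was removed and False if there was nothing to do.
    """
    state_dir = workdir / ".snakemake"
    if state_dir.is_dir():
        shutil.rmtree(state_dir)
        return True
    return False
```

Calling clear_snakemake_state(Path(".")) from the repository root would mirror the manual removal described above.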