psteinb / sota_on_uncertainties

trying to obtain uncertainties from training accuracies using timm
BSD 3-Clause "New" or "Revised" License

Training fails with JSONDecodeError #7

Closed zyzzyxdonta closed 2 years ago

zyzzyxdonta commented 2 years ago

This happens with CUDA on the cluster when running the command snakemake -p --profile config/slurm/hemera imagenette2_train. Snakemake itself fails with an error: a bunch of jobs start, then this happens:

Traceback (most recent call last):
  File "/home/pape58/Code/sota_on_uncertainties/venv/lib/python3.9/site-packages/snakemake/__init__.py", line 722, in snakemake
    success = workflow.execute(
  File "/home/pape58/Code/sota_on_uncertainties/venv/lib/python3.9/site-packages/snakemake/workflow.py", line 1110, in execute
    success = self.scheduler.schedule()
  File "/home/pape58/Code/sota_on_uncertainties/venv/lib/python3.9/site-packages/snakemake/scheduler.py", line 421, in schedule
    self._finish_jobs()
  File "/home/pape58/Code/sota_on_uncertainties/venv/lib/python3.9/site-packages/snakemake/scheduler.py", line 524, in _finish_jobs
    self.get_executor(job).handle_job_success(job)
  File "/home/pape58/Code/sota_on_uncertainties/venv/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 875, in handle_job_success
    super().handle_job_success(
  File "/home/pape58/Code/sota_on_uncertainties/venv/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 232, in handle_job_success
    job.postprocess(
  File "/home/pape58/Code/sota_on_uncertainties/venv/lib/python3.9/site-packages/snakemake/jobs.py", line 1091, in postprocess
    self.dag.workflow.persistence.finished(
  File "/home/pape58/Code/sota_on_uncertainties/venv/lib/python3.9/site-packages/snakemake/persistence.py", line 239, in finished
    starttime = self._read_record(self._metadata_path, f).get(
  File "/home/pape58/Code/sota_on_uncertainties/venv/lib/python3.9/site-packages/snakemake/persistence.py", line 423, in _read_record_uncached
    return json.load(f)
  File "/trinity/shared/pkg/devel/python/3.9.6/lib/python3.9/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/trinity/shared/pkg/devel/python/3.9.6/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/trinity/shared/pkg/devel/python/3.9.6/lib/python3.9/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 1164 (char 1163)

The output right before that (though I'm not sure if it is related or if the results of some other job were collected at that point):

mkdir -p data/imagenette2-320-all/n01440764 && cp -vur data/imagenette2-320/val/n01440764/*JPEG data/imagenette2-320/train/n01440764/*JPEG data/imagenette2-320-all/n01440764
Submitted job 5 with external jobid 'Submitted batch job 4797615'.
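
For readers unfamiliar with the "Extra data" message in the traceback above: Python's json module raises it when a complete JSON value has been parsed and something other than whitespace still follows at the reported offset. The sketch below reproduces that with a made-up string. The helper name describe_json_error and the sample input are hypothetical and not taken from Snakemake. Why the metadata record ended up in that state is not established in this thread.

```python
import json


def describe_json_error(text):
    """Return (message, offset) if text fails to parse as JSON, else None.

    Hypothetical helper for illustration only; it is not part of Snakemake.
    """
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # err.pos is the 0-based character offset where parsing stopped.
        return err.msg, err.pos
    return None


# Two JSON objects written back to back: the first one parses completely,
# and the decoder then complains about what comes after it.
print(describe_json_error('{"a": 1}{"b": 2}'))
```

Here the first object is 8 characters long, so the error is reported at offset 8, in the same way the traceback reports char 1163 for the metadata record.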

psteinb commented 2 years ago

This is something totally unexpected.

psteinb commented 2 years ago

This has now happened to me too. However, once I removed the .snakemake folder in the repo root ... the problem was gone again.
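
Before deleting the whole folder, it may help to see which records are actually unreadable. The following is a minimal sketch, assuming the per-job records live as JSON files under .snakemake/metadata (that path is an assumption based on the _metadata_path attribute in the traceback and is not confirmed in this thread). The function name find_unparseable_records is hypothetical. It only reports files and does not delete anything.

```python
import json
from pathlib import Path


def find_unparseable_records(metadata_dir):
    """Return (path, error message) pairs for files that are not valid JSON.

    Hypothetical helper; the directory layout is an assumption.
    """
    broken = []
    for path in sorted(Path(metadata_dir).rglob("*")):
        if not path.is_file():
            continue
        try:
            with path.open(encoding="utf-8") as handle:
                json.load(handle)
        except (json.JSONDecodeError, UnicodeDecodeError) as err:
            # Keep the message so the offset of the bad data is visible.
            broken.append((path, str(err)))
    return broken
```

Calling find_unparseable_records(".snakemake/metadata") from the repo root would list the suspect files, if any. Removing the .snakemake folder as described above remains the fix that was reported to work here.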

zyzzyxdonta commented 2 years ago

This seems to have solved it. Thank you 👍🏻