jpsember / java-ml

Java classes for machine learning

Race condition with missing files #61

Open jpsember opened 2 years ago

jpsember commented 2 years ago

Epoch 6383   Train Loss: 0.209 (0.139)
Epoch 6384   Train Loss: 0.200 (0.128)
LogProcessor caught exception: FileException, File '/home/eio/js_dep/ml/example_yolo/train_data/set_942' does not exist;
 .. FileException.withCause:40
 .. Files.asFileException:543
 .. Files.readString:175
 .. LogProcessor.parseLogItem:117
 .. LogProcessor.auxRun:80
 .. LogProcessor.run:67
 .. Thread.run:748
 ...caused by...
FileNotFoundException, File '/home/eio/js_dep/ml/example_yolo/train_data/set_942' does not exist;
 .. FileUtils.openInputStream:297
 .. FileUtils.readFileToString:1805
 .. Files.readString:173
 .. LogProcessor.parseLogItem:117
 .. LogProcessor.auxRun:80
 .. LogProcessor.run:67
 .. Thread.run:748
Saving checkpoint: checkpoints/006387.pt
...quitting training session, reason: Stop signal received

jpsember commented 2 years ago

Strange... it is trying to parse a file that doesn't have a .json extension?

jpsember commented 2 years ago

Maybe another thread is deleting the directory before the log thread has a chance to process all the log files within it?
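
If that is what is happening, one possible mitigation is to make the log thread tolerate entries that vanish between listing and reading, and to skip anything that is not a regular `.json` file. The following is only a minimal sketch under those assumptions: `TolerantLogReader`, `readIfPresent` and `readLogItems` are hypothetical names, not part of java-ml, and it uses the JDK's `java.nio.file.Files` rather than the project's own `Files` class.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of java-ml.
final class TolerantLogReader {

  private TolerantLogReader() {}

  // Returns the file's text, or an empty Optional if the file was
  // deleted (e.g. by another thread) before it could be read.
  static Optional<String> readIfPresent(Path file) throws IOException {
    try {
      byte[] bytes = Files.readAllBytes(file);
      return Optional.of(new String(bytes, StandardCharsets.UTF_8));
    } catch (NoSuchFileException e) {
      // Lost the race with the deleting thread; treat as "nothing to parse".
      return Optional.empty();
    }
  }

  // Reads every regular *.json file currently in logDir, silently skipping
  // directories, other extensions, and files that disappear mid-scan.
  static List<String> readLogItems(Path logDir) throws IOException {
    List<String> items = new ArrayList<>();
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(logDir, "*.json")) {
      for (Path entry : entries) {
        if (!Files.isRegularFile(entry)) {
          continue; // a directory that happens to be named like a log file
        }
        readIfPresent(entry).ifPresent(items::add);
      }
    }
    return items;
  }
}
```

Catching `NoSuchFileException` at the read, instead of checking `exists()` first, avoids reintroducing the same race: the file can still disappear between the check and the read. This only hides the symptom in the log thread; if the deleting thread is removing data that still needs processing, the two threads would need to coordinate instead.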

jpsember commented 2 years ago

Happened again:

Epoch: 3089  Loss:  6.04  Loss_class:  0.00  Loss_obj_f:  0.01  Loss_obj_t:  0.08  Loss_wh:  1.19  Loss_xy:  0.00
Epoch: 3090  Loss:  6.04  Loss_class:  0.00  Loss_obj_f:  0.01  Loss_obj_t:  0.08  Loss_wh:  1.19  Loss_xy:  0.00
Epoch: 3091  Loss:  6.04  Loss_class:  0.00  Loss_obj_f:  0.01  Loss_obj_t:  0.08  Loss_wh:  1.19  Loss_xy:  0.00
*** failed to parseLogItem, file:
    { "1 status" : "MISSING",
        "2 name" : "set_1029",
      "3 parent" : "/home/eio/js_dep/ml/example_yolo/train_data",
         "4 abs" : "/home/eio/js_dep/ml/example_yolo/train_data/set_1029"
    }
logDir:
    { "1 status" : "DIRECTORY",
         "2 rel" : "train_data",
        "3 cdir" : "/home/eio/js_dep/ml/example_yolo",
         "4 abs" : "/home/eio/js_dep/ml/example_yolo/train_data"
    }
exception: FileException, File '/home/eio/js_dep/ml/example_yolo/train_data/set_1029' does not exist;
 .. FileException.withCause:40
 .. Files.asFileException:543
 .. Files.readString:175
 .. LogProcessor.parseLogItem:139
 .. LogProcessor.auxRun:94
 .. LogProcessor.run:65
 .. Thread.run:748
 ...caused by...
FileNotFoundException, File '/home/eio/js_dep/ml/example_yolo/train_data/set_1029' does not exist;
 .. FileUtils.openInputStream:297
 .. FileUtils.readFileToString:1805
 .. Files.readString:173
 .. LogProcessor.parseLogItem:139
 .. LogProcessor.auxRun:94
 .. LogProcessor.run:65
 .. Thread.run:748
Epoch: 3092  Loss:  6.03  Loss_class:  0.00  Loss_obj_f:  0.01  Loss_obj_t:  0.08  Loss_wh:  1.19  Loss_xy:  0.00
Epoch: 3093  Loss:  6.03  Loss_class:  0.00  Loss_obj_f:  0.01  Loss_obj_t:  0.08  Loss_wh:  1.19  Loss_xy:  0.00
Epoch: 3094  Loss:  6.03  Loss_class:  0.00  Loss_obj_f:  0.01  Loss_obj_t:  0.08  Loss_wh:  1.19  Loss_xy:  0.00
Epoch: 3095  Loss:  6.03  Loss_class:  0.00  Loss_obj_f:  0.01  Loss_obj_t:  0.08  Loss_wh:  1.19  Loss_xy:  0.00
Epoch: 3096  Loss:  6.03  Loss_class:  0.00  Loss_obj_f:  0.01  Loss_obj_t:  0.08  Loss_wh:  1.19  Loss_xy:  0.00
Epoch: 3097  Loss:  6.03  Loss_class:  0.00  Loss_obj_f:  0.01  Loss_obj_t:  0.08  Loss_wh:  1.19  Loss_xy:  0.00