Open bushra1100 opened 6 years ago
Try the actual path of the file instead of `last`. Maybe the last model is broken; try another one.
@bilgan By the actual path, do you mean the weight files or the checkpoint file? For example, in the balloon.py case, "path/to/balloon_maskrcnn_weights_epoch16.h5" files are saved in logs. Should I try to load the last saved .h5 file or the "balloon-checkpoint.ipynb" file from the ".ipynb_checkpoints" folder?
The .h5 file, but not the last one.
OK. So if I trained for 16 epochs, then I have to start again from the .h5 file of the 15th, right?
From: Billy Ganzorig notifications@github.com Sent: Thursday, September 6, 2018 2:25:36 PM To: matterport/Mask_RCNN Cc: bushra1100; Author Subject: Re: [matterport/Mask_RCNN] error when trying to load last trained model (#907)
Yes, that's right.
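Picking "the .h5 file before the last one" can also be done programmatically. The following is only a minimal sketch: the helper name `pick_resume_checkpoint` is hypothetical (it is not part of Mask_RCNN), and it assumes the checkpoint file names carry zero-padded epoch numbers so that a plain sort orders them by epoch.

```python
import os

def pick_resume_checkpoint(log_dir, skip_last=1):
    """Return the newest .h5 file in log_dir after skipping the last `skip_last` files.

    Hypothetical helper: assumes zero-padded epoch numbers in the file names,
    so sorting the names alphabetically also sorts them by epoch.
    """
    checkpoints = sorted(f for f in os.listdir(log_dir) if f.endswith(".h5"))
    if len(checkpoints) <= skip_last:
        raise FileNotFoundError("not enough checkpoints in " + log_dir)
    # Index -1 is the last (possibly corrupt) file, so step back by skip_last.
    return os.path.join(log_dir, checkpoints[-1 - skip_last])
```

The returned path could then be passed as the value of `--weights`.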
I tried to load the second-last, and even the third- and fourth-last, weight files with this command:
python person.py train --dataset=E:/MASK/Mask_RCNN\samples\balloon\person_dataset\fused --weights=coco
--logs=E:\MASK\Mask_RCNN\logs\person20180906T2305\mask_rcnn_person_002.h5
but it's giving me this error:
FileNotFoundError: [WinError 3] The system cannot find the path specified: 'E:\MASK\Mask_RCNN\logs\person20180906T2305\mask_rcnn_person_002.h5\person20180907T0403'
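The path in the error looks like the value given to `--logs` with a new run folder appended to it, which suggests `--logs` is treated as a directory rather than as a weights file. A small sketch of that join, assuming this is how the path was built (not verified against person.py); `ntpath` is used so that Windows path rules apply on any OS:

```python
import ntpath  # Windows path rules, independent of the OS running this snippet

# Value that was passed to --logs in the command above (an .h5 file, not a folder).
logs_arg = r"E:\MASK\Mask_RCNN\logs\person20180906T2305\mask_rcnn_person_002.h5"
# Run folder name as it appears at the end of the path in the error message.
run_name = "person20180907T0403"

# Assumption: the training code creates a per-run folder under the --logs value.
run_dir = ntpath.join(logs_arg, run_name)
print(run_dir)
```

If that assumption holds, the .h5 path belongs in `--weights`, and `--logs` should either point to a folder or be left out.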
I have changed my command and now it is working, but it started from the very first epoch. Why? And what should I do to start from where I left off?
python person.py train --dataset=E:\MASK\Mask_RCNN\samples\balloon\person_dataset\fused --weights=E:\MASK\Mask_RCNN\logs\person20180906T2305\mask_rcnn_person_0006.h5
Loading weights E:\MASK\Mask_RCNN\logs\person20180906T2305\mask_rcnn_person_0006.h5
2018-09-10 01:36:11.781636: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
Training network heads
Starting at epoch 0. LR=0.001
Checkpoint Path: E:\MASK\Mask_RCNN\logs\person20180910T0136\mask_rcnn_person_{epoch:04d}.h5
Selecting layers to train
fpn_c5p5 (Conv2D)
fpn_c4p4 (Conv2D)
fpn_c3p3 (Conv2D)
fpn_c2p2 (Conv2D)
fpn_p5 (Conv2D)
fpn_p2 (Conv2D)
fpn_p3 (Conv2D)
fpn_p4 (Conv2D)
In model: rpn_model
rpn_conv_shared (Conv2D)
rpn_class_raw (Conv2D)
rpn_bbox_pred (Conv2D)
mrcnn_mask_conv1 (TimeDistributed)
mrcnn_mask_bn1 (TimeDistributed)
mrcnn_mask_conv2 (TimeDistributed)
mrcnn_mask_bn2 (TimeDistributed)
mrcnn_class_conv1 (TimeDistributed)
mrcnn_class_bn1 (TimeDistributed)
mrcnn_mask_conv3 (TimeDistributed)
mrcnn_mask_bn3 (TimeDistributed)
mrcnn_class_conv2 (TimeDistributed)
mrcnn_class_bn2 (TimeDistributed)
mrcnn_mask_conv4 (TimeDistributed)
mrcnn_mask_bn4 (TimeDistributed)
mrcnn_bbox_fc (TimeDistributed)
mrcnn_mask_deconv (TimeDistributed)
mrcnn_class_logits (TimeDistributed)
mrcnn_mask (TimeDistributed)
C:\Users\ETL\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\ops\gradients_impl.py:108: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Epoch 1/30
C:\Users\ETL\AppData\Local\Programs\Python\Python35\lib\site-packages\skimage\transform\_warps.py:110: UserWarning: Anti-aliasing will be enabled by default in skimage 0.15 to avoid aliasing artifacts when down-sampling images.
warn("Anti-aliasing will be enabled by default in skimage 0.15 to "
C:\Users\ETL\AppData\Local\Programs\Python\Python35\lib\site-packages\skimage\transform\_warps.py:110: UserWarning: Anti-aliasing will be enabled by default in skimage 0.15 to avoid aliasing artifacts when down-sampling images.
warn("Anti-aliasing will be enabled by default in skimage 0.15 to "
C:\Users\ETL\AppData\Local\Programs\Python\Python35\lib\site-packages\skimage\transform\_warps.py:110: UserWarning: Anti-aliasing will be enabled by default in skimage 0.15 to avoid aliasing artifacts when down-sampling images.
warn("Anti-aliasing will be enabled by default in skimage 0.15 to "
1/10 [==>...........................] - ETA: 12:17 - loss: 0.4553 - rpn_class_loss: 0.0019 - rpn_bbox_loss: 0.0221 - mrcnn_class_loss: 0.0336 - mrcnn_bbox_loss: 0.1120 - mrcnn_mask_loss: 0.2858
C:\Users\ETL\AppData\Local\Programs\Python\Python35\lib\site-packages\skimage\transform\_warps.py:110: UserWarning: Anti-aliasing will be enabled by default in skimage 0.15
This issue is solved: load the .h5 file by giving its exact path in the logs folder, as suggested by @bilgan, then go to model.py and set self.epoch to the desired epoch, i.e. the epoch where your training stopped.
For example: my training was interrupted at epoch 16, so I'll change self.epoch in model.py to self.epoch=15 (15 because the last file might be corrupt). Then I'll go to the logs folder in my Mask_RCNN directory and pass the path of the 15th-epoch file on the command line like this:
python person.py train --dataset=E:\MASK\Mask_RCNN\samples\balloon\person_dataset\fused --weights=E:\MASK\Mask_RCNN\logs\person20180906T2305\mask_rcnn_person_0015.h5
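As an alternative to editing self.epoch by hand, the epoch could be read from the checkpoint file name. The sketch below is hypothetical: the function `epoch_from_checkpoint` and its pattern are illustrative assumptions based on the paths shown in this thread, not the code in model.py. It accepts both `/` and `\` as separators, since a pattern that only allows `/` would not match the Windows paths above.

```python
import re

# Hypothetical pattern: "<name><YYYYMMDD>T<HHMM>" folder, then
# "mask_rcnn_<name>_<4-digit epoch>.h5", with either path separator.
CHECKPOINT_RE = re.compile(
    r"[/\\][\w-]+\d{8}T\d{4}[/\\]mask_rcnn_[\w-]+_(\d{4})\.h5$"
)

def epoch_from_checkpoint(path):
    """Return the epoch number encoded in a checkpoint path, or None if it does not match."""
    match = CHECKPOINT_RE.search(path)
    return int(match.group(1)) if match else None

print(epoch_from_checkpoint(
    r"E:\MASK\Mask_RCNN\logs\person20180906T2305\mask_rcnn_person_0015.h5"
))
```

A path whose epoch part is not four digits (such as the `mask_rcnn_person_002.h5` name used earlier) gives `None` with this pattern, so the caller would have to fall back to setting the epoch manually.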
error: Unrecognized arguments: --model=last
@bushra1100 That error is odd. Could it be that you changed the code on your side in such a way that removes the "last" option? Try it on coco.py or balloon.py to isolate the problem.
I am having an issue that makes my PC turn off during training. Following the tutorial, I am trying to load the last trained model to continue the training, but the following error occurred:
error: Unrecognized arguments: --model=last
It happens with both of these commands (I am working on a person dataset):
python person.py train --dataset=/path/to/dataset --model=last --weights=coco
python person.py train --dataset=/path/to/dataset --model=last
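An "unrecognized arguments" message normally comes from the script's argument parser: a flag that the script never declared is rejected before any training code runs. The sketch below reproduces this with `argparse`, assuming (this is not confirmed from person.py itself) that the script declares `--dataset` and `--weights` but no `--model`. The parser here is a hypothetical stand-in, not the real script.

```python
import argparse

# Hypothetical stand-in for the parser in person.py: only these flags are declared.
parser = argparse.ArgumentParser(description="stand-in for person.py")
parser.add_argument("command")
parser.add_argument("--dataset")
parser.add_argument("--weights")

# parse_known_args collects undeclared flags instead of exiting with an error,
# which makes it easy to see which arguments the parser does not recognize.
known, unknown = parser.parse_known_args(
    ["train", "--dataset=/path/to/dataset", "--model=last"]
)
print(unknown)
```

If person.py was adapted from balloon.py, the flag to try would be `--weights=last` rather than `--model=last`; checking the `add_argument` calls in the script shows which flags it really accepts.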