paninski-lab / deepgraphpose

DeepGraphPose

DGP stuck selecting training frames ['Starting with standard pose-dataset loader.'] #8

Closed: obarnstedt closed this issue 3 years ago

obarnstedt commented 3 years ago

Hi, after successfully running DGP on the supplied demo data set, I am now trying to run it on my already trained DLC network, which is fairly large (23 videos, ~1,000 training frames total). Possibly because of this size, the code seems to get stuck (silently) at the point of training-frame selection; the last thing I see is 'Starting with standard pose-dataset loader'. Before that point, I run run_dgp_demo.py with the DLC snapshot (i.e. running DGP with labeled frames only), and it successfully initializes the ResNet for all videos.

After watching it hang at 'Starting with standard pose-dataset loader' for ~48 hours, I started debugging and found that the dataset target counter (dgp/dataset.py, l. 607) is stuck at counter=59 out of nt=113 frames. Specifically, the code skips more and more frames because they either belong to another video or have already been processed (ll. 627 and 632), until eventually it does nothing but skip.

Without rewriting the underlying structure, is there a good work-around, perhaps a more straightforward way to select the training frames? Thanks, Oliver
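
Edit: to illustrate the behaviour I am seeing, here is a rough sketch of the selection loop as I understand it (simplified, with made-up names; this is not the actual dgp/dataset.py code). The counter only advances when a new, usable frame is found, so if the loop keeps skipping it never reaches nt:

```python
import random

def select_training_frames(candidate_idxs, nt, seed=0):
    """Toy version of the frame-selection loop: pick frames until nt
    distinct ones have been collected, skipping anything already used."""
    rng = random.Random(seed)
    frame_idxs = []                  # frames selected so far
    counter = 0                      # analogue of the counter at l. 607
    while counter < nt:
        frame_idx = rng.choice(candidate_idxs)
        if frame_idx in frame_idxs:  # analogue of the skips at ll. 627/632
            continue                 # counter does not advance
        frame_idxs.append(frame_idx)
        counter += 1
    return frame_idxs

# Terminates when there are >= nt unique candidates:
select_training_frames(list(range(113)), nt=113)
# But if the pool has fewer than nt unique entries, the loop can only
# skip and never finishes, which matches the hang I am seeing.
```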

waq1129 commented 3 years ago

Hi Oliver,

Sorry about the bug in the code.

  1. What happens if you change line 121 in that function from if self.curr_img == 0 and self.shuffle: to if self.curr_img == 0:  # and self.shuffle: (see the diff below)?

Basically, this turns off the shuffle when selecting the next training sample. In that case, you won't run into if frame_idx in frame_idxs: continue.
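
To make the suggested edit concrete, the change to line 121 looks like this:

```diff
- if self.curr_img == 0 and self.shuffle:
+ if self.curr_img == 0:  # and self.shuffle:
```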

On your other question: this while loop should quit once the counter reaches nt=113 in your case. There is no inner loop, so technically it should not get stuck in the while loop forever.

Best, Anqi

obarnstedt commented 3 years ago

Hi Anqi, thanks for the prompt reply! I have in the meantime realised that the problem lies with the idxs obtained from the DLC mat file. Because DLC training had previously been run twice, many idxs were written into the mat file twice, so idxs (obtained from get_frame_idxs_from_train_mat in dataset.py) contained duplicates. I could remove those duplicate indices automatically by adding idxs = list(dict.fromkeys(idxs)) before return np.sort(idxs) on line 285 of dataset.py.

I will try your suggestion after training, but I don't want to interrupt the training that is now running :) Thanks! Oliver
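
P.S. For reference, the de-duplication boils down to the following (a sketch only; the actual change is the one-liner inside get_frame_idxs_from_train_mat, and the helper name here is just for illustration):

```python
import numpy as np

def dedup_and_sort_idxs(idxs):
    """Drop duplicate frame indices (keeping first occurrence) before
    sorting, mirroring the one-line fix added before return np.sort(idxs)."""
    idxs = list(dict.fromkeys(idxs))   # removes the indices DLC wrote twice
    return np.sort(idxs)

# Example: indices duplicated after running DLC training twice
dedup_and_sort_idxs([3, 7, 3, 12, 7])  # -> array([ 3,  7, 12])
```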