Cannot fine-tune with ImageNet

jlshin commented 3 years ago

I read through a few of the closed and open issues and I am observing an issue similar to #9.

Setup

I am working through the examples listed in the README with the ImageNet data (downloaded via the link provided). I set up the paths accordingly and changed nothing except the file paths inside the pickle files downloaded from the Google Drive (voc_inloc_roidb.pkl and voc_inloc_gt_roidb_.pkl). Apart from that, I am using the versions of few_shot_benchmark.py and the config file currently on the master branch.

Questions

What was the solution for issue #9?
Is the fine-tuning example intended to work with the settings currently in the /experiments/cfgs/resnet_v1_101_voc0712_trainval_fpn_dcn_oneshot_end2end_ohem_8.yaml?
Specifically, is balance_classes supposed to be set to false when fine-tuning with episodic data (it is set to true in the config)?

Issue

I am hitting the out-of-range index error that was observed in #9, and I am confused by the discussion on that thread. Here is what I tried to run:

 python fpn/few_shot_benchmark.py \
--test_name=RepMet_inloc \
--Nshot=1 --Nway=5 --Nquery_cat=10 --Nepisodes=500 \
--do_finetune=1 --num_finetune_epochs=5 --lr=5e-4

In particular, I am not sure I understand this comment from @jshtok (or whether it is relevant to the solution):

You mentioned (in the email version of this ticket) you have the setting cfg.dataset.NUM_CLASSES=127 but in the .yaml configuration file it is set dataset: NUM_CLASSES: 122 so the NUM_CLASSES should be 122. I don't expect this error to happen if the NUM_CLASSES is correct, please check if this is the case.

NUM_CLASSES is changed from 122 (the value in the YAML file) to 127 when add_reps_to_model is called, and new_cats_to_beginning is hard-coded to False. Unless these parameters are meant to be set to different values, it does not surprise me that NUM_CLASSES = 127 (122 + Nway) here.

https://github.com/jshtok/RepMet/blob/9bdc3f20ff08a8b3ce005af327aba6bf0bb71213/fpn/few_shot_benchmark.py#L661-L663
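To make the arithmetic concrete, here is a minimal sketch of the class count I would expect after the novel categories are appended. The function and constant names are hypothetical and this is not the code in few_shot_benchmark.py. It also assumes the new categories simply take the next free class indices, which I have not verified.

```python
# Hypothetical illustration only; none of these names come from the repo.
BASE_NUM_CLASSES = 122  # dataset.NUM_CLASSES from the YAML config
N_WAY = 5               # --Nway passed on the command line


def extended_num_classes(base_num_classes, n_way):
    """Class count once n_way novel categories are appended to the model."""
    return base_num_classes + n_way


def appended_class_indices(base_num_classes, n_way):
    """Indices the novel categories would occupy if appended at the end.

    Assumption: existing classes keep indices 0..base_num_classes-1.
    """
    return list(range(base_num_classes, base_num_classes + n_way))


num_classes = extended_num_classes(BASE_NUM_CLASSES, N_WAY)
new_indices = appended_class_indices(BASE_NUM_CLASSES, N_WAY)
print(num_classes, new_indices)
```

Under these assumptions a value of 127 at that point is expected rather than a misconfiguration.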

Here is what I have noticed when trying to debug this issue:

It looks like the issue arises when balance_classes is set to True in the configuration YAML file. We enter the balance_classes method of the PyramidAnchorIterator class, which ends up excluding all the examples in my first batch, so self.size becomes 0: https://github.com/jshtok/RepMet/blob/9bdc3f20ff08a8b3ce005af327aba6bf0bb71213/fpn/core/loader.py#L268-L309
- Since self.size is now 0, self.cur_to is also 0, which produces a length-0 slice of the roidb: https://github.com/jshtok/RepMet/blob/9bdc3f20ff08a8b3ce005af327aba6bf0bb71213/fpn/core/loader.py#L424-L426
- Further down, indexing roidb at position 0 is invalid because the list is empty, which raises the error reported in #9: https://github.com/jshtok/RepMet/blob/9bdc3f20ff08a8b3ce005af327aba6bf0bb71213/fpn/core/loader.py#L443-L444
Since we are fine-tuning on episodic data, the balance_classes parameter seems redundant, so I also tried setting it to False. That avoids the index issue, but other issues arise:

NaN values in _filter_boxes !
Error in CustomOp.forward: Traceback (most recent call last):
  File "/home/user/.local/lib/python2.7/site-packages/mxnet/operator.py", line 789, in forward_entry
    aux=tensors[4])
  File "fpn/operator_py/proposal_target.py", line 57, in forward
    assert np.all(all_rois[:, 0] == 0), 'Only single item batches are supported'
AssertionError: Only single item batches are supported
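One possible link between the NaN warning and the assertion, offered as an unverified sketch: a NaN never compares equal to 0, so a check of the form `np.all(all_rois[:, 0] == 0)` would fail if the batch-index column contained NaN, even for a single-image batch. The pure-Python stand-in below mirrors that check without needing numpy. The ROI values and helper names are made up for illustration.

```python
# Hypothetical stand-in for the check in proposal_target.py; not the repo's code.
NAN = float("nan")


def is_single_item_batch(rois):
    """Pure-Python analogue of np.all(all_rois[:, 0] == 0)."""
    return all(row[0] == 0 for row in rois)


def rows_with_nan(rois):
    """Indices of ROI rows that contain at least one NaN value."""
    # v != v is true only for NaN, so no extra imports are needed.
    return [i for i, row in enumerate(rois) if any(v != v for v in row)]


# Each row is (batch_index, x1, y1, x2, y2); the numbers are invented.
clean_rois = [[0.0, 10.0, 20.0, 50.0, 60.0], [0.0, 5.0, 5.0, 30.0, 40.0]]
nan_rois = [[0.0, 10.0, 20.0, 50.0, 60.0], [NAN, NAN, NAN, NAN, NAN]]

print(is_single_item_batch(clean_rois))  # check passes for clean ROIs
print(is_single_item_batch(nan_rois))    # check fails once a NaN row appears
print(rows_with_nan(nan_rois))           # locates the offending row
```

If this reading is right, the "single item batches" message would be a symptom of the NaNs rather than a genuine batch-size problem, but I have not confirmed that in the actual operator.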

(I emailed @jshtok briefly about this a few weeks ago, as I was seeing similar issues when trying to fine-tune on my own data. I did not resolve them, so I decided to get this running on ImageNet first, and I am hitting similar issues there.)

Any ideas on how to fix this?

jlshin commented 3 years ago

Thanks for responding to my email. I understand that it may take a few days to get to this, and I appreciate the help. It may also be worth mentioning that I cannot reproduce the no-fine-tuning results from the paper (Table 3) either. I am getting much worse results (near 0 AP) for each episode.

For the 5-shot 5-way task, I have been comparing my performance against the file posted in this comment: https://github.com/jshtok/RepMet/issues/22#issuecomment-579112508. I am using the correct data (the number of GT examples per episode is the same), but the recall and AP are very far off.

Is there any chance the fpn_pascal_imagenet-0015.params file on the Google Drive was accidentally changed?

jlshin commented 3 years ago

I resolved the issues I was observing. I am going to leave this issue open because I still believe that, in order to fine-tune, you need to set the balance_classes parameter in the config (YAML) file to false. Otherwise, the index error described above is raised.
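For anyone who hits the same index error, here is a small sketch of the failure mode as I understand it: a filter that removes every roidb entry leads to an empty batch slice, and indexing that slice fails. All function names and the sample entries are hypothetical stand-ins, not the code in fpn/core/loader.py, and the filter rule is invented purely to empty the list.

```python
# Hypothetical miniature of the failure; not the repo's loader code.
def balance_classes(roidb, keep_entry):
    """Stand-in filter: keep only entries for which keep_entry returns True."""
    return [entry for entry in roidb if keep_entry(entry)]


def slice_batch(roidb, cur, batch_size):
    """Take the next batch; an empty roidb yields an empty slice."""
    size = len(roidb)
    cur_to = min(cur + batch_size, size)
    return roidb[cur:cur_to]


def first_entry_checked(batch):
    """Fail with an explicit message instead of a bare IndexError."""
    if not batch:
        raise ValueError("empty batch: class balancing removed every roidb entry")
    return batch[0]


roidb = [{"image": "episode0_img0.jpg"}, {"image": "episode0_img1.jpg"}]
filtered = balance_classes(roidb, keep_entry=lambda entry: False)
batch = slice_batch(filtered, cur=0, batch_size=1)

try:
    batch[0]  # what the unguarded code path effectively does
except IndexError as err:
    naive_error = err

try:
    first_entry_checked(batch)
except ValueError as err:
    checked_error = err

print(type(naive_error).__name__, "|", checked_error)
```

A guard like first_entry_checked would not fix the filtering itself, but it would make the cause visible much earlier than the index error does.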

I have found other users' comments helpful so I will return the favor:

The main issue was an incompatibility between my GPU and CUDA 8. I upgraded mxnet 1.0.0 to use CUDA 9 and am now able to fine-tune and train.
- Additionally, this involves changing the lib/nms/gpu_nms.so file to lib/nms/gpu_nms_9.so.

jshtok commented 3 years ago

Hi Joanne,

Thank you for letting me know. I will check how balance_classes() causes the error; maybe it needs fixing. In any case, it is seldom relevant. I am glad you solved the issue.

Best Regards, Joseph

