jshtok / RepMet

Few-shot detection for visual categories
Apache License 2.0

Cannot fine-tune with ImageNet #33

Closed jlshin closed 2 years ago

jlshin commented 3 years ago

I read through a few of the closed and open issues and I am observing an issue similar to #9.

Setup

I am trying to work through the examples listed in the README with the ImageNet data (downloaded via the link provided). I set up the paths accordingly and changed nothing except the file paths in the pickle files downloaded from the Google Drive (voc_inloc_roidb.pkl and voc_inloc_gt_roidb_.pkl). Otherwise, I am using the versions of few_shot_benchmark.py and the config file that are currently on the master branch.

Questions

Issue

I am encountering the out-of-index error that was observed in #9 and am confused by the discussion on that thread. Here is what I tried to run:

python fpn/few_shot_benchmark.py \
    --test_name=RepMet_inloc \
    --Nshot=1 --Nway=5 --Nquery_cat=10 --Nepisodes=500 \
    --do_finetune=1 --num_finetune_epochs=5 --lr=5e-4

Specifically, I am not sure I understand this comment from @jshtok (or whether it is relevant to the solution):

You mentioned (in the email version of this ticket) you have the setting cfg.dataset.NUM_CLASSES=127 but in the .yaml configuration file it is set dataset: NUM_CLASSES: 122 so the NUM_CLASSES should be 122. I don't expect this error to happen if the NUM_CLASSES is correct, please check if this is the case.

NUM_CLASSES is changed from 122 (the value in the YAML file) to 127 when add_reps_to_model is called, and new_cats_to_beginning is hard-coded to False. So unless these parameters are intended to be set to different values, it does not surprise me that NUM_CLASSES = 127 (122 + Nway) here.

https://github.com/jshtok/RepMet/blob/9bdc3f20ff08a8b3ce005af327aba6bf0bb71213/fpn/few_shot_benchmark.py#L661-L663
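The arithmetic behind that observation can be sketched as follows. This is only an illustration, not code from the repository: `expected_num_classes` is a hypothetical helper, and it assumes that with new_cats_to_beginning set to False the Nway novel categories are appended to the base classes, which is what the change from 122 to 127 suggests.

```python
def expected_num_classes(base_num_classes, nway, new_cats_to_beginning=False):
    """Hypothetical helper: class count after adding novel categories.

    Assumption (not taken from the repository): when new_cats_to_beginning
    is False, the nway novel categories are appended to the base classes.
    The True case is not modeled because the thread does not describe it.
    """
    if new_cats_to_beginning:
        raise NotImplementedError(
            "behavior for new_cats_to_beginning=True is not described in the thread"
        )
    return base_num_classes + nway


# Values from this thread: NUM_CLASSES: 122 in the YAML file, --Nway=5 on the command line.
print(expected_num_classes(122, 5))  # 127
```

Under that assumption, seeing cfg.dataset.NUM_CLASSES equal to 127 during fine-tuning would be expected rather than a misconfiguration.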

Here is what I have noticed when trying to debug this issue:

NaN values in _filter_boxes !
Error in CustomOp.forward: Traceback (most recent call last):
  File "/home/user/.local/lib/python2.7/site-packages/mxnet/operator.py", line 789, in forward_entry
    aux=tensors[4])
  File "fpn/operator_py/proposal_target.py", line 57, in forward
    assert np.all(all_rois[:, 0] == 0), 'Only single item batches are supported'
AssertionError: Only single item batches are supported
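
One possible reading of the two messages above, offered as an assumption rather than a diagnosis: if NaN values reach the first column of all_rois, the check in proposal_target.py fails even for a single-item batch, because NaN never compares equal to 0. The sketch below mirrors that check with plain Python lists instead of NumPy arrays. `first_column_is_zero` is a hypothetical stand-in for `np.all(all_rois[:, 0] == 0)`, and the row layout is assumed.

```python
import math


def first_column_is_zero(rois):
    """Hypothetical stand-in for np.all(all_rois[:, 0] == 0)."""
    return all(row[0] == 0 for row in rois)


# Assumed row layout: [batch_index, x1, y1, x2, y2]; the values are made up.
clean_rois = [[0.0, 10.0, 20.0, 50.0, 60.0], [0.0, 5.0, 5.0, 30.0, 40.0]]
nan_rois = [[0.0, 10.0, 20.0, 50.0, 60.0], [math.nan] * 5]

print(first_column_is_zero(clean_rois))  # True
print(first_column_is_zero(nan_rois))    # False, even though no second batch item exists
```

If that is what happens here, the "single item batches" message would be a symptom, and the NaN values reported in _filter_boxes would be the thing to trace.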

(I emailed @jshtok briefly about this a few weeks ago, as I was seeing similar issues when trying to fine-tune on my own data. I did not resolve the issue and figured I would first get this up and running on ImageNet, but I am running into similar issues there.)

Any ideas on how to fix this?

jlshin commented 3 years ago

Thanks for responding to my email. I understand that it may take a few days to get to this, and I appreciate the help. It may also be worth mentioning that I cannot reproduce the no-fine-tuning results from the paper (Table 3) either: I am getting much worse results (near 0 AP) for each episode.

I have been comparing my performance to the file posted in this comment: https://github.com/jshtok/RepMet/issues/22#issuecomment-579112508 for the 5-shot 5-way task. I can see that I am using the correct data (the number of GT examples per episode is the same), but the Recall and AP are very far off.

Any chance the fpn_pascal_imagenet-0015.params file accidentally got changed on the Google Drive?

jlshin commented 3 years ago

I resolved the issues I was observing. I am going to leave this issue open because I still believe that, in order to fine-tune, you need to set the balance_classes parameter in the config (YAML) file to false; otherwise the index error above is raised.
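
A minimal sketch of that config change, applied to an already-loaded config rather than to the YAML file itself. The thread does not say where balance_classes sits in the YAML hierarchy, so the hypothetical helper `set_key_everywhere` searches a nested dict for the key. The example config is made up for illustration, including its `TRAIN` section name.

```python
def set_key_everywhere(config, key, value):
    """Set `key` to `value` wherever it appears in a nested dict; return the hit count."""
    hits = 0
    for name, item in config.items():
        if name == key:
            config[name] = value  # overwriting an existing key is safe while iterating
            hits += 1
        elif isinstance(item, dict):
            hits += set_key_everywhere(item, key, value)
    return hits


# Made-up stand-in for the parsed YAML config; only NUM_CLASSES: 122 comes from the thread.
cfg = {"dataset": {"NUM_CLASSES": 122}, "TRAIN": {"balance_classes": True}}

changed = set_key_everywhere(cfg, "balance_classes", False)
print(changed, cfg["TRAIN"]["balance_classes"])  # 1 False
```

Checking the returned count guards against the key being absent or misspelled, in which case nothing would be changed.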

I have found other users' comments helpful so I will return the favor:

  • The main issue was an incompatibility between my GPU and CUDA 8. I upgraded mxnet 1.0.0 to use CUDA 9 and am now able to fine-tune and train.
    • Additionally, this involves changing the lib/nms/gpu_nms.so file to lib/nms/gpu_nms_9.so.

jshtok commented 3 years ago

Hi Joanne,

Thank you for letting me know. I will check how balance_classes() causes the error; maybe it needs fixing. In any case, it is seldom relevant. I am glad you solved the issue.

Best Regards, Joseph

On Sat, Dec 5, 2020 at 2:33 AM Joanne Shin <notifications@github.com> wrote:

I resolved the issues I was observing. I am going to leave this issue open because I still believe that in order to fine-tune you need to change the balance_classes parameter in the config (yaml) file to false, otherwise the issue with the index error above is raised.

I have found other users' comments helpful so I will return the favor:

  • The main issue was an incompatibility between my GPU and CUDA 8. I upgraded mxnet 1.0.0 to use CUDA 9 and am now able to fine-tune and train.
    • Additionally, this involves changing the lib/nms/gpu_nms.so file to lib/nms/gpu_nms_9.so.
