jshtok / RepMet

Few-shot detection for visual categories
Apache License 2.0

training code #3

Closed liyangliu closed 4 years ago

liyangliu commented 4 years ago

Hi @jshtok, I find there is only testing code in your repository. Could you please add instructions on how to train your model? Thanks.

jshtok commented 4 years ago

Hi Liyang, thank you for pointing this out. The code for full training is there; I will produce instructions for its use shortly.

jshtok commented 4 years ago

Hi, the training code should be working now; please follow the instructions in the README.md. Please comment here if any issues come up.

liyangliu commented 4 years ago

Thank you for your reply.

I tried to train your model, but cannot find some files needed for training, such as "Imagenet_LOC/inloc_cls2id_map.pkl", "ILSVRC/ImageSets/CLS-LOC/train_loc.txt", etc. Would you mind adding the files needed for training? Thanks.

Also, the pretrained model "model/pretrained_model/resnet_v1_101-0000.params" is missing. BTW, have you trained your models without advanced settings such as Deformable Conv and Deformable RoIAlign?

jshtok commented 4 years ago

> Thank you for your reply.
>
> I tried to train your model, but cannot find some files needed for training, such as "Imagenet_LOC/inloc_cls2id_map.pkl", "ILSVRC/ImageSets/CLS-LOC/train_loc.txt", etc. Would you mind adding the files needed for training? Thanks.
>
> Also, the pretrained model "model/pretrained_model/resnet_v1_101-0000.params" is missing. BTW, have you trained your models without advanced settings such as Deformable Conv and Deformable RoIAlign?

Hi, you need to replace the paths '/dccstor/leonidka1/data/VOCdevkit' and '/dccstor/leonidka1/data/imagenet/ILSVRC' with your local paths to Pascal VOC and ImageNet, respectively. You need not just train_loc.txt but the whole dataset.

The pretrained model 'resnet_v1_101-0000.params' should be available in the 'data' folder where the rest of the data files were placed.

I will get back to you regarding reduced architecture experiments.

jshtok commented 4 years ago

Hi Liyang,

We have not tried to train the model without the deformable components. It would indeed be an interesting ablation study to see how much the DCN contributes to the detector.

liyangliu commented 4 years ago

I find that in inloc_cls2id_map.pkl there are only 999 classes (the class 'crane', No. 429, does not exist), and I wonder why this is the case. Thank you.

jshtok commented 4 years ago

Hi, the cache file weighs 2.5 GB, and you'd be better off generating your own. Did you try downloading the ImageNet-LOC dataset? It should contain the annotations. Which link did you try to download from?

> Hi, @jshtok, I am trying to reproduce your experiments without advanced settings like DCN, but I find that I only have the ImageNet dataset for classification and don't have the annotations for the localization task of ILSVRC. Would you mind uploading a copy of your generated cache file 'data/cache/imagenet_clsloc_train_loc_gt_roidb.pkl'? Thanks.

jshtok commented 4 years ago

> I find that in inloc_cls2id_map.pkl there are only 999 classes (the class 'crane', No. 429, does not exist), and I wonder why this is the case. Thank you.

Hi, I am not sure what the issue is here. It is possible we didn't use this class in our experiments, so it did not affect us. Do you use our lists of classes? Is 'crane' in the test set? In any case, please add 'crane' to its position if this solves the problem.
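For anyone hitting the same gap, a quick way to locate and patch a missing ordinal in such a name-to-id map. This is a toy sketch with a three-entry dictionary; the real inloc_cls2id_map.pkl is a pickled dict of roughly 1000 entries, and the names here are illustrative.

```python
# Toy stand-in for a cls2id map: class name -> 1-based ordinal.
# Here ordinal 3 (the 'crane' slot in this sketch) is missing.
cls2id_map = {'kit_fox': 1, 'English_setter': 2, 'Siberian_husky': 4}

def find_missing_ids(cls2id_map, num_classes):
    """Return the ordinals in [1, num_classes] absent from the map."""
    present = set(cls2id_map.values())
    return [i for i in range(1, num_classes + 1) if i not in present]

missing = find_missing_ids(cls2id_map, num_classes=4)
print(missing)  # [3]

# Patch the gap by inserting the absent class at its ordinal.
for ordinal, name in zip(missing, ['crane']):
    cls2id_map[name] = ordinal

assert find_missing_ids(cls2id_map, num_classes=4) == []
```

The same scan can be run on the unpickled map to confirm exactly which ordinal is absent before editing the file.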

liyangliu commented 4 years ago

Thanks, I have generated the roidb cache file from your voc_inloc_roidb.pkl. And I think the missing 'crane' class does not affect training and testing, because you only use the first 101 classes (plus 20 classes from VOC) for training and 214 classes for testing.

jshtok commented 4 years ago

Hi, please notice that the 214 categories are not the contiguous list [102:215] but a selection of classes from throughout ImageNet. Keep me up to date on your progress with model training. In the log file, keep tabs on the FG accuracy measure; it should reach ~60%-70%. Please close the issue if the training works.


liyangliu commented 4 years ago

I am now training the models and the logs are as follows (screenshot: https://user-images.githubusercontent.com/5159489/64908777-c41dbd80-d736-11e9-87c5-b9707977adab.png). Does it seem normal?

I just extracted the 'roidb' from voc_inloc_roidb.pkl to act as the cached imagenet_clsloc_train_loc_gt_roidb.pkl; is that correct? Or do I need to add 21 to every entry of gt_classes, because it is only the class id without offset?

jshtok commented 4 years ago

Yes, this looks fine. Your speed is quite high; you probably have a strong GPU. FG accuracy is expected to rise to two digits within a few epochs.


jshtok commented 4 years ago

Hi Liyang,

I would check whether voc_inloc_roidb.pkl contains the images from the 101 training classes or only the 214 test classes; I don't remember which is the case. Make sure you're training on the correct set of classes, otherwise your few-shot tests will be too good :)

Regards, Joseph


liyangliu commented 4 years ago

I notice that voc_inloc_roidb.pkl contains all 544,546 images from the 1000 classes, and in your code you use inloc_first101_categories.txt to filter the first 101 classes for training. Indeed, I need to add 20 to each ImageNet class because the first 20 classes are from the VOC dataset. Actually, my model training yesterday was not right because I didn't add 20. Now the training process looks more normal (screenshot attached). The R-CNN FG accuracy is now around 40% after 5 epochs.
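The +20 offset described above can be sketched as follows. This is a toy illustration, not the repository's loading code; the real roidb entries carry more fields than shown here.

```python
import numpy as np

VOC_NUM_CLASSES = 20  # classes 1..20 are Pascal VOC; 0 is background

def offset_imagenet_roidb(roidb, offset=VOC_NUM_CLASSES):
    """Shift ImageNet gt_classes so they follow the 20 VOC classes."""
    for entry in roidb:
        entry['gt_classes'] = entry['gt_classes'] + offset
    return roidb

# Toy roidb with one image whose boxes belong to ImageNet classes 1 and 5.
roidb = [{'gt_classes': np.array([1, 5])}]
roidb = offset_imagenet_roidb(roidb)
print(roidb[0]['gt_classes'])  # [21 25]
```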

liyangliu commented 4 years ago

But why are there only 2520 batches per epoch? I notice that the total number of images from the first 101 classes in the ImageNet dataset is 53,812, so with 8 GPUs and 1 image per GPU there should be 6,726 batches. Do you use all the images for training in each epoch, or just sample some images from the whole dataset according to class-balanced sampling?

liyangliu commented 4 years ago

I think there may be a bug because you use both VOC classes and ImageNet classes for training. VOC and ImageNet classes should be treated as different ones, and in the construction of the ImageNet imdb you do this correctly: every ImageNet class is offset by 20, as follows https://github.com/jshtok/RepMet/blob/22aec096cd89a06d008fdd5b335a816545bdf073/lib/dataset/imagenet.py#L74 https://github.com/jshtok/RepMet/blob/22aec096cd89a06d008fdd5b335a816545bdf073/lib/dataset/imagenet.py#L241-L249 But in the filtering of training classes I think you may have forgotten the mentioned offset and do the following https://github.com/jshtok/RepMet/blob/22aec096cd89a06d008fdd5b335a816545bdf073/fpn/core/loader.py#L289-L296 Notice that here you filter out classes whose index is not in the range [1, 101] (both sides included). But the actual n-th class in ImageNet has been mapped to n+20 using the class offset, so here you should filter out classes not in the range [1, 121]. Maybe I am wrong, but I think there is some mismatch between different parts of the code.
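The offset-aware filter described above can be illustrated like this. A hedged sketch: the constants and the keep-or-drop rule are illustrative, not the repository's exact loader logic.

```python
import numpy as np

NUM_VOC = 20             # VOC occupies ordinals 1..20
NUM_TRAIN_IMAGENET = 101  # first-101 ImageNet training categories

def keep_entry(gt_classes, use_voc=True):
    """Keep an image only if all its boxes are training classes.
    With VOC prepended, ImageNet class n becomes n + 20, so the valid
    training range is [1, 121] rather than [1, 101]."""
    hi = NUM_VOC + NUM_TRAIN_IMAGENET if use_voc else NUM_TRAIN_IMAGENET
    return bool(np.all((gt_classes >= 1) & (gt_classes <= hi)))

print(keep_entry(np.array([21, 110])))  # True: ImageNet classes 1 and 90
print(keep_entry(np.array([150])))      # False: beyond the 121-class range
```

With the un-offset range [1, 101], the first example above would be partially rejected even though both boxes belong to training classes, which is exactly the mismatch described.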

Lastly, why do you include VOC classes for training? I think the code is right if we train without VOC classes.

jshtok commented 4 years ago

> But why are there only 2520 batches per epoch? I notice that the total number of images from the first 101 classes in the ImageNet dataset is 53,812, so with 8 GPUs and 1 image per GPU there should be 6,726 batches. Do you use all the images for training in each epoch, or just sample some images from the whole dataset according to class-balanced sampling?

Hi, there is an argument named DATASET.num_ex_per_class, which is set to 200 examples per class. If you count just the 101 ImageNet categories, then the number is about right, but with VOC it should be a bit higher. Also, there is image augmentation (horizontal mirroring), which introduces a factor of 2 (TRAIN.FLIP=True). Please attach your training log file here.
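The per-class cap could be sketched like this. A toy illustration only, assuming each image is bucketed by its first box's class; the repository's loader works differently in detail.

```python
import random
from collections import defaultdict

def sample_per_class(roidb, num_ex_per_class=200, seed=0):
    """Keep at most num_ex_per_class images per (first-box) class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for entry in roidb:
        by_class[entry['gt_classes'][0]].append(entry)
    kept = []
    for cls in sorted(by_class):
        entries = list(by_class[cls])
        rng.shuffle(entries)
        kept.extend(entries[:num_ex_per_class])
    return kept

# Toy roidb: 5 images of class 1, 3 images of class 2, cap at 2 per class.
roidb = [{'gt_classes': [1]}] * 5 + [{'gt_classes': [2]}] * 3
subset = sample_per_class(roidb, num_ex_per_class=2)
print(len(subset))  # 4 (2 per class)
```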

We used the additional VOC data for richer training. Indeed, one needs to track the +20 offset. I will later look at the code you cite above and check whether it works as intended.

liyangliu commented 4 years ago

Thanks @jshtok, here is the log file: resnet_v1_101_voc0712_trainval_fpn_dcn_oneshot_end2end_ohem_8_2019-09-14-23-37.log

jshtok commented 4 years ago

Hi, unfortunately the total number of training samples is not printed to the log, but in general it looks fine.

liyangliu commented 4 years ago

You can see from the following that only 200 samples per class from the first 101 classes are used in each epoch. https://github.com/jshtok/RepMet/blob/22aec096cd89a06d008fdd5b335a816545bdf073/fpn/core/loader.py#L287-L309

liyangliu commented 4 years ago

The total size is not printed in the log file, but in the terminal (screenshot attached).

jshtok commented 4 years ago

Yes, it looks like you're training with just the ImageNet classes. You can remove the restriction to 200 samples per class; it was added for speed. Regarding the code for adding the categ_index_offs value: in loader.py, the values in clsIds2use come from the list of training classes (English names), converted to class ordinals by cls2id_map. So cls2id_map must produce the correct ordinals.

liyangliu commented 4 years ago

Does the code support multi-GPU testing and fine-tuning when doing the benchmark episodic evaluation? Thanks.

liyangliu commented 4 years ago

Would you mind uploading a training log file and a testing file for 500 episodes, 5-shot 5-way, 50 queries, with and without fine-tuning? I can only achieve ~30 AP without fine-tuning after training for 20 epochs (screenshot attached). My training log at the end is attached as well. The visualization of few-shot detection seems right (screenshot attached).

liyangliu commented 4 years ago

I think the uploaded training code may not be correct with respect to the lr scheduling: https://github.com/jshtok/RepMet/blob/22aec096cd89a06d008fdd5b335a816545bdf073/fpn/train_end2end.py#L248-L250 Note that len(roidb) == (number of all images × 2), but we actually only use images from the first 101 (121) classes during training, so the learning-rate decay steps (lr_iters) are not calculated correctly.

jshtok commented 4 years ago

Hi Liyang,

Your model has not trained very well. We observed 55% FG accuracy at the end of epoch 6 and 73% FG accuracy at epoch 14. One possible difference is the Pascal VOC data, which is absent from your training; please add it back. Please try just the first 6 epochs to see if the performance recovers.

I will check the possible issue you've pointed out, namely whether inloc_cls2id_map.pkl takes the +20 class offset into consideration.

jshtok commented 4 years ago

> I think the uploaded training code may not be correct with respect to the lr scheduling: https://github.com/jshtok/RepMet/blob/22aec096cd89a06d008fdd5b335a816545bdf073/fpn/train_end2end.py#L248-L250
>
> Note that len(roidb) == (number of all images × 2), but we actually only use images from the first 101 (121) classes during training, so the learning-rate decay steps (lr_iters) are not calculated correctly.

Thank you for pointing out this potential issue. I will check whether it has a real influence.

liyangliu commented 4 years ago

Do you use all images of the first 101 classes in ImageNet (53,812 images × 2 due to flipping), or just 200 images from each of the first 101 classes (20,200 images) in each epoch? This setting largely affects the number of images used per epoch.

jshtok commented 4 years ago

> Do you use all images of the first 101 classes in ImageNet (53,812 images × 2 due to flipping), or just 200 images from each of the first 101 classes (20,200 images) in each epoch? This setting largely affects the number of images used per epoch.

As I said before, there is a parameter controlling how many images per category are used. Indeed, we used 200 images per category. You can lift this restriction by modifying DATASET.num_ex_per_class.

The restriction to 200 is applied after the flipping, so you have just 20,200 images in total.
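That cap also explains the roughly 2,520 batches per epoch observed earlier in the thread:

```python
num_classes = 101        # first-101 ImageNet training categories
num_ex_per_class = 200   # DATASET.num_ex_per_class (applied after flipping)
gpus, images_per_gpu = 8, 1

epoch_images = num_classes * num_ex_per_class          # 20200 images per epoch
batches_per_epoch = epoch_images // (gpus * images_per_gpu)
print(batches_per_epoch)  # 2525, close to the ~2520 observed
```

The small remaining gap plausibly comes from classes with fewer than 200 images or from batching remainders.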

jshtok commented 4 years ago

For reference, when training on both datasets, I have the following printout in the terminal:

```
'test_idx': 0}
num_images 5011
voc_2007_trainval gt roidb loaded from ./data/cache/voc_2007_trainval_gt_roidb.pkl
append flipped images to roidb
num_images 11540
voc_2012_trainval gt roidb loaded from ./data/cache/voc_2012_trainval_gt_roidb.pkl
append flipped images to roidb
num_images 544546
imagenet_clsloc_train_loc gt roidb loaded from ./data/cache/imagenet_clsloc_train_loc_gt_roidb.pkl
append flipped images to roidb
filtered 0 roidb entries: 1122194 -> 1122194
```
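Those counts are consistent with the flip augmentation doubling every set; a quick check:

```python
# Image counts from the terminal printout above.
voc_2007, voc_2012, imagenet = 5011, 11540, 544546

# TRAIN.FLIP=True appends a mirrored copy of every image to the roidb.
total = (voc_2007 + voc_2012 + imagenet) * 2
print(total)  # 1122194, matching "filtered 0 roidb entries: 1122194 -> 1122194"
```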

jshtok commented 4 years ago

Regarding training with both datasets, you were right about the cls2id_map: the original inloc_cls2id_map.pkl and inloc_first101_categories.txt did not account for Pascal VOC. Thank you for pointing out this issue. I have uploaded two updated files, Pascal_inloc_cls2id_map.pkl and Pascal_inloc_first101_categories.txt, and updated resnet_v1_101_voc0712_trainval_fpn_dcn_oneshot_end2end_ohem_8.yaml.

Now the joint training should work.

liyangliu commented 4 years ago

Thanks for your reply. After changing the learning-rate scheduling, my R-CNN FG accuracy (trained with only the first 101 classes from ImageNet) after 6 epochs is as in the attached screenshot, and the t-SNE of the representatives is attached as well. Does it seem right? I find that the learning rate decays at epochs 4 and 6 in the training configuration file: https://github.com/jshtok/RepMet/blob/bb7cfcd5cbf4ba876317e35328b0b1991c537b5e/experiments/cfgs/resnet_v1_101_voc0712_trainval_fpn_dcn_oneshot_end2end_ohem_8.yaml#L94-L97 But the total number of epochs is 20. Did you really decay this early to obtain the results in your CVPR paper?

jshtok commented 4 years ago

Hi, indeed, we used lr_step: '4,6,20' in the training for the paper. It is possible that better results can be obtained with a different schedule; we did not optimize these parameters.

As for the t-SNE image you show, I believe it is fine, though it is not very representative of the training situation.

Can you please elaborate on how you changed the LR schedule to improve the accuracy? Thanks!

liyangliu commented 4 years ago

I changed len(roidb) to train_data.size in the following line: https://github.com/jshtok/RepMet/blob/bb7cfcd5cbf4ba876317e35328b0b1991c537b5e/fpn/train_end2end.py#L248 But after this change I can only reach 68% FG accuracy after 11 epochs, and I don't think I can reach the performance after 14 epochs you mentioned before, because there will be no further learning-rate decay.

> Your model has not trained very well. We observed 55% FG accuracy at the end of epoch 6 and 73% FG accuracy at epoch 14.
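The fix described above amounts to computing the decay iterations from the actual number of batches per epoch rather than the full roidb length. A minimal sketch with a hypothetical helper (not the repository's code):

```python
def lr_iters(lr_step, epoch_size):
    """Convert epoch-based lr_step (e.g. '4,6,20') into iteration indices.
    epoch_size must be the actual number of batches per epoch
    (train_data.size), not len(roidb), whenever only a subset of the
    roidb is used for training."""
    return [int(float(epoch)) * epoch_size for epoch in lr_step.split(',')]

# With the subsampled dataset (~2525 batches per epoch) the decays land at:
print(lr_iters('4,6,20', 2525))  # [10100, 15150, 50500]
```

Using len(roidb) instead would place the decay points far beyond the iterations actually run, so the learning rate would effectively never decay.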

jshtok commented 4 years ago

Thank you for this input; I will check it when I train RepMet in the future. Possibly the optimal value is somewhere in between.

The performance numbers I cited were for training on the Pascal+ImageNet datasets. Perhaps that makes the difference.

liyangliu commented 4 years ago

Hi @jshtok, in the paper you describe the pretraining setup (screenshot of the relevant passage attached), but the uploaded training code loads a model pretrained on ImageNet. Does this matter?

jshtok commented 4 years ago

The pretrained model is set to pretrained: '/dccstor/jsdata1/dev/RepMet/model/pretrained_model/resnet_v1_101'. How do you conclude it is pretrained on ImageNet?

liyangliu commented 4 years ago

Sorry, I'm not sure which pretrained model you actually load. But in the "data" folder you uploaded to Google Drive, there is only https://drive.google.com/open?id=1bW65QZftDDg7XVexGh957b_7FgHZP0fn. It only has the ResNet backbone, so I don't think this model has been trained on COCO.

Would you mind uploading a COCO-pretrained model to Google Drive? Thanks. I also wonder whether the COCO-pretrained model you mentioned ('/dccstor/jsdata1/dev/RepMet/model/pretrained_model/resnet_v1_101') was trained on COCO from scratch or fine-tuned from ImageNet.

jshtok commented 4 years ago

Hi Liyang, I have checked this issue with Leonid, who did the original training, and apparently I made a mistake: the model resnet_v1_101 is indeed not the one pretrained on COCO. I am now uploading the correct model to \RepMet_CVPR19_data\data\fpn_dcn_coco-0000.params, and I will update the training .yaml accordingly.

I apologize for misleading you.

liyangliu commented 4 years ago

OK, thanks. As far as I know, according to "Rethinking ImageNet Pre-training" by Kaiming He, if we want to train detection models on COCO from scratch, we need to add GN or SyncBN to the model, and we need to train much longer than usual (about 6 times more epochs). Did you use these techniques?

leokarlin commented 4 years ago

No, we didn't use group norm in our original experiments. I agree it would be interesting to add it (and SyncBN) and see if it helps (for the few-shot detection, not necessarily for the pre-training).

fityanul commented 4 years ago

Dear Author,

I also want to train this code and found this error: "ImportError: cannot import name bbox_overlaps_cython". Please let me know what kind of problem this is.

Thank You

jshtok commented 4 years ago

Hi, I believe your CUDA version is not 8.0, for which the C code was compiled. Can you use CUDA 8.0?


fityanul commented 4 years ago

> Hi, I believe your CUDA version is not 8.0, for which the C code was compiled. Can you use CUDA 8.0?

Dear @jshtok, first, thank you very much for the information.

I have tried to rebuild the environment with Python 2.7 and CUDA 8.0, but I found a similar error, as below:

```
RepMet-master>python -V
Python 2.7.16 :: Anaconda, Inc.

RepMet-master>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Mon_Jan__9_17:32:33_CST_2017
Cuda compilation tools, release 8.0, V8.0.60

RepMet-master>python ./experiments/fpn_end2end_train_test.py --cfg=./experiments/cfgs/resnet_v1_101_voc0712_trainval_fpn_dcn_oneshot_end2end_ohem_8.yaml
Traceback (most recent call last):
  File "./experiments/fpn_end2end_train_test.py", line 23, in <module>
    import train_end2end
  File "./experiments\..\fpn\train_end2end.py", line 41, in <module>
    from symbols import *
  File "./experiments\..\fpn\symbols\__init__.py", line 1, in <module>
    import resnet_v1_101_fpn_rcnn
  File "./experiments\..\fpn\symbols\resnet_v1_101_fpn_rcnn.py", line 12, in <module>
    from operator_py.pyramid_proposal import *
  File "./experiments\..\fpn\operator_py\pyramid_proposal.py", line 14, in <module>
    from bbox.bbox_transform import bbox_pred, clip_boxes
  File "./experiments\..\fpn\..\lib\bbox\bbox_transform.py", line 2, in <module>
    from bbox import bbox_overlaps_cython
ImportError: cannot import name bbox_overlaps_cython
```

Please let me know if there is something wrong here.

Thank You

jshtok commented 4 years ago


Hi @fityanul, I understand that the supplied compiled file bbox.so still does not match your system. Please run setup_linux.py / setup_windows.py (depending on your OS) under ./lib/bbox to compile this file anew.
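For reference, a Cython build script of the kind setup_linux.py / setup_windows.py implement might look like the following. This is a sketch under the assumption that the extension is a single bbox.pyx; the repository's actual scripts may differ in names and options.

```python
# Minimal Cython build script for the bbox extension (sketch).
# Run as: python setup.py build_ext --inplace
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    'bbox',                           # module name that the import expects
    sources=['bbox.pyx'],
    include_dirs=[np.get_include()],  # the .pyx uses the NumPy C API
)

setup(name='fast_rcnn', ext_modules=cythonize([ext]))
```

Running it with build_ext --inplace drops the compiled module next to the sources, which is what the `from bbox import bbox_overlaps_cython` import resolves against.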

fityanul commented 4 years ago

> Hi @fityanul, I understand that the supplied compiled file bbox.so still does not match your system. Please run setup_linux.py / setup_windows.py (depending on your OS) under ./lib/bbox to compile this file anew.

Dear @jshtok

Thank you very much for your reply. I found this was an environment setup problem and tried following your steps, as below:

```
RepMet-master\lib\bbox>python setup_windows.py install
running install
running bdist_egg
running egg_info
creating fast_rcnn.egg-info
writing fast_rcnn.egg-info\PKG-INFO
writing top-level names to fast_rcnn.egg-info\top_level.txt
writing dependency_links to fast_rcnn.egg-info\dependency_links.txt
writing manifest file 'fast_rcnn.egg-info\SOURCES.txt'
reading manifest file 'fast_rcnn.egg-info\SOURCES.txt'
writing manifest file 'fast_rcnn.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_ext
cythoning bbox.pyx to bbox.c
building 'bbox' extension
creating build
creating build\temp.win-amd64-2.7
creating build\temp.win-amd64-2.7\Release
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -IC:\Users\FITYANUL\Anaconda2\envs\mxnet\lib\site-packages\numpy\core\include -IC:\Users\FITYANUL\Anaconda2\envs\mxnet\include -IC:\Users\FITYANUL\Anaconda2\envs\mxnet\PC /Tcbbox.c /Fobuild\temp.win-amd64-2.7\Release\bbox.obj
bbox.c
c:\users\fityanul\anaconda2\envs\mxnet\lib\site-packages\numpy\core\include\numpy\npy_1_7_deprecated_api.h(14) : Warning Msg: Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
bbox.c(1943): warning C4244: '=': conversion from 'npy_intp' to 'unsigned int', possible loss of data
bbox.c(1952): warning C4244: '=': conversion from 'npy_intp' to 'unsigned int', possible loss of data
creating build\lib.win-amd64-2.7
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\link.exe /DLL /nologo /INCREMENTAL:NO /LIBPATH:C:\Users\FITYANUL\Anaconda2\envs\mxnet\libs /LIBPATH:C:\Users\FITYANUL\Anaconda2\envs\mxnet\PCbuild\amd64 /LIBPATH:C:\Users\FITYANUL\Anaconda2\envs\mxnet\PC\VS9.0\amd64 /EXPORT:initbbox build\temp.win-amd64-2.7\Release\bbox.obj /OUT:build\lib.win-amd64-2.7\bbox.pyd /IMPLIB:build\temp.win-amd64-2.7\Release\bbox.lib /MANIFESTFILE:build\temp.win-amd64-2.7\Release\bbox.pyd.manifest
bbox.obj : warning LNK4197: export 'initbbox' specified multiple times; using first specification
   Creating library build\temp.win-amd64-2.7\Release\bbox.lib and object build\temp.win-amd64-2.7\Release\bbox.exp
creating build\bdist.win-amd64
creating build\bdist.win-amd64\egg
copying build\lib.win-amd64-2.7\bbox.pyd -> build\bdist.win-amd64\egg
creating stub loader for bbox.pyd
byte-compiling build\bdist.win-amd64\egg\bbox.py to bbox.pyc
creating build\bdist.win-amd64\egg\EGG-INFO
copying fast_rcnn.egg-info\PKG-INFO -> build\bdist.win-amd64\egg\EGG-INFO
copying fast_rcnn.egg-info\SOURCES.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying fast_rcnn.egg-info\dependency_links.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying fast_rcnn.egg-info\top_level.txt -> build\bdist.win-amd64\egg\EGG-INFO
writing build\bdist.win-amd64\egg\EGG-INFO\native_libs.txt
zip_safe flag not set; analyzing archive contents...
creating dist
creating 'dist\fast_rcnn-0.0.0-py2.7-win-amd64.egg' and adding 'build\bdist.win-amd64\egg' to it
removing 'build\bdist.win-amd64\egg' (and everything under it)
Processing fast_rcnn-0.0.0-py2.7-win-amd64.egg
Copying fast_rcnn-0.0.0-py2.7-win-amd64.egg to c:\users\fityanul\anaconda2\envs\mxnet\lib\site-packages
Adding fast-rcnn 0.0.0 to easy-install.pth file

Installed c:\users\fityanul\anaconda2\envs\mxnet\lib\site-packages\fast_rcnn-0.0.0-py2.7-win-amd64.egg
Processing dependencies for fast-rcnn==0.0.0
Finished processing dependencies for fast-rcnn==0.0.0
```

But I still have the same problem:

```
RepMet-master>python ./experiments/fpn_end2end_train_test.py --cfg=./experiments/cfgs/resnet_v1_101_voc0712_trainval_fpn_dcn_oneshot_end2end_ohem_8.yaml
Traceback (most recent call last):
  File "./experiments/fpn_end2end_train_test.py", line 23, in <module>
    import train_end2end
  File "./experiments\..\fpn\train_end2end.py", line 41, in <module>
    from symbols import *
  File "./experiments\..\fpn\symbols\__init__.py", line 1, in <module>
    import resnet_v1_101_fpn_rcnn
  File "./experiments\..\fpn\symbols\resnet_v1_101_fpn_rcnn.py", line 12, in <module>
    from operator_py.pyramid_proposal import *
  File "./experiments\..\fpn\operator_py\pyramid_proposal.py", line 14, in <module>
    from bbox.bbox_transform import bbox_pred, clip_boxes
  File "./experiments\..\fpn\..\lib\bbox\bbox_transform.py", line 2, in <module>
    from bbox import bbox_overlaps_cython
ImportError: cannot import name bbox_overlaps_cython
```

Please let me know if I made a mistake when running setup_windows.py.

Thank You

fityanul commented 4 years ago

bbox_overlaps_cython

Hi @fityanul , I understand that still the supplied compiled file bbox.so does not match your system. Please run the setup_linux.py / setup_windows.py (depending on your OS) under ./lib/bbox to compile this file anew.

Dear @jshtok

Always thank You very much for Your reply. I found this is an environment setup problem and tried following Your step as bellow:

RepMet-master\lib\bbox>python setup_windows.py install running install running bdist_egg running egg_info creating fast_rcnn.egg-info writing fast_rcnn.egg-info\PKG-INFO writing top-level names to fast_rcnn.egg-info\top_level.txt writing dependency_links to fast_rcnn.egg-info\dependency_links.txt writing manifest file 'fast_rcnn.egg-info\SOURCES.txt' reading manifest file 'fast_rcnn.egg-info\SOURCES.txt' writing manifest file 'fast_rcnn.egg-info\SOURCES.txt' installing library code to build\bdist.win-amd64\egg running install_lib running build_ext cythoning bbox.pyx to bbox.c building 'bbox' extension creating build creating build\temp.win-amd64-2.7 creating build\temp.win-amd64-2.7\Release C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -IC:\Users\FITYANUL\Anaconda2\envs\mxnet\lib\site-packages\numpy\core\include -IC:\Users\FITYANUL\Anaconda2\envs\mxnet\include -IC:\Users\FITYANUL\Anaconda2\envs\mxnet\PC /Tcbbox.c /Fobuild\temp.win-amd64-2.7\Release\bbox.obj bbox.c c:\users\fityanul\anaconda2\envs\mxnet\lib\site-packages\numpy\core\include\numpy\npy_1_7_deprecated_api.h(14) : Warning Msg: Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION bbox.c(1943): warning C4244: '=': conversion from 'npy_intp' to 'unsigned int', possible loss of data bbox.c(1952): warning C4244: '=': conversion from 'npy_intp' to 'unsigned int', possible loss of data creating build\lib.win-amd64-2.7 C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\link.exe /DLL /nologo /INCREMENTAL:NO /LIBPATH:C:\Users\FITYANUL\Anaconda2\envs\mxnet\libs /LIBPATH:C:\Users\FITYANUL\Anaconda2\envs\mxnet\PCbuild\amd64 /LIBPATH:C:\Users\FITYANUL\Anaconda2\envs\mxnet\PC\VS9.0\amd64 /EXPORT:initbbox build\temp.win-amd64-2.7\Release\bbox.obj /OUT:build\lib.win-amd64-2.7\bbox.pyd /IMPLIB:build\temp.win-amd64-2.7\Release\bbox.lib /MANIFESTFILE:build\temp.win-amd64-2.7\Release\bbox.pyd.manifest 
bbox.obj : warning LNK4197: export 'initbbox' specified multiple times; using first specification Creating library build\temp.win-amd64-2.7\Release\bbox.lib and object build\temp.win-amd64-2.7\Release\bbox.exp creating build\bdist.win-amd64 creating build\bdist.win-amd64\egg copying build\lib.win-amd64-2.7\bbox.pyd -> build\bdist.win-amd64\egg creating stub loader for bbox.pyd byte-compiling build\bdist.win-amd64\egg\bbox.py to bbox.pyc creating build\bdist.win-amd64\egg\EGG-INFO copying fast_rcnn.egg-info\PKG-INFO -> build\bdist.win-amd64\egg\EGG-INFO copying fast_rcnn.egg-info\SOURCES.txt -> build\bdist.win-amd64\egg\EGG-INFO copying fast_rcnn.egg-info\dependency_links.txt -> build\bdist.win-amd64\egg\EGG-INFO copying fast_rcnn.egg-info\top_level.txt -> build\bdist.win-amd64\egg\EGG-INFO writing build\bdist.win-amd64\egg\EGG-INFO\native_libs.txt zip_safe flag not set; analyzing archive contents... creating dist creating 'dist\fast_rcnn-0.0.0-py2.7-win-amd64.egg' and adding 'build\bdist.win-amd64\egg' to it removing 'build\bdist.win-amd64\egg' (and everything under it) Processing fast_rcnn-0.0.0-py2.7-win-amd64.egg Copying fast_rcnn-0.0.0-py2.7-win-amd64.egg to c:\users\fityanul\anaconda2\envs\mxnet\lib\site-packages Adding fast-rcnn 0.0.0 to easy-install.pth file

Installed c:\users\fityanul\anaconda2\envs\mxnet\lib\site-packages\fast_rcnn-0.0.0-py2.7-win-amd64.egg Processing dependencies for fast-rcnn==0.0.0 Finished processing dependencies for fast-rcnn==0.0.0

But, i still have same problem:

RepMet-master>python ./experiments/fpn_end2end_train_test.py --cfg=./experiments/cfgs/resnet_v1_101_voc0712_trainval_fpn_dcn_oneshot_end2end_ohem_8.yaml Traceback (most recent call last): File "./experiments/fpn_end2end_train_test.py", line 23, in import train_end2end File "./experiments..\fpn\train_end2end.py", line 41, in from symbols import File "./experiments..\fpn\symbolsinit.py", line 1, in import resnet_v1_101_fpn_rcnn File "./experiments..\fpn\symbols\resnet_v1_101_fpn_rcnn.py", line 12, in from operator_py.pyramid_proposal import File "./experiments..\fpn\operator_py\pyramid_proposal.py", line 14, in from bbox.bbox_transform import bbox_pred, clip_boxes File "./experiments..\fpn..\lib\bbox\bbox_transform.py", line 2, in from bbox import bbox_overlaps_cython ImportError: cannot import name bbox_overlaps_cython

Please let me know; maybe I made a mistake when running setup_windows.py.

Thank You
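When this ImportError shows up, a quick sanity check is whether the compiled Cython extension can actually be imported from the current path. The helper below is hypothetical (not part of the repository); the module name to check comes from the traceback above:

```python
import importlib

def can_import(name):
    # True if the (possibly compiled) module can be imported from sys.path.
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False

# In the RepMet root, with lib/ on sys.path, check can_import("bbox.bbox_transform");
# it returns False until 'build_ext --inplace' has produced the extension.
print(can_import("importlib"))  # → True (stdlib module, always importable)
```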

Dear @jshtok

The training process is now running after I used this command:

"python setup_windows.py build_ext --inplace"

and here is a problem I found:

RepMet-master>python ./experiments/fpn_end2end_train_test.py --cfg=./experiments/cfgs/resnet_v1_101_voc0712_trainval_fpn_dcn_oneshot_end2end_ohem_8.yaml
('Called with argument:', Namespace(cfg='./experiments/cfgs/resnet_v1_101_voc0712_trainval_fpn_dcn_oneshot_end2end_ohem_8.yaml', debug=0, frequent=10))
Traceback (most recent call last):
  File "./experiments/fpn_end2end_train_test.py", line 31, in <module>
    train_end2end.main()
  File "./experiments\..\fpn\train_end2end.py", line 291, in main
    config.TRAIN.begin_epoch, config.TRAIN.end_epoch, config.TRAIN.lr, config.TRAIN.lr_step)
  File "./experiments\..\fpn\train_end2end.py", line 63, in train_net
    logger, final_output_path = create_logger(config.output_path, args.cfg, config.dataset.image_set)
  File "./experiments\..\lib\utils\create_logger.py", line 31, in create_logger
    logging.basicConfig(filename=os.path.join(final_output_path, log_file), format=head)
  File "C:\Users\FITYANUL\Anaconda2\envs\mxnet\lib\logging\__init__.py", line 1554, in basicConfig
    hdlr = FileHandler(filename, mode)
  File "C:\Users\FITYANUL\Anaconda2\envs\mxnet\lib\logging\__init__.py", line 920, in __init__
    StreamHandler.__init__(self, self._open())
  File "C:\Users\FITYANUL\Anaconda2\envs\mxnet\lib\logging\__init__.py", line 950, in _open
    stream = open(self.baseFilename, self.mode)
IOError: [Errno 2] No such file or directory: 'output\fpn\voc_imagenet\resnet_v1_101_voc0712_trainval_fpn_dcn_oneshot_end2end_ohem_8\2007_trainval+2012_trainval_train_loc\resnet_v1_101_voc0712_trainval_fpn_dcn_oneshot_end2end_ohem_8_2019-09-29-01-32.log'

Please let me know if you have any insight.

Thank You

jshtok commented 4 years ago

@fityanul Hi, I am glad you solved the Cython issue; I have no experience with this on Windows. The error you are showing now is strange: it is caused by the logger attempting to create a log file in a folder that does not exist yet. In my runs this folder was created by the code, but something went wrong here. Please try to create the folder output\fpn\voc_imagenet\resnet_v1_101_voc0712_trainval_fpn_dcn_oneshot_end2end_ohem_8\2007_trainval+2012_trainval_train_loc manually and run the training again.

fityanul commented 4 years ago

> @fityanul Hi, I am glad you solved the Cython issue; I have no experience with this on Windows. The error you are showing now is strange: it is caused by the logger attempting to create a log file in a folder that does not exist yet. In my runs this folder was created by the code, but something went wrong here. Please try to create the folder output\fpn\voc_imagenet\resnet_v1_101_voc0712_trainval_fpn_dcn_oneshot_end2end_ohem_8\2007_trainval+2012_trainval_train_loc manually and run the training again.

Dear @jshtok, thank you very much for your reply.

There is no problem with creating the folders, and I think the issue is that no log file exists inside the folder. I am trying full-model training on 1 GPU using this dataset: ILSVRC2017_CLS-LOC.tar.gz, and I think the problem is in the .yaml file configuration.

My .yaml config is attached; please let me know if something is wrong here.

resnet_v1_101_voc0712_trainval_fpn_dcn_oneshot_end2end_ohem_8.zip

Thank You

jshtok commented 4 years ago

> > @fityanul Hi, I am glad you solved the Cython issue; I have no experience with this on Windows. The error you are showing now is strange: it is caused by the logger attempting to create a log file in a folder that does not exist yet. In my runs this folder was created by the code, but something went wrong here. Please try to create the folder output\fpn\voc_imagenet\resnet_v1_101_voc0712_trainval_fpn_dcn_oneshot_end2end_ohem_8\2007_trainval+2012_trainval_train_loc manually and run the training again.
>
> Dear @jshtok, thank you very much for your reply.
>
> There is no problem with creating the folders, and I think the issue is that no log file exists inside the folder. I am trying full-model training on 1 GPU using this dataset: ILSVRC2017_CLS-LOC.tar.gz, and I think the problem is in the .yaml file configuration.
>
> My .yaml config is attached; please let me know if something is wrong here.
>
> resnet_v1_101_voc0712_trainval_fpn_dcn_oneshot_end2end_ohem_8.zip
>
> Thank You

Dear @fityanul, Indeed, there are issues in the .yaml file that I resolved in a later commit. Specifically, the pretrained model path still points to my storage, '/dccstor/...'; this should not appear anywhere in your configuration. Please see the current version of the ..._ohem_8.yaml in the repository. Note that the 'pretrained:' argument should contain the path to the initial network file, but with the file-name ending omitted.
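For context, MXNet checkpoints are addressed by a name prefix plus an epoch number; the loader appends the '-%04d.params' (and '-symbol.json') endings itself, which is why the 'pretrained:' entry should hold only the prefix. A small sketch of the naming convention (the prefix value is just an example, not a required path):

```python
def checkpoint_names(prefix, epoch):
    # MXNet saves '<prefix>-symbol.json' and '<prefix>-%04d.params',
    # so a config entry like 'pretrained:' holds only the prefix.
    return prefix + "-symbol.json", "%s-%04d.params" % (prefix, epoch)

symbol_file, params_file = checkpoint_names("./model/pretrained_model/resnet_v1_101", 0)
print(params_file)  # → ./model/pretrained_model/resnet_v1_101-0000.params
```

So a file named resnet_v1_101-0000.params corresponds to pretrained: .../resnet_v1_101 with pretrained_epoch: 0 in the config.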