data split mistake in your code.

penghu-cs / UCCH

Unsupervised Contrastive Cross-modal Hashing (IEEE TPAMI 2023, PyTorch Code)


data split mistake in your code. #5

Closed kalenforn closed 1 year ago

kalenforn commented 1 year ago

I noticed that there may be some issues with how the dataset is being split. In src/cmdataset.py, line 138 and subsequent 'else' statements, the training dataset may not be properly separated from the retrieval dataset. As a result, I found that the lengths of the train_dataset and retrieval_dataset were the same when I printed them in UCCH.py. This could potentially lead to the model overfitting due to the presence of prior information during training. I kindly request your attention to this matter and would greatly appreciate it if you could look into fixing this.

penghu-cs commented 1 year ago

Thank you for your attention. This setting is commonly used in unsupervised cross-modal hashing, e.g., [13]. It is also more applicable in real-world applications since unlabeled data tends to be more readily available, and we can use the retrieval set as the training set. It is important to note, however, that the query set should remain unseen during testing. Additionally, this setting cannot produce overfitting problems, which may often arise from small data or flawed training strategies and models.

[13] Li C, Deng C, Wang L, et al. Coupled cyclegan: Unsupervised hashing network for cross-modal retrieval[C]//Proceedings of the AAAI conference on artificial intelligence. 2019, 33(01): 176-183.
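
For readers following along, the protocol described in this reply can be sketched as below. This is only a minimal illustration under stated assumptions, not the repository's code: `split_unsupervised` and the toy dataset size are hypothetical, and the sketch works on plain index lists rather than the actual image/tag arrays. The last `test_size` shuffled samples form the unseen query set, and the remaining samples serve as both the retrieval set and the training set.

```python
import random


def split_unsupervised(num_samples, test_size=2000, seed=0):
    """Hypothetical helper: split sample indices for the unsupervised setting.

    Returns (train_idx, retrieval_idx, query_idx), where the training set
    is the whole retrieval set and the query set is held out.
    """
    if not 0 < test_size < num_samples:
        raise ValueError("test_size must be between 1 and num_samples - 1")

    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # assumption: one fixed shuffle

    query_idx = indices[-test_size:]      # held out, never seen in training
    retrieval_idx = indices[:-test_size]  # database searched at test time
    train_idx = list(retrieval_idx)       # unsupervised: train on all of it
    return train_idx, retrieval_idx, query_idx


if __name__ == "__main__":
    train_idx, retrieval_idx, query_idx = split_unsupervised(10000)
    # Same length is expected here; what matters is that no query leaks in.
    print(len(train_idx), len(retrieval_idx), len(query_idx))
    print(set(train_idx).isdisjoint(query_idx))
```

Under this sketch, equal lengths for the training and retrieval sets are expected; the property to check is that the query indices never appear in the training indices.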

kalenforn commented 1 year ago

I think the training data should be randomly selected from the retrieval data, but in your training code the two sets are identical.

penghu-cs commented 1 year ago

This setting is commonly used in unsupervised cross-modal hashing, e.g., [13]. In unsupervised cross-modal retrieval, the retrieval set is often used as the training set.

kalenforn commented 1 year ago

    imgs, tags, labels = imgs[inx], tags[inx], labels[inx]
    test_size = 2000
    if 'test' in partition.lower():
        imgs, tags, labels = imgs[-test_size::], tags[-test_size::], labels[-test_size::]
    else:
        imgs, tags, labels = imgs[0: -test_size], tags[0: -test_size], labels[0: -test_size]

    return imgs.transpose([0, 3, 2, 1]), tags, labels, root

Here is your data split method, but where is the training set? Your paper states, "we randomly select 5,000 pairs from the retrieval database as their training set.", but there is no training-set selection strategy in your code. Did you forget to provide it?
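
To make the question concrete, the selection step that the quoted sentence describes might look roughly like the sketch below. It is purely illustrative: `sample_training_subset` is a hypothetical name that does not exist in the repository, the toy retrieval-set size is made up, and the sketch samples indices rather than the actual image/tag pairs.

```python
import random


def sample_training_subset(retrieval_idx, train_size=5000, seed=0):
    """Hypothetical helper: draw train_size distinct indices from the retrieval set."""
    retrieval_idx = list(retrieval_idx)
    if train_size > len(retrieval_idx):
        raise ValueError("train_size exceeds the size of the retrieval set")
    # random.sample draws without replacement, so no pair is selected twice.
    return sorted(random.Random(seed).sample(retrieval_idx, train_size))


if __name__ == "__main__":
    retrieval_idx = range(18000)  # toy size, not a real dataset statistic
    train_idx = sample_training_subset(retrieval_idx)
    print(len(train_idx), len(retrieval_idx))
```

With a step like this, the training set would be a strict subset of the retrieval set, so the two lengths would differ when printed.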

penghu-cs commented 1 year ago

Hi,

This is for the supervised baselines. Thus, it is not in our method. Thanks.

Best regards, Peng Hu

Lucky-Light-Sun commented 9 months ago

So the dataset configuration in Section 4.1 of the UCCH paper is mainly for the supervised methods, and UCCH's training set is simply the same as the retrieval set. Do I understand correctly?

penghu-cs commented 9 months ago

Yes. We have stated the configuration for unsupervised and supervised methods in the section.

Best, Peng
