keyu-tian / SparK

[ICLR'23 Spotlight🔥] The first successful BERT/MAE-style pretraining on any convolutional network; Pytorch impl. of "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"
https://arxiv.org/abs/2301.03580
MIT License

A question about data leakage. #38

Closed r1cheu closed 1 year ago

r1cheu commented 1 year ago

Hi @keyu-tian! Thank you for proposing this amazing work. I have a question about data leakage. Here is my situation: I have an unlabeled dataset of about 200k images. Can I use the whole dataset to pretrain a backbone, and then use that backbone for a detection task on the same dataset (only a small labeled subset of it)? Would that cause a data leakage issue, since the SparK pretraining set may contain images from the validation set? My final goal is to count objects across the whole 200k-image dataset via object detection.

r1cheu commented 1 year ago

By the way, the quotes around ${*:2} in pretrain/main.sh (line 43) pass all the remaining arguments as one string argument and caused the error below in my test:

    launch.py: error: argument --num_nodes: invalid int value: '1 --ngpu_per_node=8 --data_path=/xxx/xxx/xxx --model=resnet50 --bs=512'

After I removed the quotes it worked fine.
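
Here is a minimal sketch reproducing the behavior with argparse, using only the flag names from the error above (illustration only, not launch.py itself): with the quotes, the shell passes all trailing arguments as a single string, so argparse tries to parse that whole string as the value of --num_nodes.

    # minimal reproduction of the quoting issue
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--num_nodes', type=int)
    parser.add_argument('--ngpu_per_node', type=int)

    # unquoted ${*:2}: the shell splits the arguments, so parsing succeeds
    print(parser.parse_args(['--num_nodes=1', '--ngpu_per_node=8']))  # both values parse as ints

    # quoted "${*:2}": everything arrives as one argv entry -> "invalid int value", as above
    # parser.parse_args(['--num_nodes=1 --ngpu_per_node=8'])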

keyu-tian commented 1 year ago

@r1cheu thank you so much for the bug fix. I have therefore removed main.sh and launch.py and now give the corresponding torchrun commands directly in the READMEs.

As for your question, I think the key is your exact goal. If this is academic research, I agree that excluding the validation data from pretraining is necessary, otherwise it may cause data leakage. But if all you want is better self-labeling quality, i.e. you're just working with your own dataset, then I think it would be better to do SparK pretraining on all the data, because this lets the model become familiar with the unlabeled validation images before the detection finetuning.
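
If you do need to exclude the validation data, a minimal sketch of building a leakage-free pretraining list could look like this (plain image folders assumed; the directory names are hypothetical):

    # build a pretraining file list that excludes the labeled validation images,
    # so SparK never sees the images later used to validate the detector
    from pathlib import Path

    all_images = {p.name for p in Path('all_200k_images').glob('*.jpg')}
    val_images = {p.name for p in Path('detection_val/images').glob('*.jpg')}

    pretrain_images = sorted(all_images - val_images)
    print(f'{len(pretrain_images)} images kept for SparK pretraining')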

PS: if this is for research, I would also recommend asking other researchers in your field.

r1cheu commented 1 year ago

Thanks for your reply. I also found that viz_reconstruction.ipynb does not load the correct weights when I tried to visualize my custom pretrained model. Adding a tutorial for this might help. Here is my modification.

    # change the path below to your custom pretraining checkpoint, e.g. output_debug/resnet50_still_pretraining.pth
    ckpt_path = 'output_debug/resnet50_still_pretraining.pth'
    pretrained_state = torch.load(ckpt_path, map_location='cpu')
    # the pretrained weights are stored under the 'module' key of the checkpoint
    using_bn_in_densify = 'densify_norms.0.running_mean' in pretrained_state['module']

    # build a SparK model
    enc: SparseEncoder = build_sparse_encoder(model_name, input_size=input_size)
    spark = SparK(
        sparse_encoder=enc, dense_decoder=LightDecoder(enc.downsample_raito, sbn=True),  # sbn=True, matching the pretraining setup
        mask_ratio=0.6, densify_norm='bn' if using_bn_in_densify else 'ln', sbn=True,
    ).to(DEVICE)
    spark.eval(), [p.requires_grad_(False) for p in spark.parameters()]

    # load the checkpoint (weights again taken from the 'module' key)
    missing, unexpected = spark.load_state_dict(pretrained_state['module'], strict=False)

Also, how about adding a load_from option that just loads the pretrained weights and continues training from them?

keyu-tian commented 1 year ago

Many thanks for your experience and advice! For the notebook, I tend to keep it unchanged and leave it to the user to modify as they wish, since different people may have different requirements.

As for load_from, I think it's a nice idea, e.g. loading our pretrained weights and then performing a new pretraining run on your own dataset. This is worth trying and may yield better results than pretraining on your dataset from scratch. I have added this option.
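
A minimal sketch of what such weight loading could look like (the name load_from and the helper below are illustrative, not the repo's exact implementation):

    import torch

    def load_from_checkpoint(spark_model: torch.nn.Module, load_from: str) -> None:
        """If `load_from` is set, initialize the model from a SparK pretraining checkpoint."""
        if not load_from:
            return
        state = torch.load(load_from, map_location='cpu')
        # pretraining checkpoints keep the model weights under the 'module' key
        weights = state.get('module', state)
        missing, unexpected = spark_model.load_state_dict(weights, strict=False)
        print(f'[load_from] missing keys:    {missing}')
        print(f'[load_from] unexpected keys: {unexpected}')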

keyu-tian commented 1 year ago

I have updated viz_reconstruction.ipynb in 1468df8, following your modification. Thank you again.