jchenghu / ExpansionNet_v2

Implementation code of the work "Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning"
https://arxiv.org/abs/2208.06551
MIT License

Fine-tune on Custom Dataset #12

Closed · JFcy closed this issue 1 month ago

JFcy commented 3 months ago

Hello,

I have recently been looking to fine-tune your project. Could you please advise at which step I should replace the .pth with the pre-trained model? When I ran the code myself, I started from step three, replacing phase2_checkpoint with the pre-trained model, but encountered the error shown in the attached screenshot. Could you please offer any solutions?

jchenghu commented 3 months ago

Hi

Regarding the fine-tuning part, I assume you want to change the head of the model (the last prediction layer of size linear(512, num_classes)) and replace it with a linear(512, 126), where 126 refers to the number of classes in your custom dataset.

You can use the stage 3 checkpoint for faster testing if your input consists of COCO images, since the stage 3 checkpoint expects the backbone features as input. If that's the case, good.

If your custom dataset not only has a different number of classes but is also not part of COCO, you might need to update the stage 4 model, which is the complete model comprising the backbone and the fusion model.


Regarding the error shown in your image: 10,000 is the initial vocabulary size and 126 seems to be your custom number of classes. To fix this error, I suggest adding something like

from torch.nn import Parameter  # needed for the isinstance check below

def partially_load_state_dict(model, state_dict, verbose=False, max_num_print=5):
    own_state = model.state_dict()
    count_print = 0
    for name, param in state_dict.items():

        if name.startswith('vocab_linear'):
            continue   # <--- this bit: skip the old vocabulary projection

        if name not in own_state:
            if verbose:
                print("Not found: " + str(name))
            continue
        if isinstance(param, Parameter):
            param = param.data
        own_state[name].copy_(param)
        if verbose:
            if count_print < max_num_print:
                print("Found: " + str(name))
                count_print += 1

in the saving utils function. This basically skips loading the vocabulary projection at the end and should fix your issue.

Let me know if it helps,

JFcy commented 3 months ago

I apologize for any misunderstanding my previous statements may have caused. I intended to further train the model you had already trained, using my own constructed dataset, to see what kind of results could be achieved. Therefore, I started training your trained model following the third step in the project's readme, changing the limited_num_val_images value in the dataset to 3 and setting phase2_checkpoint to the rf_model.pth that you trained. The error shown in the image above occurred during training.

Indeed, as you said, 126 is the number of categories in the dataset I constructed. Moreover, when I just tried modifying the partially_load_state_dict function as you suggested, the same error still occurred.

If I intend to further train your model, do I need to ensure that the vocabulary size of my dataset reaches 10,000? Additionally, if I want to continue training your pre-trained model by changing phase2_checkpoint to rf_model.pth in step 3 and then following the readme, what issues should I pay attention to?

jchenghu commented 3 months ago

Hi,

Sorry if I did not understand your intentions; let me recap a little bit.

You would like to perform fine-tuning, in the form of end-to-end cross-entropy training (since you start from stage 3), on top of rf_model.pth, which is already an end-to-end checkpoint (it contains the weights of both the backbone and the fusion model), using your custom dataset of 126 labels.

Is that correct?

About the best way to do it, I think it depends on the nature of these 126 labels

Scenario A: if there is no overlap between these 126 labels and the 10,000 that are currently used, then I think you should just get rid of the existing 10,000-class head and put in a newly initialized one of 126 classes, to be fine-tuned with additional training steps.

Beware, however, that if no overlap exists, the weights of the decoder are not helpful anymore, since we are discarding not only the head (as in typical LLM fashion) but also the input embeddings.

Scenario B: if overlap exists between the 126 labels and these 10,000 classes (perhaps not the same tokens, but semantically similar ones), you can design some mapping, or simply discard the classes that are not relevant to your domain and preserve the others. This makes use of the relationships the current model has already learned.
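To make the mapping idea concrete, here is a minimal sketch (not code from the repository; old_word2idx, new_word2idx and the weight shapes are hypothetical placeholders) of copying the learned rows for overlapping tokens into a freshly initialized, smaller weight matrix:

import torch

def transfer_overlapping_rows(old_weight, old_word2idx, new_word2idx):
    # old_weight: tensor of shape (len(old_word2idx), d_model) taken from the checkpoint,
    # e.g. the rows of the old embedding table or the old prediction head.
    d_model = old_weight.shape[1]
    new_weight = torch.empty(len(new_word2idx), d_model)
    torch.nn.init.normal_(new_weight, std=0.02)           # fresh init for non-overlapping tokens
    for token, new_idx in new_word2idx.items():
        old_idx = old_word2idx.get(token)
        if old_idx is not None:                           # token exists in both vocabularies
            new_weight[new_idx] = old_weight[old_idx]     # reuse the learned row
    return new_weight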

That being said, in general I would not suggest extending the number of labels to 10,000 for the sole purpose of loading the model; keep parts of the current vocabulary (both the embeddings and the head) only if overlaps between the set of 126 and the 10,000 classes actually exist.


In the next steps, I'll assume scenario A for simplicity

For scenario A, you would like to preserve all the weights in rf_model.pth except self.vocab_linear and self.out_embedder, which were built for 10,000 classes, and replace them with a new head and embeddings for 126 classes.

The commands in the README are configured for the alternating end-to-end and non-end-to-end training described in the paper, so they need to be changed a little bit to adapt them to this scenario...

The commands that are showcased in Step 3 have the effect of loading the weights of the backbone and fusion model from two different checkpoints, which is the reason why you find the distinction between backbone_save_path and body_save_path:

python train.py --N_enc 3 --N_dec 3  \
    --model_dim 512 --optim_type radam --seed 775533   --sched_type custom_warmup_anneal  \
    --warmup 1 --lr 3e-5 --anneal_coeff 0.55 --anneal_every_epoch 1 --enc_drop 0.3 \
    --dec_drop 0.3 --enc_input_drop 0.3 --dec_input_drop 0.3 --drop_other 0.3  \
    --batch_size 16 --num_accum 3 --num_gpus 1 --ddp_sync_port 11317 --eval_beam_sizes [3]  \
    --save_path ./github_ignore_material/saves/ --save_every_minutes 60 --how_many_checkpoints 1  \
    --is_end_to_end True --images_path ./github_ignore_material/raw_data/MS_COCO_2014/ --partial_load True \
    --backbone_save_path ./github_ignore_material/raw_data/swin_large_patch4_window12_384_22k.pth \
    --body_save_path ./github_ignore_material/saves/phase2_checkpoint \
    --print_every_iter 15000 --eval_every_iter 999999 \
    --reinforce False --num_epochs 2 &> output_file.txt &

Since, unlike the original intention, we now want to load an end-to-end checkpoint (rf_model.pth), we can use the backbone_save_path argument:

python train.py --N_enc 3 --N_dec 3  \
    --model_dim 512 --optim_type radam --seed 775533   --sched_type custom_warmup_anneal  \
    --warmup 1 --lr 3e-5 --anneal_coeff 0.55 --anneal_every_epoch 1 --enc_drop 0.3 \
    --dec_drop 0.3 --enc_input_drop 0.3 --dec_input_drop 0.3 --drop_other 0.3  \
    --batch_size 16 --num_accum 3 --num_gpus 1 --ddp_sync_port 11317 --eval_beam_sizes [3]  \
    --save_path ./github_ignore_material/saves/ --save_every_minutes 60 --how_many_checkpoints 1  \
    --is_end_to_end True --images_path ./github_ignore_material/raw_data/MS_COCO_2014/ \
    --backbone_save_path <PATH_TO_YOUR_RF_MODEL.PTH> \
    --print_every_iter 15000 --eval_every_iter 999999 \
    --reinforce False --num_epochs 2 &> output_file.txt &

(Basically, I removed --partial_load and --body_save_path, and only --save_path and --backbone_save_path remain. Obvious remark: this is copied from the readme, the other configurations are up to your custom dataset :-))

Basically, despite what the name suggests, backbone_save_path actually loads the entire end-to-end model, and then, if body_save_path is specified, it loads the weights contained in body_save_path on top of it. So we can leverage backbone_save_path to load rf_model.pth, which is end-to-end, and set body_save_path to an empty string.

In the train.py file, you can change this part

        if train_args.is_end_to_end:
            map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
            checkpoint = torch.load(path_args.backbone_save_path, map_location=map_location)
            if 'model' in checkpoint.keys():
                partially_load_state_dict(model.swin_transf, checkpoint['model'])
            elif 'model_state_dict' in checkpoint.keys():
                partially_load_state_dict(model, checkpoint['model_state_dict'])
            print("Backbone loaded...", end=' ')
            map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
            checkpoint = torch.load(path_args.body_save_path, map_location=map_location)
            partially_load_state_dict(model, checkpoint['model_state_dict'])
            print("Body loaded")

into

        if train_args.is_end_to_end:
            map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
            checkpoint = torch.load(path_args.backbone_save_path, map_location=map_location)
            if 'model' in checkpoint.keys():
                partially_load_state_dict(model.swin_transf, checkpoint['model'])
                print("Backbone loaded...")
            elif 'model_state_dict' in checkpoint.keys():
                partially_load_state_dict(model, checkpoint['model_state_dict'])
                print("Entire model backbone loaded...")

            if path_args.body_save_path != '':  # <--- changed here: load the body only if a path is given
                map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
                checkpoint = torch.load(path_args.body_save_path, map_location=map_location)
                partially_load_state_dict(model, checkpoint['model_state_dict'])
                print("Body also loaded")

This way it no longer unconditionally loads body_save_path as well, which supports the command I showcased above.

Now, to make everything work in your case, where you also want to change the vocabulary/embeddings of the model (from 10,000 to 126), there are multiple options:

1) Go inside End_ExpansionNet_v2.py and rename self.vocab_linear to self.custom_vocab_linear (the name is up to you) and self.out_embedder to self.custom_out_embedder. Since the partial loading stage looks for identical names between the checkpoint file and the new instance of the model, the old weights (made for 10,000 classes) will be discarded, and you can use your freshly created weights (for 126 classes) on top of the other trained weights.

2) Add a custom condition in the saving utils that specifically skips the vocab_linear and out_embedder modules, like the code I partially showed above.

This can be error-prone if you modify the existing partial load function as I did above (sorry), since we don't want to always discard the vocabulary and embeddings across all the different training stages; it might create undesired behaviors. So it can be better to create a new argument like '--pretrained_model_save_path' which performs the custom partial load described in this step without changing the existing function (so it does not break the code for the standard training phases).
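For reference, such a loader could look like the sketch below. It is only an illustration, not code from the repository: it mirrors the partially_load_state_dict shown earlier but takes the prefixes to skip as an argument, so the standard function used by the other training stages stays untouched. A hypothetical --pretrained_model_save_path branch in train.py would then call this instead of the standard one.

from torch.nn import Parameter

def partially_load_state_dict_skipping(model, state_dict,
                                       skip_prefixes=('vocab_linear', 'out_embedder'),
                                       verbose=False):
    own_state = model.state_dict()
    for name, param in state_dict.items():
        if name.startswith(skip_prefixes):     # drop the old vocabulary head and embeddings
            continue
        if name not in own_state:
            if verbose:
                print("Not found: " + str(name))
            continue
        if isinstance(param, Parameter):
            param = param.data
        own_state[name].copy_(param)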

Let's assume you go with the first option, which is the quickest and safest. Once you have renamed the existing modules to self.custom_vocab_linear and self.custom_out_embedder, make sure during construction that the new number of classes is used:

model = End_ExpansionNet_v2(...,
                            output_word2idx=your_dataset.caption_word2idx_dict,   # dict made of 126 entries
                            output_idx2word=your_dataset.caption_idx2word_list,   # list made of 126 entries
                            ...,
                            rank=rank)

Now, with the command above, since the names are different, the weights associated with the 10,000 classes in the old checkpoint are discarded during the loading stage, and you should be good to go.
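If you want to double-check that the skipping actually happens, a simple way (assuming the call site in train.py shown earlier) is to turn on the verbose flag of the existing loader:

checkpoint = torch.load(path_args.backbone_save_path, map_location=map_location)
# With the renamed modules, the old entries no longer match any name in the model, so they
# should be reported as "Not found: vocab_linear..." / "Not found: out_embedder..." and skipped.
partially_load_state_dict(model, checkpoint['model_state_dict'], verbose=True, max_num_print=50)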


In conclusion: rename the two modules, instantiate the model with your 126-class vocabulary, and load rf_model.pth through --backbone_save_path with the small train.py change above.

Let's keep in touch,
Best

JFcy commented 3 months ago

As you suggested for scenario A, I changed self.vocab_linear to self.custom_vocab_linear and self.out_embedder to self.custom_out_embedder, but when I run the code the error changes to the one shown in the attached image.

JFcy commented 3 months ago

I also tried option 2, skipping the vocab_linear and out_embedder modules like this:

    if name.startswith('vocab_linear') or name.startswith('out_embedder'):
        continue

but it didn't work either. I printed the names in state_dict.items():

    for name, param in state_dict.items():
        if name.startswith('vocab_linear') or name.startswith('out_embedder'):
            continue
        print("++++++++++", name)

The result is shown in the attached image. It seems like the pos_encoder is mismatched.

jchenghu commented 3 months ago

Hi, yeah, in both cases the issue seems to lie in the positional encoder: it is currently implemented with additional embeddings instead of the classic parameterless positional encodings. I think the most straightforward solution is to artificially set the maximum sequence length to 74 (the original one) instead of 58 (the one of your custom dataset) during construction:

model = End_ExpansionNet_v2( ...
                                  max_seq_len=74 
                                   )

Since the maximum sequence length in your custom dataset is smaller than the original one, this should have no consequences.
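If you want to confirm which length the checkpoint expects, a quick check (assuming the end-to-end checkpoint stores its weights under 'model_state_dict' and the positional embedding under 'pos_encoder.weight', matching self.pos_encoder = nn.Embedding(max_seq_len, d_model)) is:

import torch

state = torch.load('rf_model.pth', map_location='cpu')['model_state_dict']
print(state['pos_encoder.weight'].shape)  # first dimension is the max_seq_len the checkpoint was trained with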

Let me know if it works

JFcy commented 3 months ago

I'm sorry for the delayed response. While running the code, I encountered numerous small issues, which took a considerable amount of time to resolve. Ultimately, I still faced the following problems:

1. cuda out of memory

Firstly, I encountered a "CUDA out of memory" error; the attached image shows the status of my GPUs. Previously, I tried adjusting the training batch size and using CUDA_VISIBLE_DEVICES=3 to run the code, which allowed the entire process to run successfully. However, after modifying my code according to your suggestions and running it again with CUDA_VISIBLE_DEVICES=3, the code always stops at checkpoint = torch.load(path_args.backbone_save_path, map_location=map_location). I haven't been able to identify the cause of this issue. I wonder if you might have any suggestions.

2. Change the head of the model.

In the course of my research, I also tried to follow your initial suggestion to modify the model's head by changing linear(512, num_classes) to linear(512, 126). I made the following change to the models/End_ExpansionNet_v2.py and models/ExpansionNet_v2.py files:

    # self.custom_vocab_linear = torch.nn.Linear(d_model, len(output_word2idx))
    self.custom_vocab_linear = torch.nn.Linear(d_model, 126)

However, I encountered a "cuda error: device-side assert triggered" issue. Does this mean I should modify other parts of the code besides linear()? Do you have any suggestions for this?

Thank you for your patience and help.

jchenghu commented 3 months ago

Hi,

About the first issue: it is indeed strange. If the code was running fine with multiple GPUs before the modification, then, theoretically speaking, since my suggestions were aimed at reducing the current model size, it should not cause issues during torch.load in particular...

Let's try a quick fix: can you replace map_location with map_location='cuda:%d' % rank and let me know if it works? It might be a bug in my code.
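Concretely, that means changing the loading line in train.py to something like the following (both forms are standard torch.load arguments: the dict form remaps storages saved on cuda:0 to cuda:<rank>, while the string form sends every storage to that device directly):

# map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}   # current dict form
map_location = 'cuda:%d' % rank                       # suggested string form
checkpoint = torch.load(path_args.backbone_save_path, map_location=map_location)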

About the second point, I would suggest you preserve len(output_word2idx) instead of 126, just to make sure you're not leaving out some special tokens in the dictionary, but I agree this should not have been the issue given the previous output you showed me.

I'll paste here what I had in mind

In the constructor

class End_ExpansionNet_v2(CaptioningModel):
    def __init__(self,
                 ...
                 ):
        super(End_ExpansionNet_v2, self).__init__()

        self.swin_transf = ...
        ...

        self.input_embedder_dropout = nn.Dropout(drop_args.enc_input)
        self.input_linear = torch.nn.Linear(final_swin_dim, d_model)

        self.custom_vocab_linear = torch.nn.Linear(d_model, len(output_word2idx))   # <-- add the custom_ prefix here

        self.log_softmax = nn.LogSoftmax(dim=-1)

        self.out_enc_dropout = nn.Dropout(drop_args.other)
        self.out_dec_dropout = nn.Dropout(drop_args.other)

        self.custom_out_embedder = EmbeddingLayer(len(output_word2idx), d_model, drop_args.dec_input)   # <-- add the custom_ prefix here

        self.pos_encoder = nn.Embedding(max_seq_len, d_model)   # <-- do not change here

        ...

Replicate the edit in forward_dec:

    def forward_dec(...):
        ...
        y = self.out_embedder(dec_input)   # --> y = self.custom_out_embedder(dec_input)
        ...
        y = self.vocab_linear(y)           # --> y = self.custom_vocab_linear(y)

Also, make sure to manually set the length to 76 during instantiation, for the reasons previously discussed:

if train_args.is_end_to_end:
    from models.End_ExpansionNet_v2 import End_ExpansionNet_v2
    model = End_ExpansionNet_v2(
                                ...
                                max_seq_len=76,   # <--- here
                                ...
                                )

About the reason why the device-side assert triggers: if I have to guess, at some point the data asked the embeddings for a higher index than expected. If you created the custom dataset based on the current dataset and data loaders, try using torch.nn.Linear(d_model, len(output_word2idx))... If you already did and it did not work, I'd like to see the complete error output of the device-side assert.
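As a quick sanity check for that guess (the names below are hypothetical placeholders for however your dataset stores its tokenized captions), you can compare the largest token index against the vocabulary size before training; alternatively, running one forward pass on CPU usually turns the opaque CUDA assert into a readable IndexError:

vocab_size = len(your_dataset.caption_word2idx_dict)

# your_dataset.tokenized_captions is a placeholder: any iterable of token-index sequences works
max_idx = max(max(seq) for seq in your_dataset.tokenized_captions)

assert max_idx < vocab_size, \
    "token index %d is out of range for a vocabulary of size %d" % (max_idx, vocab_size)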

Let me know if it helps :-)

JFcy commented 3 months ago

I changed the code, and changed the 126 to len(output_word2idx):

        if train_args.is_end_to_end:
            # map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
            map_location = {'cuda:%d' % rank}
            checkpoint = torch.load(path_args.backbone_save_path, map_location=map_location)
            if 'model' in checkpoint.keys():
                partially_load_state_dict(model.swin_transf, checkpoint['model'])
                print("Backbone loaded...")
            elif 'model_state_dict' in checkpoint.keys():
                partially_load_state_dict(model, checkpoint['model_state_dict'])
                print("Entire model backbone loaded...")
            if path_args.body_save_path != '':
                map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
                checkpoint = torch.load(path_args.body_save_path, map_location=map_location)
                partially_load_state_dict(model, checkpoint['model_state_dict'])
                print("Body also loaded")
        else:
            if train_args.partial_load:
                # map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
                map_location = {'cuda:%d' % rank}
                checkpoint = torch.load(path_args.body_save_path, map_location=map_location)
                partially_load_state_dict(model, checkpoint['model_state_dict'])
                print("Partial load done.")

The output written to output_file.txt is shown in the attached image. It has been about 5 minutes and it is still in this state; just as I said before, it keeps stopping at this position.

jchenghu commented 3 months ago

Mmh... it says num_gpus: 1; try setting the '--num_gpus 4' argument.

JFcy commented 3 months ago

I finally managed to train normally by modifying the caption_idx2word_list parameter to match the content of the COCO dataset. However, the results shown in the attached picture occurred: the results in the upper box are from my training, and the results below are from the originally pre-trained model. Regarding the large number of repeated words appearing in the captions, I am not sure whether this is caused by a certain parameter, or because the amount of data I trained on is too small. I would like to know if you have any suggestions.

jchenghu commented 3 months ago

Hi,

I'm glad you worked it out, although it should work even with a different set of tokens in the vocabulary. I have a question at this point, but feel free not to disclose details about your work if you don't want to: the previous images showcased about 200 tokens in your custom problem. Were they a subset of the current dictionary, or were they completely different classes?

This detail might be important for understanding the behaviour. My deduction is that you were curious about the result when your custom dataset is fed directly into the pre-trained model, omitting, for the moment, the different vocabulary. In this case, the network seems to be quite confused about termination.

I can't tell with certainty what the issue is here, since I don't have information about your custom data, but I'll try to provide you with some hints:

I hope it helps; let me know your progress if you want.

Best regards,
Jia Cheng

JFcy commented 3 months ago

Hello, recently I annotated over 2,000 images and tried to go through the complete training process. The training went smoothly. However, when I ran demo.py, the error shown in the attached image occurred. How should I deal with this issue? 380 is the length of the vocabulary in my dataset.

jchenghu commented 3 months ago

Hi!

I'm glad the training went smoothly!

To test your model with the demo: since the demo loads the old vocabularies through pickle (to avoid asking users to download COCO), you can do something like this in the training file, for instance:


    import pickle  # needed for the dump below

    coco_dataset = CocoDatasetKarpathy(
        images_path=path_args.images_path,
        coco_annotations_path=path_args.captions_path + "dataset_coco.json",
        preproc_images_hdf5_filepath=path_args.preproc_images_hdf5_filepath if train_args.is_end_to_end else None,
        precalc_features_hdf5_filepath=None if train_args.is_end_to_end else path_args.features_path,
        limited_num_train_images=None,
        limited_num_val_images=5000)
    # replace coco_dataset with your dataset, I'm using it as an example

    # - - - - - - - relevant code here
    with open(<your_pickle_save_path>, 'wb') as f:
        pickle.dump({'word2idx_dict': coco_dataset.caption_word2idx_dict,
                     'idx2word_list': coco_dataset.caption_idx2word_list,
                     'sos_str': coco_dataset.get_sos_token_str(),
                     'eos_str': coco_dataset.get_eos_token_str()},
                    f)

    exit(-1)  # just exit
    # - - - - - - -

    spawn_train_processes(...)

Then in the demo file

    with open('./demo_material/demo_coco_tokens.pickle', 'rb') as f:   # <-- change this string to <your_pickle_save_path>
        coco_tokens = pickle.load(f)
        sos_idx = coco_tokens['word2idx_dict'][coco_tokens['sos_str']]
        eos_idx = coco_tokens['word2idx_dict'][coco_tokens['eos_str']]

jchenghu commented 1 month ago

Hi, I'm assuming the issue was solved since it's been two months; feel free to re-open it if you need to.