Hi
Regarding the fine-tuning part, I assume you want to change the head of the model (the last prediction layer, of size linear(512, num_classes)) and replace it with a linear(512, 126), where 126 refers to the number of classes of your custom dataset.
You can use the stage 3 checkpoint for faster testing if your input consists of the COCO images, since the stage 3 checkpoint expects the backbone features as input. If that's the case, good.
If your custom dataset not only has a different number of classes but is also not part of COCO, you might need to update the stage 4 model, which is the complete model, comprising the backbone and the fusion model.
Regarding the output error described by your image: 10,000 is the initial vocabulary size and 126 seems to be your custom number of classes. To fix this error, I suggest adding something like
from torch.nn.parameter import Parameter

def partially_load_state_dict(model, state_dict, verbose=False, max_num_print=5):
    own_state = model.state_dict()
    count_print = 0
    for name, param in state_dict.items():
        if name.startswith('vocab_linear'):  # <-- this bit: skip the vocabulary projection
            continue
        if name not in own_state:
            if verbose:
                print("Not found: " + str(name))
            continue
        if isinstance(param, Parameter):
            param = param.data
        own_state[name].copy_(param)
        if verbose:
            if count_print < max_num_print:
                print("Found: " + str(name))
                count_print += 1
in the saving utils function, which basically skips the loading of the vocabulary projection at the end and should fix your issue
Let me know if it helps,
I apologize for any misunderstandings my previous statements may have caused. I intended to conduct further training on the model you had already trained, using my own constructed dataset to see what kind of results could be achieved. Therefore, I started training the model you had trained according to the third step in the project's readme, changing the limited_num_val_images value in the dataset to 3, and setting the phase2_checkpoint to the rf_model.pth that you had trained. An error as shown in the above image occurred during the training process. Indeed, as you said, 126 is the number of categories in the dataset I have constructed. Moreover, just now when I tried to modify the content in the partially_load_state_dict function as you suggested, the same error still occurred.
If I intend to further train your model, do I need to ensure that the vocabulary size of my dataset reaches 10,000? Additionally, if I want to continue training your well-trained model by changing phase2_checkpoint to rf_model.pth in step 3 and then following the project's readme, what issues should I pay attention to?
Hi,
Sorry If I did not understand your intentions, let me recap a little bit,
You would like to perform fine-tuning, in the form of end-to-end Cross-Entropy training (since you start from stage 3) on top of the rf_model.pth
which already is an end-to-end checkpoint (contains weights of both backbone and fusion model) with your custom dataset, which consists of 126 labels.
Is that correct?
About the best way to do it, I think it depends on the nature of these 126 labels
Scenario A: if there is no overlap between these 126 labels and the 10,000 that are currently used, then I think you should just get rid of the existing 10,000-class head and put in a newly initialized one of 126 classes, to be fine-tuned with additional training steps.
Beware however, that if no overlap exists, the weights of the decoder are not helpful anymore, since we are not only discarding the head (as in a typical LLM fashion) but also the input embeddings.
Scenario B: if overlap exists between the 126 labels and these 10,000 classes (maybe not the same tokens, but semantically equivalent ones?), you can design some mapping, or simply discard the ones that are not considered in your domain and preserve the others. This makes use of the relationships already learned by the current model.
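For illustration, a minimal sketch of how such a mapping could copy the overlapping rows of the head and the embeddings; transfer_overlapping_rows is a hypothetical helper and the state-dict key names are assumptions that depend on the actual model definition:

import torch

def transfer_overlapping_rows(old_weight, new_weight, old_word2idx, new_word2idx):
    # Copy the rows of an old (10,000-class) weight matrix into a new, smaller one
    # for every token that appears in both vocabularies. Purely illustrative.
    with torch.no_grad():
        for word, new_idx in new_word2idx.items():
            if word in old_word2idx:
                new_weight[new_idx].copy_(old_weight[old_word2idx[word]])

# hypothetical usage (check the key names in your own checkpoint):
# old_sd = torch.load('rf_model.pth', map_location='cpu')['model_state_dict']
# transfer_overlapping_rows(old_sd['vocab_linear.weight'],
#                           model.vocab_linear.weight.data,
#                           old_word2idx, new_word2idx)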
That being said, I would not suggest extending the number of labels to 10,000 for the sole purpose of loading the model; keep parts of the current vocabulary (both the embeddings and the head) only if overlaps between the set of 126 and the 10,000 classes actually exist.
In the next steps, I'll assume scenario A for simplicity
To do scenario A, you would like to preserve all the weights in rf_model.pth except self.vocab_linear and self.out_embedder, which were previously built for 10,000 classes, and replace them with a new head and new embeddings for 126 classes.
So, the commands in the README are configured for the alternate end-to-end and non-end-to-end training described in the paper, and they might need to be changed a little bit to adapt them to this scenario...
The commands showcased in Step 3 have the effect of loading the weights of the backbone and of the fusion model from two different checkpoints, which is the reason why you find the distinction between backbone_save_path and body_save_path:
python train.py --N_enc 3 --N_dec 3 \
--model_dim 512 --optim_type radam --seed 775533 --sched_type custom_warmup_anneal \
--warmup 1 --lr 3e-5 --anneal_coeff 0.55 --anneal_every_epoch 1 --enc_drop 0.3 \
--dec_drop 0.3 --enc_input_drop 0.3 --dec_input_drop 0.3 --drop_other 0.3 \
--batch_size 16 --num_accum 3 --num_gpus 1 --ddp_sync_port 11317 --eval_beam_sizes [3] \
--save_path ./github_ignore_material/saves/ --save_every_minutes 60 --how_many_checkpoints 1 \
--is_end_to_end True --images_path ./github_ignore_material/raw_data/MS_COCO_2014/ --partial_load True \
--backbone_save_path ./github_ignore_material/raw_data/swin_large_patch4_window12_384_22k.pth \
--body_save_path ./github_ignore_material/saves/phase2_checkpoint \
--print_every_iter 15000 --eval_every_iter 999999 \
--reinforce False --num_epochs 2 &> output_file.txt &
Since, compared to the original intention, we want to load an end-to-end checkpoint (rf_model.pth), we can use the backbone_save_path argument:
python train.py --N_enc 3 --N_dec 3 \
--model_dim 512 --optim_type radam --seed 775533 --sched_type custom_warmup_anneal \
--warmup 1 --lr 3e-5 --anneal_coeff 0.55 --anneal_every_epoch 1 --enc_drop 0.3 \
--dec_drop 0.3 --enc_input_drop 0.3 --dec_input_drop 0.3 --drop_other 0.3 \
--batch_size 16 --num_accum 3 --num_gpus 1 --ddp_sync_port 11317 --eval_beam_sizes [3] \
--save_path ./github_ignore_material/saves/ --save_every_minutes 60 --how_many_checkpoints 1 \
--is_end_to_end True --images_path ./github_ignore_material/raw_data/MS_COCO_2014/ \
--backbone_save_path <PATH_TO_YOUR_RF_MODEL.PTH> \
--print_every_iter 15000 --eval_every_iter 999999 \
--reinforce False --num_epochs 2 &> output_file.txt &
(Basically I removed --partial_load and --body_save_path, and only --save_path and --backbone_save_path remain)
(obvious remark: this is copied from the readme, other configurations are up to your custom dataset :-))
Basically, despite what the name suggests, backbone_save_path actually loads the entire end-to-end model, and then, if body_save_path is specified, the weights contained in body_save_path are loaded on top of it. So we can leverage backbone_save_path to load rf_model.pth, which is end-to-end, and set body_save_path to an empty string.
In the train.py file, you can change this part
if train_args.is_end_to_end:
    map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
    checkpoint = torch.load(path_args.backbone_save_path, map_location=map_location)
    if 'model' in checkpoint.keys():
        partially_load_state_dict(model.swin_transf, checkpoint['model'])
    elif 'model_state_dict' in checkpoint.keys():
        partially_load_state_dict(model, checkpoint['model_state_dict'])
    print("Backbone loaded...", end=' ')
    map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
    checkpoint = torch.load(path_args.body_save_path, map_location=map_location)
    partially_load_state_dict(model, checkpoint['model_state_dict'])
    print("Body loaded")
into
if train_args.is_end_to_end:
    map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
    checkpoint = torch.load(path_args.backbone_save_path, map_location=map_location)
    if 'model' in checkpoint.keys():
        partially_load_state_dict(model.swin_transf, checkpoint['model'])
        print("Backbone loaded...")
    elif 'model_state_dict' in checkpoint.keys():
        partially_load_state_dict(model, checkpoint['model_state_dict'])
        print("Entire model backbone loaded...")
    # v ---------------------------------------------------------------- Changed here
    if path_args.body_save_path != '':
        map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
        checkpoint = torch.load(path_args.body_save_path, map_location=map_location)
        partially_load_state_dict(model, checkpoint['model_state_dict'])
        print("Body also loaded")
So it no longer necessarily loads the body_save_path checkpoint as well, which supports the command I showcased above.
Now, to make everything work in your case, where you want also to change the vocabulary/embeddings of the model (from 10000 to 126), there are multiple options:
1) Go inside End_ExpansionNet_v2.py and rename self.vocab_linear to self.custom_vocab_linear (up to you which name to assign) and self.out_embedder to self.custom_out_embedder. In this way, since the partial loading stage looks for all the identical names between the checkpoint file and the new instance of the model, the old weights (made of 10,000 classes) will be discarded, and you can use your freshly created weights (of 126 classes) on top of the other trained weights.
2) Make a custom condition in the saving utils where you specifically skip the vocab_linear and out_embedder modules, like the code I partially showed above.
This can be error-prone if you modify the existing partial_load function as I did above (sorry) since we don't want to always discard the vocabulary and embeddings across all different training stages. It might create undesired behaviors. So it can be better to create a new argument like '--pretrained_model_save_path' which performs the custom partial_load described in this step without changing the existing one (so it does not break the code for the standard training phases).
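For illustration, a minimal sketch of such a separate, skip-aware loader; the function name, the skip_prefixes parameter and the '--pretrained_model_save_path' wiring are my own assumptions, not something already present in the repository:

def partially_load_state_dict_skipping(model, state_dict,
                                       skip_prefixes=('vocab_linear', 'out_embedder'),
                                       verbose=False):
    # Same spirit as partially_load_state_dict, but skips the vocabulary head and the
    # output embedder, so the original function stays untouched for the standard phases.
    own_state = model.state_dict()
    for name, param in state_dict.items():
        if name.startswith(skip_prefixes):
            continue
        if name not in own_state:
            if verbose:
                print("Not found: " + str(name))
            continue
        own_state[name].copy_(param.data if hasattr(param, 'data') else param)

# hypothetical wiring in train.py, guarded by the new argument:
# if path_args.pretrained_model_save_path != '':
#     checkpoint = torch.load(path_args.pretrained_model_save_path, map_location=map_location)
#     partially_load_state_dict_skipping(model, checkpoint['model_state_dict'])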
Let's assume you go with the first one, which is the quickest and safest. Once you have renamed the existing structures to self.custom_vocab_linear and self.custom_out_embedder, make sure during the creation that the new number of classes is different:
model = End_ExpansionNet_v2(...,
    output_word2idx=your_dataset.caption_word2idx_dict,
    output_idx2word=your_dataset.caption_idx2word_list,
    # ^^^^ your_dataset dictionary / list made of 126 entries
    ...,
    rank=rank)
Now, with the command above, during the loading stage, since the names are now different, the weights associated with the 10000 classes from the old checkpoints are discarded, and you should be good to go.
In conclusion:
Sorry for the long response, I wanted to give you a complete perspective of the problem so you can choose the most suitable solution;
I can't test the code at the moment, so before moving on to the next stages let me know if the stage 3 end-to-end cross-entropy training works with the changes I suggested here
About best practices, I honestly don't know what the impact of changing the classes into completely different ones would be. It is probably best to preserve at least a portion of the elements in the vocabulary or embedding (maybe through some mapping if tokens are not exactly the same), but in case this is not possible, I believe the encoder weights will still be useful; I'm not certain about the decoder's weights, since the input embedders are also different... I'm interested in your results, please let me know :-)
Let's keep in touch, Best
As you suggested in scenario A, I changed self.vocab_linear to self.custom_vocab_linear and self.out_embedder to self.custom_out_embedder, but when I run the code the error has changed.
Also I tried option 2, skipping the vocab_linear and out_embedder modules like:
if name.startswith('vocab_linear') or name.startswith('out_embedder'):
    continue
but it didn't work either. I printed the names from state_dict.items():
for name, param in state_dict.items():
    if name.startswith('vocab_linear') or name.startswith('out_embedder'):
        continue
    print("++++++++++", name)
and the result is:
It seems like the pos_encoder is mismatched.
Hi, yeah, in both cases the issue seems to lie in the positional encoder; it is currently implemented with additional embeddings instead of the classic parameterless positional encoding. I think the most straightforward solution would be to artificially set the maximum sequence length to 74 instead of 58 (the one of your custom dataset) during the construction:
model = End_ExpansionNet_v2(...,
    max_seq_len=74,
    ...)
Since the maximum sequence length in your custom dataset is smaller than the original one, there should be no consequences.
Let me know if it works
I'm sorry for the delayed response. While running the code, I encountered numerous small issues, which took a considerable amount of time to resolve. Ultimately, I still faced the following problems:
Firstly, I encountered a "cuda out of memory" error. Here is the status of my GPU.
Previously, I tried adjusting the train batch size and using CUDA_VISIBLE_DEVICES=3 to run the code, which allowed the entire process to run successfully. However, after modifying my code according to your suggestions and running it again using CUDA_VISIBLE_DEVICES=3, the code always stops at
checkpoint = torch.load(path_args.backbone_save_path, map_location=map_location).
I haven't been able to identify the cause of this issue. I wonder if you might have any suggestions.
In the course of my research, I also tried to follow your initial suggestion to modify the model's head in the original version by changing linear(512, num_classes) to linear(512, 126). I made the following changes to the models.End_ExpansionNet_v2.py and models.ExpansionNet_v2.py files.
# self.custom_vocab_linear = torch.nn.Linear(d_model, len(output_word2idx))
self.custom_vocab_linear = torch.nn.Linear(d_model, 126)
However, I encountered a "cuda error: device-side assert triggered" issue. Does this mean I should modify other parts of the code besides linear()? Do you have any suggestions for this?
Thank you for your patience and help.
Hi,
About the first issue, it is indeed strange: if the code was running fine on multiple GPUs before the modification, then, theoretically speaking, since my suggestions were aimed at reducing the current model size, it should not cause issues during torch.load in particular...
Let's try this quick fix: can you try replacing map_location with map_location='cuda:%d' % rank and let me know if it works? It might be a bug in my code.
About the second point, I would suggest you preserve len(output_word2idx) instead of 126, just to make sure you're not leaving out some special tokens in the dictionary, but I agree this should not have been the issue given the previous output you showed me.
I'll paste here what I had in mind
In the constructor
class End_ExpansionNet_v2(CaptioningModel):
    def __init__(self,
                 ...
                 ):
        super(End_ExpansionNet_v2, self).__init__()
        self.swin_transf = ...
        ...
        self.input_embedder_dropout = nn.Dropout(drop_args.enc_input)
        self.input_linear = torch.nn.Linear(final_swin_dim, d_model)
        self.custom_vocab_linear = torch.nn.Linear(d_model, len(output_word2idx))
        # ^^^ add prefix custom_ here
        self.log_softmax = nn.LogSoftmax(dim=-1)
        self.out_enc_dropout = nn.Dropout(drop_args.other)
        self.out_dec_dropout = nn.Dropout(drop_args.other)
        self.custom_out_embedder = EmbeddingLayer(len(output_word2idx), d_model, drop_args.dec_input)
        # ^^^^ add prefix custom_ here
        self.pos_encoder = nn.Embedding(max_seq_len, d_model)
        # ^^^^ do not change here
        ...
Then replicate the edit in the decoder's forward method, forward_dec:
def forward_dec(...):
    ...
    y = self.out_embedder(dec_input)   -->   y = self.custom_out_embedder(dec_input)
    ...
    y = self.vocab_linear(y)   -->   y = self.custom_vocab_linear(y)
Also, make sure to manually set the length to 76 during the instantiation, for reasons previously discussed
if train_args.is_end_to_end:
    from models.End_ExpansionNet_v2 import End_ExpansionNet_v2
    model = End_ExpansionNet_v2(
        ...
        max_seq_len=76,
        # ^^^ <--- here
        ...
    )
About the reason why the device-side assert triggers: if I have to guess, at some point the data asked the embeddings for a higher index than expected. If you created the custom dataset based on the current dataset and data loaders, try using torch.nn.Linear(d_model, len(output_word2idx))... If you already did and it did not work, I'd like to see the complete error message of the device assert trigger.
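As a quick, hypothetical sanity check (caption_word2idx_dict follows the earlier snippets, while all_encoded_captions is just a placeholder for your tokenized captions), you could verify before training that no caption contains an index outside the new vocabulary:

# minimal sanity check, assuming captions are already converted to lists of indices
vocab_size = len(your_dataset.caption_word2idx_dict)   # must match the Linear / Embedding size
max_idx = max(max(caption) for caption in all_encoded_captions)
assert max_idx < vocab_size, \
    "Found token index %d but the embedding only has %d entries" % (max_idx, vocab_size)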
Let me know if it helps :-)
I have changed the code and changed the 126 to len(output_word2idx):
if train_args.is_end_to_end:
    # map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
    map_location = {'cuda:%d' % rank}
    checkpoint = torch.load(path_args.backbone_save_path, map_location=map_location)
    if 'model' in checkpoint.keys():
        partially_load_state_dict(model.swin_transf, checkpoint['model'])
        print("Backbone loaded...")
    elif 'model_state_dict' in checkpoint.keys():
        partially_load_state_dict(model, checkpoint['model_state_dict'])
        print("Entire model backbone loaded...")
    if path_args.body_save_path != '':
        map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
        checkpoint = torch.load(path_args.body_save_path, map_location=map_location)
        partially_load_state_dict(model, checkpoint['model_state_dict'])
        print("Body also loaded")
else:
    if train_args.partial_load:
        # map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
        map_location = {'cuda:%d' % rank}
        checkpoint = torch.load(path_args.body_save_path, map_location=map_location)
        partially_load_state_dict(model, checkpoint['model_state_dict'])
        print("Partial load done.")
and the output shown in output.txt: it has been about 5 minutes and it is still in this state; just as I said before, it keeps stopping at this position.
Mmh... it says num_gpus: 1. Try setting the '--num_gpus 4' argument.
I finally managed to train normally by modifying the caption_idx2word_list parameter to be the same as the content of the COCO dataset. However, the following results occurred. As shown in the picture, the results in the upper box are from my training, and the results below are from the originally pre-trained model. Regarding the large number of repeated words appearing in the captions, I am not sure whether it is caused by a certain parameter or by the fact that the amount of data I trained on is too small. I would like to know if you have any valuable suggestions.
Hi,
I'm glad you worked it out, although it should work even with a different set of tokens in the vocabulary. I have a curiosity at this point, but feel free to not disclose details about your work if you don't want to: the previous images showcased about 200 tokens in your custom problem. Were they a subset of the current dictionary? Or were they completely different classes?
This detail might be important in understanding the behaviour. My deduction is that you were curious about the result when your custom dataset is directly fed into the pre-trained model, omitting, for the moment, the different vocabulary. In this case, the network seems to be quite confused about the termination.
I can't tell with certainty what the issue is here, since I don't have information about your custom data, but I'll try to provide you with some hints:
It can be the case that the training set is indeed very poor: 100 images is not much for these models, and I would suggest increasing the number to at least ~1000 if possible (keep in mind, however, that data quality is also a factor and can't be extrapolated from the number of images alone; be sure that the data is balanced and meaningful for your application).
Besides dataset size, the behavior is indeed weird: the model seems to lose the capability of performing correct termination. I'm not sure about the causes, but in my (biased) opinion, data overfitting should not lead to such behavior, so I suggest ensuring the "Eos" token is correctly configured in your training set, and that the target in your loss function properly rewards the termination token (one possible risk, fyi).
If you think the code is fine, then the issue can be related to the way you mapped the old vocabulary to the new one. For instance, the Eos token can be found in the first or last places of the vocabulary; make sure you did not replace it with new custom tokens during the fine-tuning.
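For example, a small check along these lines (relying on the get_sos_token_str / get_eos_token_str accessors that appear later in this thread, with your_dataset as a placeholder) could catch that risk early:

# check that the special tokens survived the vocabulary change
sos_str = your_dataset.get_sos_token_str()
eos_str = your_dataset.get_eos_token_str()
assert sos_str in your_dataset.caption_word2idx_dict, "SOS token missing from the new vocabulary"
assert eos_str in your_dataset.caption_word2idx_dict, "EOS token missing from the new vocabulary"
print("SOS index:", your_dataset.caption_word2idx_dict[sos_str],
      "EOS index:", your_dataset.caption_word2idx_dict[eos_str])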
I hope it helps, let me know your progress if you want. Best regards, Jia Cheng
Hello, recently I annotated over 2000 images and tried to go through the complete training process. The training process went smoothly. However, when I ran demo.py, the following error occurred. How should I deal with this issue? 380 is the length of the vocabulary in my dataset.
Hi!
I'm glad the training went smoothly!
To test your model with the demo: since the demo loads the old vocabulary through pickle (to avoid asking users to download COCO), you can do something like this in the training file, for instance:
import pickle

coco_dataset = CocoDatasetKarpathy(
    images_path=path_args.images_path,
    coco_annotations_path=path_args.captions_path + "dataset_coco.json",
    preproc_images_hdf5_filepath=path_args.preproc_images_hdf5_filepath if train_args.is_end_to_end else None,
    precalc_features_hdf5_filepath=None if train_args.is_end_to_end else path_args.features_path,
    limited_num_train_images=None,
    limited_num_val_images=5000)

# replace coco_dataset with your dataset, I'm using it as an example
# - - - - - - - relevant code here
with open(<your_pickle_save_path>, 'wb') as f:
    pickle.dump({'word2idx_dict': coco_dataset.caption_word2idx_dict,
                 'idx2word_list': coco_dataset.caption_idx2word_list,
                 'sos_str': coco_dataset.get_sos_token_str(),
                 'eos_str': coco_dataset.get_eos_token_str()},
                f)
exit(-1)  # just exit
# - - - - - - -

spawn_train_processes(...)
Then in the demo file
with open('./demo_material/demo_coco_tokens.pickle', 'rb') as f:
    # ^^^ <-- change this string to <your_pickle_save_path>
    coco_tokens = pickle.load(f)
sos_idx = coco_tokens['word2idx_dict'][coco_tokens['sos_str']]
eos_idx = coco_tokens['word2idx_dict'][coco_tokens['eos_str']]
Hi, I'm assuming the issue was solved since it's been two months, feel free to re-open it if you need it
Hello,
I am looking to fine-tune your project recently. Could you please advise at which step I should replace the .pth with the pre-trained model? When I ran the code myself, I started from step three, replacing phase2_checkpoint with the pre-trained model, but encountered the following issues. Could you please offer any solutions?