Hi. The current codebase provided for training/fine-tuning ColPali 1.2 on hard negatives is not executable and raises multiple errors along the way. Running
USE_LOCAL_DATASET=0 python scripts/train/train_colbert.py scripts/configs/pali/train_colpali_docmatix_hardneg_model.yaml
requires the dataloader in load_docmatix_ir_negs to load the Docmatix dataset from HuggingFace as the anchor dataset, which raises the error:
ValueError: Config name is missing. Please pick one among the available configs: ['images', 'pdf', 'zero-shot-exp'] Example of usage: load_dataset('HuggingFaceM4/Docmatix', 'images')
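The error message itself points at the fix: pass a config name to load_dataset. A minimal sketch of that, assuming the images config is the one the training script needs (this is not confirmed by the repository) and using a hypothetical helper name, load_docmatix_anchor, that does not exist in the codebase:

```python
DOCMATIX_REPO = "HuggingFaceM4/Docmatix"
# Config names copied from the ValueError above.
AVAILABLE_CONFIGS = ("images", "pdf", "zero-shot-exp")


def load_docmatix_anchor(config_name: str = "images"):
    """Hypothetical helper: load Docmatix with an explicit config name."""
    if config_name not in AVAILABLE_CONFIGS:
        raise ValueError(
            f"Unknown config {config_name!r}; expected one of {AVAILABLE_CONFIGS}"
        )
    # Imported lazily so the check above works without `datasets` installed.
    from datasets import load_dataset

    # Same call shape as the usage example in the error message.
    return load_dataset(DOCMATIX_REPO, config_name)
```

Calling load_docmatix_anchor() downloads the dataset, so it is only worth running once the config choice has been confirmed.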
If this issue is resolved and the dataset_transformation function is set to load the images subset of the dataset, an error is raised during the initialization of HardNegCollator, which does not accept a tokenizer argument, although one is passed to it in trainer/colmodel_training.py:
TypeError: HardNegCollator.__init__() got an unexpected keyword argument 'tokenizer'
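One way to get past this TypeError without editing the collator is to drop any keyword argument its constructor does not declare. The sketch below is only an illustration: HardNegCollatorStub and its processor and max_length parameters are hypothetical stand-ins, since the real HardNegCollator signature is not shown here, and build_collator is not a function in the codebase.

```python
import inspect


class HardNegCollatorStub:
    """Stand-in for HardNegCollator; parameter names here are assumptions."""

    def __init__(self, processor=None, max_length=2048):
        self.processor = processor
        self.max_length = max_length


def build_collator(collator_cls, **kwargs):
    """Instantiate collator_cls, dropping kwargs its __init__ does not accept."""
    params = inspect.signature(collator_cls.__init__).parameters
    # If __init__ declares **kwargs, every keyword is accepted as-is.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return collator_cls(**kwargs)
    accepted = {k: v for k, v in kwargs.items() if k in params}
    return collator_cls(**accepted)


# The unsupported `tokenizer` keyword is silently dropped.
collator = build_collator(
    HardNegCollatorStub, processor="proc", tokenizer="tok", max_length=512
)
```

Silently dropping arguments can hide real configuration mistakes, so this is a workaround for reproducing the later failure rather than a proper fix.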
After removing the tokenizer argument from the collator's initialization, another error is raised when the collator is called during training. The __call__ function of HardNegCollator is supposed to return the image from an example by accessing its gold_index attribute, which does not exist in the datasets that are loaded (neither docmatix-ir nor Docmatix). I cannot resolve this error myself, since no such attribute exists in the datasets.
Can you please provide the code and the datasets that you used for fine-tuning your model on hard negatives, or help with resolving these issues? If that is not possible, I would appreciate instructions on how to fine-tune your model on a custom dataset of hard negatives.
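To make the gold_index failure concrete, the lookup can be reduced to the sketch below. The function name select_gold_image and the query and images field names are assumptions for illustration; only gold_index comes from the collator code described above.

```python
def select_gold_image(example: dict):
    """Hypothetical reduction of the lookup: pick the positive image via gold_index."""
    if "gold_index" not in example:
        raise KeyError(
            f"example has no 'gold_index' field; available fields: {sorted(example)}"
        )
    return example["images"][example["gold_index"]]


# Toy example with the field present (strings stand in for images).
with_index = {"query": "q", "images": ["neg_a", "pos", "neg_b"], "gold_index": 1}
# Toy example without gold_index, as reported for the loaded datasets.
without_index = {"query": "q", "images": ["img"]}

positive = select_gold_image(with_index)
```

Passing without_index to select_gold_image raises a KeyError, which mirrors the failure seen when the collator is called.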
Thank you for your time!