jxmorris12 / vec2text

utilities for decoding deep representations (like sentence embeddings) back to text

feat: nomic-embed-text-v1 #29

Closed: zanussbaum closed this 6 months ago

zanussbaum commented 6 months ago

I had to make two quick changes.

Let me know if there are other places I need to pass the prefix.
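For context, nomic-embed-text-v1 expects a task prefix on every input (e.g. `search_document: ` for passages, `search_query: ` for queries). A minimal sketch of what passing the prefix looks like at embedding time; the `embed_documents` helper is illustrative, not this PR's actual diff:

```python
from sentence_transformers import SentenceTransformer

# Per its model card, nomic-embed-text-v1 is loaded with trust_remote_code=True.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

def embed_documents(texts):
    # Illustrative helper: prepend the document-side task prefix
    # before encoding, as the model expects.
    prefixed = [f"search_document: {t}" for t in texts]
    return model.encode(prefixed)

embeddings = embed_documents(["vec2text inverts embeddings back to text."])
print(embeddings.shape)  # (1, 768) for nomic-embed-text-v1
```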

Tested code with:

```bash
torchrun --nproc_per_node=8 run.py \
    --experiment inversion \
    --dataset_name msmarco \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 256 \
    --max_seq_length 128 \
    --model_name_or_path t5-base \
    --embedder_model_name nomic-ai/nomic-embed-text-v1 \
    --num_repeat_tokens 16 \
    --embedder_no_grad True \
    --learning_rate 0.004 \
    --use_frozen_embeddings_as_input True \
    --num_train_epochs 100 \
    --max_eval_samples 1000 \
    --eval_steps 10000 \
    --warmup_steps 10000 \
    --bf16=1
```
zanussbaum commented 6 months ago

@jxmorris12 I'm not totally following: why would they have to retokenize? I'm happy to make some changes so it's somewhat backwards compatible.

jxmorris12 commented 6 months ago

Because Hugging Face caches based on a hashed signature of the tokenizer function. But I remembered that we disabled that and cache manually, so it's totally fine.
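For anyone following along, this refers to how `datasets.Dataset.map` fingerprints the function it is given and reuses cached results when the fingerprint matches, so changing the tokenization function (e.g. adding a prefix) would normally force cached datasets to be regenerated. A minimal sketch of that mechanism; the toy dataset and function are illustrative, not from this repo:

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["hello world", "foo bar"]})

def add_prefix(example):
    # With a disk-backed dataset, the result of map(add_prefix) is cached
    # under a fingerprint derived from hashing this function; editing the
    # function changes the hash, so the old cache is not reused.
    example["text"] = "search_document: " + example["text"]
    return example

# Automatic caching, keyed on the function's fingerprint:
ds2 = ds.map(add_prefix)

# Opting out of the automatic cache (so caching can be handled manually):
ds3 = ds.map(add_prefix, load_from_cache_file=False)
```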