Closed jyyang26 closed 1 week ago
It's not clear what error you're running into because you didn't post a stack trace. However, I do have a section of the readme about how to evaluate the models from the paper, which recommends running this code:
```python
from vec2text import analyze_utils

experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
    "jxm/gtr__nq__32__correct"
)
train_datasets = experiment._load_train_dataset_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)
val_datasets = experiment._load_val_datasets_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer
)

trainer.args.per_device_eval_batch_size = 16
trainer.sequence_beam_width = 1
trainer.num_gen_recursive_steps = 20
trainer.evaluate(
    eval_dataset=train_datasets["validation"]
)
```
which worked for me last time I tried it. Can you try that?
Thanks for your timely reply. After running the evaluation code you gave me, I got the errors below, which seem to occur while loading the dataset. I looked into possible causes and suspected an installed package, but the error persisted after updating it. It may also be a problem with the dataset itself. Do you know how to fix this?
```
(base) [jyyang@hostname tests]$ python invert_embed_jx.py
/home/jyyang/anaconda3/lib/python3.12/site-packages/transformers/training_args.py:1525: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
Set num workers to 4
Experiment output_dir = saves/jxm__gtr__nq__32__correct
/home/jyyang/anaconda3/lib/python3.12/site-packages/transformers/training_args.py:1525: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
Set num workers to 4
Experiment output_dir = saves/jxm__gtr__nq__32
Loading datasets with TOKENIZERS_PARALLELISM = False
>> using fast tokenizers: True True
Running tokenizer on dataset (num_proc=48): 100%|██████████| 1000/1000 [00:09<00:00, 104.06 examples/s]
Running tokenizer on dataset (num_proc=48): 100%|██████████| 1000/1000 [00:09<00:00, 107.09 examples/s]
Running tokenizer on dataset (num_proc=48): 100%|██████████| 1000/1000 [00:09<00:00, 107.33 examples/s]
[Precomputing embeddings with batch size: 512]
saving precomputed embeddings to file: 5c812bc2a204dfaf5e45ee728663e5f9572d52fc0eb17707
Map: 100%|██████████| 1000/1000 [00:01<00:00, 766.54 examples/s]
saving precomputed embeddings to file: f9a661eacda517bf5e45ee728663e5f9572d52fc0eb17707
Map: 100%|██████████| 1000/1000 [00:00<00:00, 1466.80 examples/s]
saving precomputed embeddings to file: f9a661eacda517bf5e45ee728663e5f9572d52fc0eb17707
Map: 100%|██████████| 1000/1000 [00:00<00:00, 1483.48 examples/s]
saving train_dataset to path: /home/jyyang/.cache/inversion/dd0d97ad14fd6897b0d31cecc2e14d13.arrow
Saving the dataset (1/1 shards): 100%|██████████| 1000/1000 [00:00<00:00, 98298.62 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 1000/1000 [00:00<00:00, 55877.86 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 500/500 [00:00<00:00, 38562.64 examples/s]
Traceback (most recent call last):
  File "/home/jyyang/vec2text-master/tests/invert_embed_jx.py", line 3, in <module>
    experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jyyang/anaconda3/lib/python3.12/site-packages/vec2text/analyze_utils.py", line 172, in load_experiment_and_trainer_from_pretrained
    trainer = experiment.load_trainer()
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jyyang/anaconda3/lib/python3.12/site-packages/vec2text/experiments.py", line 759, in load_trainer
    ) = vec2text.analyze_utils.load_experiment_and_trainer_from_pretrained(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jyyang/anaconda3/lib/python3.12/site-packages/vec2text/analyze_utils.py", line 172, in load_experiment_and_trainer_from_pretrained
    trainer = experiment.load_trainer()
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jyyang/anaconda3/lib/python3.12/site-packages/vec2text/experiments.py", line 631, in load_trainer
    train_dataset, eval_dataset = self.load_train_and_val_datasets(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jyyang/anaconda3/lib/python3.12/site-packages/vec2text/experiments.py", line 595, in load_train_and_val_datasets
    val_datasets_dict = self._load_val_datasets_uncached(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jyyang/anaconda3/lib/python3.12/site-packages/vec2text/experiments.py", line 518, in _load_val_datasets_uncached
    val_datasets_dict = load_standard_val_datasets()
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jyyang/anaconda3/lib/python3.12/site-packages/vec2text/data_helpers.py", line 251, in load_standard_val_datasets
    "wikibio": load_wikibio_val(),
               ^^^^^^^^^^^^^^^^^^
  File "/home/jyyang/anaconda3/lib/python3.12/site-packages/vec2text/data_helpers.py", line 131, in load_wikibio_val
    d = datasets.load_dataset("wiki_bio")["val"]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jyyang/anaconda3/lib/python3.12/site-packages/datasets/load.py", line 2606, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jyyang/anaconda3/lib/python3.12/site-packages/datasets/load.py", line 2277, in load_dataset_builder
    dataset_module = dataset_module_factory(
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jyyang/anaconda3/lib/python3.12/site-packages/datasets/load.py", line 1923, in dataset_module_factory
    raise e1 from None
  File "/home/jyyang/anaconda3/lib/python3.12/site-packages/datasets/load.py", line 1875, in dataset_module_factory
    can_load_config_from_parquet_export = "DEFAULT_CONFIG_NAME" not in f.read()
                                                                      ^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 1: invalid start byte
(base) [jyyang@hostname tests]$
```
I'm not sure what the problem is. My guess is that you ran the command multiple times, and the first run(s) corrupted the downloaded wiki_bio file. I just tested this locally and it works fine for me:
```python
>>> import datasets
>>> datasets.load_dataset("wiki_bio")
DatasetDict({
    train: Dataset({
        features: ['input_text', 'target_text'],
        num_rows: 582659
    })
    test: Dataset({
        features: ['input_text', 'target_text'],
        num_rows: 72831
    })
    val: Dataset({
        features: ['input_text', 'target_text'],
        num_rows: 72831
    })
})
```
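If the first run(s) really did corrupt the cached download, another way around it is to delete the cached copy so `datasets` fetches it fresh. This is a sketch assuming the default Hugging Face datasets cache location (`~/.cache/huggingface/datasets`); adjust the path if you have set `HF_DATASETS_CACHE`.

```shell
# Remove the possibly-corrupted cached copy of wiki_bio; it will be
# re-downloaded on the next call to datasets.load_dataset("wiki_bio").
rm -rf ~/.cache/huggingface/datasets/wiki_bio
```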
That said, this validation set isn't important. You can just comment that line out (line 251 of data_helpers.py) to get around the problem.
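For reference, the workaround would look roughly like this. This is a self-contained sketch, not the actual body of `load_standard_val_datasets`; only the `"wikibio"` key and `load_wikibio_val` appear in the traceback above, and the other entry is a placeholder standing in for the remaining validation sets in `data_helpers.py`.

```python
# Sketch: build the validation-set dict with the wiki_bio entry disabled.
def load_standard_val_datasets():
    return {
        # "wikibio": load_wikibio_val(),  # commented out: local download is corrupted
        "nq": ["placeholder validation texts"],  # stand-in for the other datasets
    }

val_datasets = load_standard_val_datasets()
print(sorted(val_datasets))  # wiki_bio is never loaded, so no UnicodeDecodeError
```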
Hi! While reproducing the paper's code, I want to use a local gtr-t5-base model along with the local gtrnq32 and gtrnq32__correct models. However, after running the following code, an error reports that the parameter decoder_input_ids or decoder_inputs_embeds is missing during argument passing. Which part of the code should I modify? The vec2text model demo code is as follows:
The gtrnq32__correct config.json configuration is as follows:
The gtrnq32 config.json configuration is as follows: