Hi,

I'm trying to figure out how to prepare the data and fine-tune this T5-base model (https://huggingface.co/unicamp-dl/ptt5-base-t5-vocab) with this SQuAD dataset (https://huggingface.co/datasets/squad_v1_pt).

I downloaded the data from Hugging Face to a local folder, then ran the following command:

(qagenerator) Apptainer> python prepare_data.py \
    --task e2e_qg \
    --model_type t5 \
    --dataset_path data/squad_v1_pt \
    --qg_format highlight_qg_format \
    --max_source_length 512 \
    --max_target_length 32 \
    --train_file_name train_data_e2e_qg_t5_ptbr.pt \
    --valid_file_name valid_data_e2e_qg_t5_ptbr.pt

But I got this error:

/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/transformers/models/t5/tokenization_t5.py:163: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
  warnings.warn(
07/03/2023 08:55:55 - INFO - nlp.load - Checking data/squad_v1_pt/squad_v1_pt.py for additional imports.
07/03/2023 08:55:55 - INFO - nlp.load - Found main folder for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt
07/03/2023 08:55:55 - INFO - nlp.load - Found specific version folder for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/65162e0fbe44f19a4d2ad9f5f507d2e965e74249fc3239dc78b4e3bd93bab7c4
07/03/2023 08:55:55 - INFO - nlp.load - Found script file from data/squad_v1_pt/squad_v1_pt.py to /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/65162e0fbe44f19a4d2ad9f5f507d2e965e74249fc3239dc78b4e3bd93bab7c4/squad_v1_pt.py
07/03/2023 08:55:55 - INFO - nlp.load - Found dataset infos file from data/squad_v1_pt/dataset_infos.json to /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/65162e0fbe44f19a4d2ad9f5f507d2e965e74249fc3239dc78b4e3bd93bab7c4/dataset_infos.json
07/03/2023 08:55:55 - INFO - nlp.load - Found metadata file for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/65162e0fbe44f19a4d2ad9f5f507d2e965e74249fc3239dc78b4e3bd93bab7c4/squad_v1_pt.json
Traceback (most recent call last):
  File "/projetos/u4vn/question_generation/prepare_data.py", line 204, in <module>
    main()
  File "/projetos/u4vn/question_generation/prepare_data.py", line 155, in main
    train_dataset = nlp.load_dataset(data_args.dataset_path, name=data_args.qg_format, split=nlp.Split.TRAIN)
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/load.py", line 536, in load_dataset
    builder_instance: DatasetBuilder = builder_cls(
TypeError: 'NoneType' object is not callable

I also tried to copy the data/squad_multitask directory and modify these lines with my URLs. The error now is another:

(qagenerator) Apptainer> python prepare_data.py --task e2e_qg --model_type t5 --dataset_path data/squad_v1_pt --qg_format highlight_qg_format --max_source_length 512 --max_target_length 32 --train_file_name train_data_e2e_qg_t5_ptbr.pt --valid_file_name valid_data_e2e_qg_t5_ptbr.pt
/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/transformers/models/t5/tokenization_t5.py:163: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
  warnings.warn(
07/03/2023 09:07:01 - INFO - nlp.load - Checking data/squad_v1_pt/squad_v1_pt.py for additional imports.
07/03/2023 09:07:02 - INFO - nlp.load - Found main folder for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt
07/03/2023 09:07:02 - INFO - nlp.load - Found specific version folder for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec
07/03/2023 09:07:02 - INFO - nlp.load - Found script file from data/squad_v1_pt/squad_v1_pt.py to /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/squad_v1_pt.py
07/03/2023 09:07:02 - INFO - nlp.load - Found dataset infos file from data/squad_v1_pt/dataset_infos.json to /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/dataset_infos.json
07/03/2023 09:07:02 - INFO - nlp.load - Found metadata file for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/squad_v1_pt.json
[nltk_data] Downloading package punkt to /home/U4VN/nltk_data...
[nltk_data] Package punkt is already up-to-date!
07/03/2023 09:07:02 - INFO - nlp.info - Loading Dataset Infos from /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec
07/03/2023 09:07:02 - INFO - nlp.builder - Generating dataset squad_multitask (/tmp/u4vn/huggingface/datasets/squad_multitask/highlight_qg_format/1.0.0/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec)
Downloading and preparing dataset squad_multitask/highlight_qg_format (download: Unknown size, generated: Unknown size, post-processed: Unknown sizetotal: Unknown size) to /tmp/u4vn/huggingface/datasets/squad_multitask/highlight_qg_format/1.0.0/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec...
07/03/2023 09:07:02 - INFO - nlp.builder - Dataset not on Hf google storage. Downloading and preparing it from source
07/03/2023 09:07:04 - INFO - nlp.utils.info_utils - Unable to verify checksums.
07/03/2023 09:07:04 - INFO - nlp.builder - Generating split train
0 examples [00:00, ? examples/s]07/03/2023 09:07:04 - INFO - root - generating examples from = /tmp/u4vn/huggingface/datasets/downloads/6bf2e2bfc0769ed6e47c7935079d8584fb3201dd7915b637bbcf0fe3409710a0.4d4fd5bfbda09cd172db9f6f025e9bbf6d4d7d20cd53cef625822e1f2a34dd1f
Traceback (most recent call last):
  File "/projetos/u4vn/question_generation/prepare_data.py", line 204, in <module>
    main()
  File "/projetos/u4vn/question_generation/prepare_data.py", line 155, in main
    train_dataset = nlp.load_dataset(data_args.dataset_path, name=data_args.qg_format, split=nlp.Split.TRAIN)
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/load.py", line 548, in load_dataset
    builder_instance.download_and_prepare(
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/builder.py", line 462, in download_and_prepare
    self._download_and_prepare(
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/builder.py", line 537, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/builder.py", line 810, in _prepare_split
    for key, record in utils.tqdm(generator, unit=" examples", total=split_info.num_examples, leave=False):
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/squad_v1_pt.py", line 239, in _generate_examples
    yield count, self.process_qg_text(context, question, qa["answers"][0])
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/squad_v1_pt.py", line 144, in process_qg_text
    start_pos, end_pos = self._get_correct_alignement(context, answer)
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/squad_v1_pt.py", line 131, in _get_correct_alignement
    raise ValueError()
ValueError
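For what it's worth, the final `ValueError` is raised by the dataset script's answer-alignment check, which expects each answer string to occur in its context at (or within a character or two of) `answer_start`. Machine-translated SQuAD variants often break this assumption, since contexts and answers are translated independently. Below is a minimal sketch of that check, assuming the standard SQuAD v1.1 JSON layout; the helper names are mine, not taken from `squad_v1_pt.py`. It counts how many answers in a local file would fail alignment:

```python
import json

def aligns(context, answer_text, answer_start):
    # Tolerant check in the spirit of squad_multitask-style scripts:
    # accept the answer at answer_start, or shifted left by one or two chars.
    for shift in (0, -1, -2):
        start = answer_start + shift
        if start >= 0 and context[start:start + len(answer_text)] == answer_text:
            return True
    return False

def count_misaligned(squad_json_path):
    # Returns (total_answers, misaligned_answers) for a SQuAD-v1.1-format file.
    with open(squad_json_path, encoding="utf-8") as f:
        data = json.load(f)["data"]
    total = bad = 0
    for article in data:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    total += 1
                    if not aligns(context, answer["text"], answer["answer_start"]):
                        bad += 1
    return total, bad
```

Answers flagged this way could be dropped or re-aligned (e.g. via `context.find(answer_text)`) before running `prepare_data.py`, instead of letting the script raise mid-generation.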