patil-suraj / question_generation

Neural question generation using transformers
MIT License
1.11k stars 348 forks

Fine-tuning a T5 model with another language #110

Open vabatista opened 1 year ago

vabatista commented 1 year ago

Hi,

I'm trying to figure out how to prepare the data and fine-tune this T5 base model (https://huggingface.co/unicamp-dl/ptt5-base-t5-vocab) on this SQuAD dataset (https://huggingface.co/datasets/squad_v1_pt).

I downloaded the data from Hugging Face to a local folder.

Then I ran the following command:

python prepare_data.py \
    --task e2e_qg \
    --model_type t5 \
    --dataset_path data/squad_v1_pt \
    --qg_format highlight_qg_format \
    --max_source_length 512 \
    --max_target_length 32 \
    --train_file_name train_data_e2e_qg_t5_ptbr.pt \
    --valid_file_name valid_data_e2e_qg_t5_ptbr.pt 

But I got this error:

(qagenerator) Apptainer> python prepare_data.py \
    --task e2e_qg \
    --model_type t5 \
    --dataset_path data/squad_v1_pt \
    --qg_format highlight_qg_format \
    --max_source_length 512 \
    --max_target_length 32 \
    --train_file_name train_data_e2e_qg_t5_ptbr.pt \
    --valid_file_name valid_data_e2e_qg_t5_ptbr.pt 
/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/transformers/models/t5/tokenization_t5.py:163: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
  warnings.warn(
07/03/2023 08:55:55 - INFO - nlp.load -   Checking data/squad_v1_pt/squad_v1_pt.py for additional imports.
07/03/2023 08:55:55 - INFO - nlp.load -   Found main folder for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt
07/03/2023 08:55:55 - INFO - nlp.load -   Found specific version folder for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/65162e0fbe44f19a4d2ad9f5f507d2e965e74249fc3239dc78b4e3bd93bab7c4
07/03/2023 08:55:55 - INFO - nlp.load -   Found script file from data/squad_v1_pt/squad_v1_pt.py to /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/65162e0fbe44f19a4d2ad9f5f507d2e965e74249fc3239dc78b4e3bd93bab7c4/squad_v1_pt.py
07/03/2023 08:55:55 - INFO - nlp.load -   Found dataset infos file from data/squad_v1_pt/dataset_infos.json to /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/65162e0fbe44f19a4d2ad9f5f507d2e965e74249fc3239dc78b4e3bd93bab7c4/dataset_infos.json
07/03/2023 08:55:55 - INFO - nlp.load -   Found metadata file for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/65162e0fbe44f19a4d2ad9f5f507d2e965e74249fc3239dc78b4e3bd93bab7c4/squad_v1_pt.json
Traceback (most recent call last):
  File "/projetos/u4vn/question_generation/prepare_data.py", line 204, in <module>
    main()
  File "/projetos/u4vn/question_generation/prepare_data.py", line 155, in main
    train_dataset = nlp.load_dataset(data_args.dataset_path, name=data_args.qg_format, split=nlp.Split.TRAIN)
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/load.py", line 536, in load_dataset
    builder_instance: DatasetBuilder = builder_cls(
TypeError: 'NoneType' object is not callable
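My guess at what is going on here (an assumption on my part, not something the traceback states): the legacy `nlp` package used by this repo scans the dataset script for a subclass of *its own* `nlp.DatasetBuilder`, but the `squad_v1_pt.py` script downloaded from today's Hub is written against the newer `datasets` package, so the scan finds nothing and `builder_cls` comes back `None`. A minimal sketch with stand-in classes (none of these are the real library types):

```python
import inspect

# Stand-ins (assumptions): the legacy `nlp` package and the newer
# `datasets` package each define their own, unrelated builder base class.
class NlpDatasetBuilder: ...
class DatasetsGeneratorBasedBuilder: ...

# A script downloaded from today's Hub subclasses the `datasets` base:
class SquadV1Pt(DatasetsGeneratorBasedBuilder): ...

def import_main_class(classes, base):
    """Mimics the legacy loader: return the first strict subclass of
    `base` among the classes defined in the dataset script, else None."""
    for obj in classes:
        if inspect.isclass(obj) and issubclass(obj, base) and obj is not base:
            return obj
    return None

# The legacy loader searches for *its own* base class and finds nothing:
builder_cls = import_main_class([SquadV1Pt], NlpDatasetBuilder)
print(builder_cls)  # None
# load.py then calls builder_cls(...) -> TypeError: 'NoneType' object is not callable
```

If that guess is right, the fix is to adapt the script to the legacy `nlp` API (as the repo's own `data/squad_multitask` script does) rather than use the Hub script as-is.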

I also tried copying the data/squad_multitask directory and modifying these lines to point at my URLs:

    _URL = "https://github.com/nunorc/squad-v1.1-pt/raw/master/"
    _DEV_FILE = "dev-v1.1-pt.json"
    _TRAINING_FILE = "train-v1.1-pt.json"

Now I get a different error:

(qagenerator) Apptainer> python prepare_data.py \
    --task e2e_qg \
    --model_type t5 \
    --dataset_path data/squad_v1_pt \
    --qg_format highlight_qg_format \
    --max_source_length 512 \
    --max_target_length 32 \
    --train_file_name train_data_e2e_qg_t5_ptbr.pt \
    --valid_file_name valid_data_e2e_qg_t5_ptbr.pt 
/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/transformers/models/t5/tokenization_t5.py:163: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
  warnings.warn(
07/03/2023 09:07:01 - INFO - nlp.load -   Checking data/squad_v1_pt/squad_v1_pt.py for additional imports.
07/03/2023 09:07:02 - INFO - nlp.load -   Found main folder for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt
07/03/2023 09:07:02 - INFO - nlp.load -   Found specific version folder for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec
07/03/2023 09:07:02 - INFO - nlp.load -   Found script file from data/squad_v1_pt/squad_v1_pt.py to /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/squad_v1_pt.py
07/03/2023 09:07:02 - INFO - nlp.load -   Found dataset infos file from data/squad_v1_pt/dataset_infos.json to /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/dataset_infos.json
07/03/2023 09:07:02 - INFO - nlp.load -   Found metadata file for dataset data/squad_v1_pt/squad_v1_pt.py at /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/squad_v1_pt.json
[nltk_data] Downloading package punkt to /home/U4VN/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
07/03/2023 09:07:02 - INFO - nlp.info -   Loading Dataset Infos from /projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec
07/03/2023 09:07:02 - INFO - nlp.builder -   Generating dataset squad_multitask (/tmp/u4vn/huggingface/datasets/squad_multitask/highlight_qg_format/1.0.0/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec)
Downloading and preparing dataset squad_multitask/highlight_qg_format (download: Unknown size, generated: Unknown size, post-processed: Unknown sizetotal: Unknown size) to /tmp/u4vn/huggingface/datasets/squad_multitask/highlight_qg_format/1.0.0/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec...
07/03/2023 09:07:02 - INFO - nlp.builder -   Dataset not on Hf google storage. Downloading and preparing it from source
07/03/2023 09:07:04 - INFO - nlp.utils.info_utils -   Unable to verify checksums.
07/03/2023 09:07:04 - INFO - nlp.builder -   Generating split train
0 examples [00:00, ? examples/s]07/03/2023 09:07:04 - INFO - root -   generating examples from = /tmp/u4vn/huggingface/datasets/downloads/6bf2e2bfc0769ed6e47c7935079d8584fb3201dd7915b637bbcf0fe3409710a0.4d4fd5bfbda09cd172db9f6f025e9bbf6d4d7d20cd53cef625822e1f2a34dd1f
Traceback (most recent call last):  
  File "/projetos/u4vn/question_generation/prepare_data.py", line 204, in <module>
    main()
  File "/projetos/u4vn/question_generation/prepare_data.py", line 155, in main
    train_dataset = nlp.load_dataset(data_args.dataset_path, name=data_args.qg_format, split=nlp.Split.TRAIN)
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/load.py", line 548, in load_dataset
    builder_instance.download_and_prepare(
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/builder.py", line 462, in download_and_prepare
    self._download_and_prepare(
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/builder.py", line 537, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/builder.py", line 810, in _prepare_split
    for key, record in utils.tqdm(generator, unit=" examples", total=split_info.num_examples, leave=False):
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/squad_v1_pt.py", line 239, in _generate_examples
    yield count, self.process_qg_text(context, question, qa["answers"][0])
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/squad_v1_pt.py", line 144, in process_qg_text
    start_pos, end_pos = self._get_correct_alignement(context, answer)
  File "/projetos/u4vn/.venv/qagenerator/lib/python3.9/site-packages/nlp/datasets/squad_v1_pt/626b63322487b08450abd3191448d102ac4da9e41180757abb9b8013aa95f0ec/squad_v1_pt.py", line 131, in _get_correct_alignement
    raise ValueError()
ValueError
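Digging into that last frame, my understanding (an assumption, not confirmed by the repo) is that `_get_correct_alignement` only accepts the `answer_start` offset or a one-character shift, and raises `ValueError` otherwise. In a machine-translated SQuAD, the character offsets often still refer to the English context, so the answer text misses its recorded position by far more than one character. A lenient sketch of my own (not the repo's helper) that falls back to searching the context, and returns `None` so the caller can skip misaligned examples instead of crashing:

```python
def get_alignment(context: str, gold_text: str, start_idx: int):
    """Lenient variant (my own sketch) of an answer-alignment helper.
    Tries the recorded offset and a one-character shift first, then
    falls back to searching the whole context for the answer text."""
    end_idx = start_idx + len(gold_text)
    for shift in (0, -1, 1):  # the tolerance the original appears to use
        s, e = start_idx + shift, end_idx + shift
        if s >= 0 and context[s:e] == gold_text:
            return s, e
    # Fallback: the translated offset is stale, so search the context.
    found = context.find(gold_text)
    if found != -1:
        return found, found + len(gold_text)
    return None  # let the caller drop the example instead of raising

context = "A capital do Brasil é Brasília."
# Recovered via the fallback search even though offset 5 is wrong:
print(get_alignment(context, "Brasília", 5))
```

If this matches the actual failure, filtering out examples where the helper returns `None` (and logging how many were dropped) would let data preparation finish on the translated dataset.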