huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Bug in FastTokenizer #626

Closed · HeroadZ closed 3 years ago

HeroadZ commented 3 years ago

Hi, I'm trying to fine-tune the T5 model with the scripts in the transformers/examples folder.

git clone https://github.com/huggingface/transformers
cd transformers
pip install .
cd examples/seq2seq
pip install -r requirements.txt

tokenizers: 0.10.1, transformers: 4.4.0.dev0
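For reference, a trivial sketch to confirm the installed versions from Python (both packages expose the standard __version__ attribute):

import tokenizers
import transformers

print(tokenizers.__version__)    # e.g. 0.10.1
print(transformers.__version__)  # e.g. 4.4.0.dev0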

When I fine-tune the T5 model on the XSum or CNN/DailyMail datasets, it works well. However, when I fine-tune it on the Newsroom dataset, errors occur in the fast tokenizer.

(examples) [zchelllo@enigma03 transformers]$ CUDA_VISIBLE_DEVICES=0 python examples/seq2seq/run_seq2seq.py     --model_name_or_path t5-small  --do_train    --do_eval     --task summarization     --train_file ~/summary/release/newsroom_train.csv --validation_file ~/summary/release/newsroom_val.csv     --output_dir ~/summary/pre_model/t5_small_newsroom      --overwrite_output_dir  --per_device_train_batch_size=16     --per_device_eval_batch_size=16   --save_total_limit=5   --predict_with_generate --text_column text --summary_column summary 
02/11/2021 11:04:49 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: False
02/11/2021 11:04:49 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/home/zchelllo/summary/pre_model/t5_small_newsroom', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=16, per_device_eval_batch_size=16, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_steps=0, logging_dir='runs/Feb11_11-04-49_enigma03.yama.info.waseda.ac.jp', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=5, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', fp16_backend='auto', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name='/home/zchelllo/summary/pre_model/t5_small_newsroom', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, sortish_sampler=False, predict_with_generate=True)
Using custom data configuration default
Reusing dataset csv (/home/zchelllo/.cache/huggingface/datasets/csv/default-0f03a64507b552a3/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2)
loading configuration file https://huggingface.co/t5-small/resolve/main/config.json from cache at /home/zchelllo/.cache/huggingface/transformers/fe501e8fd6425b8ec93df37767fcce78ce626e34cc5edc859c662350cf712e41.406701565c0afd9899544c1cb8b93185a76f00b31e5ce7f6e18bbaef02241985
Model config T5Config {
  "architectures": [
    "T5WithLMHeadModel"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 6,
  "num_heads": 8,
  "num_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty": 2.0,
      "max_length": 200,
      "min_length": 30,
      "no_repeat_ngram_size": 3,
      "num_beams": 4,
      "prefix": "summarize: "
    },
    "translation_en_to_de": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to German: "
    },
    "translation_en_to_fr": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to French: "
    },
    "translation_en_to_ro": {
      "early_stopping": true,
      "max_length": 300,
      "num_beams": 4,
      "prefix": "translate English to Romanian: "
    }
  },
  "transformers_version": "4.4.0.dev0",
  "use_cache": true,
  "vocab_size": 32128
}

loading file https://huggingface.co/t5-small/resolve/main/spiece.model from cache at /home/zchelllo/.cache/huggingface/transformers/65fc04e21f45f61430aea0c4fedffac16a4d20d78b8e6601d8d996ebefefecd2.3b69006860e7b5d0a63ffdddc01ddcd6b7c318a6f4fd793596552c741734c62d
loading file https://huggingface.co/t5-small/resolve/main/tokenizer.json from cache at /home/zchelllo/.cache/huggingface/transformers/06779097c78e12f47ef67ecb728810c2ae757ee0a9efe9390c6419783d99382d.8627f1bd5d270a9fd2e5a51c8bec3223896587cc3cfe13edeabb0992ab43c529
loading weights file https://huggingface.co/t5-small/resolve/main/pytorch_model.bin from cache at /home/zchelllo/.cache/huggingface/transformers/fee5a3a0ae379232608b6eed45d2d7a0d2966b9683728838412caccc41b4b0ed.ddacdc89ec88482db20c676f0861a336f3d0409f94748c209847b49529d73885
All model checkpoint weights were used when initializing T5ForConditionalGeneration.
All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.
 22%|████████████████████                                                                         | 215/996 [02:44<14:09,  1.09s/ba]
thread '<unnamed>' panicked at 'index out of bounds: the len is 6200 but the index is 6200', /__w/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "examples/seq2seq/run_seq2seq.py", line 604, in <module>
    main()
  File "examples/seq2seq/run_seq2seq.py", line 445, in main
    load_from_cache_file=not data_args.overwrite_cache,
  File "/home/zchelllo/.pyenv/versions/3.6.9/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1260, in map
    update_data=update_data,
  File "/home/zchelllo/.pyenv/versions/3.6.9/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 157, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/zchelllo/.pyenv/versions/3.6.9/lib/python3.6/site-packages/datasets/fingerprint.py", line 163, in wrapper
    out = func(self, *args, **kwargs)
  File "/home/zchelllo/.pyenv/versions/3.6.9/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1521, in _map_single
    batch, indices, check_same_num_examples=len(self.list_indexes()) > 0, offset=offset
  File "/home/zchelllo/.pyenv/versions/3.6.9/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1439, in apply_function_on_filtered_inputs
    function(*fn_args, effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "examples/seq2seq/run_seq2seq.py", line 420, in preprocess_function
    model_inputs = tokenizer(inputs, max_length=data_args.max_source_length, padding=padding, truncation=True)
  File "/home/zchelllo/.pyenv/versions/3.6.9/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 2347, in __call__
    **kwargs,
  File "/home/zchelllo/.pyenv/versions/3.6.9/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 2532, in batch_encode_plus
    **kwargs,
  File "/home/zchelllo/.pyenv/versions/3.6.9/lib/python3.6/site-packages/transformers/tokenization_utils_fast.py", line 380, in _batch_encode_plus
    is_pretokenized=is_split_into_words,
pyo3_runtime.PanicException: index out of bounds: the len is 6200 but the index is 6200
 22%|████████████████████                                                                         | 215/996 [02:45<10:00,  1.30ba/s]

But if I add --use_fast_tokenizer false, it works well. I think the error is related to the fast tokenizer. Do you have any idea?
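For anyone needing an immediate workaround outside the example script, here is a minimal sketch of the slow-tokenizer fallback; use_fast=False is the standard AutoTokenizer flag, and the text variable is a placeholder for the offending Newsroom article:

from transformers import AutoTokenizer

# The slow (pure-Python) tokenizer avoids the Rust normalizer path that panics.
tokenizer = AutoTokenizer.from_pretrained("t5-small", use_fast=False)

text = "..."  # placeholder for the offending Newsroom article
encoded = tokenizer(text, max_length=1024, truncation=True)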

n1t0 commented 3 years ago

Hi @HeroadZ and thank you for reporting this.

Would you be able to create a minimal way to reproduce this? Maybe you could try to isolate the failing call to the tokenizer.
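A minimal sketch of one way to do that isolation, assuming the Rust panic surfaces in Python as pyo3_runtime.PanicException (which may not derive from Exception, hence the broad catch):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

def find_failing_rows(texts):
    """Tokenize each text individually and collect the indices that crash."""
    failing = []
    for i, text in enumerate(texts):
        try:
            tokenizer(text, max_length=1024, truncation=True)
        except BaseException as err:  # PanicException may not subclass Exception
            failing.append((i, repr(err)))
    return failing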

HeroadZ commented 3 years ago

@n1t0 Hi, I found the abnormal text. Please check it:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("t5-small")
>>> s = 'By Robert Barnes The Supreme Court on Tuesday stopped Arizona from distributing campaign subsidies to publicly funded candidates facing big-spending opponents.\n\nThe court granted a stay request from opponents of a decade-old law that subsidizes state candidates who agree to spend only public money on their campaigns. The high court will decide whether to review lower court decisions.\n\nThe subsidies are an attempt to blunt the influence of campaign contributors. In order to keep the publicly financed candidates from being roundly outspent, new subsidies are doled out according to the fundraising and spending of their privately financed opponents.\n\nBut those candidates, some of whom are self-financed, say the law forces them to limit their spending to avoid triggering more public money for their opponents.\n\nA federal judge in Arizona said that made the law unconstitutional. But the U.S. Court of Appeals for the 9th Circuit disagreed, setting up the issue for the high court. Opponents of the law asked the court to stop the next round of public payments, which are scheduled for June 22, while deciding whether to hear the case.\n\nA brief submitted by an intervener in the case, Clean Elections Institute, said disallowing the subsidies would "likely distort the outcome of the 2010 elections in Arizona."\n\nAs an example, it pointed to the governor\'s race. Gov. Jan Brewer (R), a publicly funded candidate, is eligible to receive more than $2.1 million under the current plan. "If matching funds were enjoined, that amount will drop by 66 percent to $707,447." Her privately financed GOP opponent Buz Mills, the brief said, already has spent nearly $2.3 million.\n\nAccording to the court\'s order, the stay would dissolve if the court decided not to take the case. The decision on the stay, as is customary, came without explanation. There were no noted dissents.\n\nBy Robert Barnes | June 8, 2010; 11:36 AM ET Categories: 2010 Election , 44 The Obama Presidency , 50 States Save & Share: Previous: Fiorina campaigning on her record at HP Next: One Arkansas voter, many reasons\n\nTOO MANY midwesterners/northeasters/emigress to AZ. I am a native Arizonan, and these newcomers WHO CONTINUE TO COMPLAIN have left their FAILED STATES where they DO NOT pay taxes USE PUBLIC entities, i.e. libraries, law enforcement, hospitals, libraries, infrastructers, et al. These are members of the 21st century KKK\n\nPosted by: neec13 | June 8, 2010 10:17 PM | Report abuse\n\nDear customers, thank you for your support of our company. Here, there\'s good news to tell you: The company recently launched a number of new fashion items! ! Fashionable and welcome everyone to come buy. 
If necessary, welcometo: ===== http://www.smalltrade.net =====\n\nfree shipping competitive price any size available accept the paypal\n\nHandbags(Coach l v f e n d i d&g) $35\n\nSunglasses(Oakey,coach,gucci,A r m a i n i) $16\n\n====== http://www.smalltrade.net ===== ` â\x95°â\x80\x94â\x94\x98 ã\x80\x82 â\x94 â\x98 `_ã\x80\x81 â\x94\x82\__â\x95\xadâ\x95\xadâ\x95\xadâ\x95\xadâ\x95\xad__ï¼\x8fâ\x94\x82 ã\x80\x80ã\x80\x80 â\x94\x82ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80â\x94\x82 ã\x80\x80 â\x94\x82ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80â\x94\x82ã\x80\x80 â\x94\x82ã\x80\x80â\x97\x8fã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80â\x97\x8fã\x80\x80â\x94\x82ã\x80\x80 â\x94\x82â\x89¡ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ï½\x8fã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80â\x89¡â\x94\x82 â\x94\x82ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80ã\x80\x80â\x94\x82ã\x80\x80 â\x95°â\x80\x94â\x80\x94â\x94¬ï¼¯â\x97¤â\x96½â\x97¥ï¼¯â\x94¬â\x80\x94â\x80\x94â\x95¯ ã\x80\x80ã\x80\x80ã\x80\x80ï½\x9cã\x80\x80ã\x80\x80ï½\x8fã\x80\x80ã\x80\x80ï½\x9c ã\x80\x80ã\x80\x80ã\x80\x80ï½\x9câ\x95\xadï¼\x8dï¼\x8dï¼\x8dâ\x95®ï½\x9c â\x94\x8câ\x94¬â\x94¬â\x94¬â\x94\x90ã\x80\x80 â\x95\x9eâ\x95§â\x95§â\x95§â\x95§â\x95\x90â\x95\x90â\x95§â\x95§â\x95§â\x95§â\x95§â\x95§â\x95§â\x95§â\x95¡\n\nPosted by: itkonlyyou108 | June 8, 2010 9:55 PM | Report abuse\n\nAre our elected representatives up for sale to the highest bidder/campaign contributor? Why not just have Exxon and Aetna and the other major corporations in this nation send employees to fill the spots? It would be more intellectually honest than what we are pretending to do now.\n\nNor does a link to a comparable taxpayer funded match make much sense. That\'s a really perverse incentive to run "free" public campaigns against the best funded opponents on the taxpayers tab.\n\nPosted by: smrtmx | June 8, 2010 8:26 PM | Report abuse\n\nArizona needs to get even. Actually, the entire West does. The Supreme Court is dominated by Northeastern nitwits that haven\'t got a clue how we think. So, teach them a lesson. Take all of those illegals rounded up and send them on buses or trains to Washington D.C. Then sit back and watch. The ensuing sky rocketing crime rate, the brutal gang warfare, where hispanic gangs take great ride in including the general pubic, the identity theft, home invasion and armed robberies, stolen cars, rapes, pedophilia, welfare fraud, homicides, burglaries, drain on social services and schools and health services, is precisely what these fools deserve. Don\'t waste a golden opportunity! Oh, and New York City could really use a couple million, too.\n\nPosted by: mibrooks27 | June 8, 2010 8:09 PM | Report abuse\n\nSCOTUS froze public funding in Arizona, thus giving a tremendous advantage to the millionaires and other privately funded candidates. No matter how it eventually rules, it has "fixed" this year\'s elections in Arizona. Releasing the funds in 4-6 months is too late. Doesn\'t anyone think there\'s something wrong with that?\n\nArizona\'s laws are presumed valid and constitutional until found otherwise. Yet, the injunction against public matching funds stops funding now. BEFORE THE LAW IS FOUND UNCONSTITUTIONAL ____________________________ it was found unconstitutional, by the Arizona judge. the state appealed, and the 9th circuit reversed. 
that opinion is stayed, leaving the trial court\'s ruling intact.\n\nPosted by: JoeT1 | June 8, 2010 6:18 PM | Report abuse\n\nDemocracy is dead when elections are for sale to the candidate with the most money for slick advertising devoid of honesty or information. Of course, the candidate without millions to spend can always exercise his or her freedom of speech on street corners.\n\nPosted by: neo-nemesis | June 8, 2010 6:13 PM | Report abuse\n\nArizona hang in there!!! Brewer the feds are just trying to cut you out of the race any way they can because they know and we know you will be in office again. If we donate money to your cause. The government cannot stop us nor you. After all look at our so called named president. He does not abide by our rules therefore he cannot stop us.What I would like to see is you in the oval office.The feds know that. So they have to try and stop you. We wont let that happen. Stay strong Arizona!!! Remember our founding fathers gave the states their own power for a reason, and the feds cannot stop it, but they are trying. You are a very dangerous threat to them. God Bless You!! Stay Strong! We are behind you all the way!\n\nPosted by: Lana5 | June 8, 2010 5:54 PM | Report abuse\n\nSo far, the SCOTUS is batting 1000 for Big Money and Corporations. No surprise since they were selected by the Corporate-backed Bush Crime Family.\n\nNO Government-regulation. NO limits on "endowed by their creator" free-speech right to fund political Campaigns. No limit to police arrest and interrogation.\n\nThe true Party of NO!!\n\nPosted by: thebobbob | June 8, 2010 5:35 PM | Report abuse\n\nHow long until this neo-con SCOTUS attempts to implement a $250,000 property requirement for voting. The neo-con pitch, America for Sale to the highest bidder.\n\nPosted by: jmdziuban1 | June 8, 2010 5:17 PM | Report abuse\n\nI agree with farmsnorton. If obama\'s for this, count me against it too. He\'s shifty, and things are orchestrated to serve his own agenda. ********************************************* That\'s pretty much what it comes down to, isn\'t it? Someone in this thread said we need more transparency and smarter voters. Bring on the transparency. Voters like the one quoted above are a lost cause.\n\nPosted by: st50taw | June 8, 2010 5:06 PM | Report abuse\n\nI agree with farmsnorton. If obama\'s for this, count me against it too. He\'s shifty, and things are orchestrated to serve his own agenda. ********************************************* That\'s pretty much what it comes down, isn\'t it? Someone in this thread said we need more transparency and smarter voters. Bring on the transparency. 
The smart voters are a lost cause.\n\nPosted by: st50taw | June 8, 2010 5:04 PM | Report abuse\n\nAmerica is seeing firsthand a classic showcase of millionaires trying to buy elections and take over the United States government.\n\nThe Republican Meg Whitman (billionaire), who never EVER even voted until she decided she was going buy the California Governorship this year is a prime example of somebody that has NO business in politics.\n\nWithout campaign finance laws the electoral exploitation by the super wealthy will only ratchet up until it is too late and Democracy in Americas will only be found in history books.\n\nSo this is the \'LESS-GOVERNMENT\' those tea baggers are promoting alongside the Libertarian\'s Jim Crow Laws!!!\n\nUnless the American People put down their damn cell phones for a minute and start paying closer attention to election issues where they do coherently understand what is about to happen - American Democracy is doomed.\n\nA 2nd American civil war is coming...\n\nSee Ken Burns documentaries about The Great West and The American Civil War if you want to see history about to repeat itself in the very ugliest way.\n\nPosted by: danglingwrangler | June 8, 2010 4:08 PM | Report abuse\n\nThe RATS contingent on SCOTUS has already decided that elections are up for the highest bidder. Now they want to make sure that there are no matching funds for candidates who cannot bid as high as the SCOTUS cronies. Bah.\n\nPosted by: frodot | June 8, 2010 4:05 PM | Report abuse\n\nAmerica is seeing firsthand a classic showcase of millionaire trying to buy elections and take over the government.\n\nWithout campaign finance laws the electoral exploitation by the super wealthy will only ratchet up until it is too late and Democracy in Americas will only be found in history books.\n\nSo this is the \'LESS-GOVERNMENT\' those tea baggers are promoting alongside the Libertarian\'s Jim Crow Laws!!!\n\nUnless the American People put down their damn cell phones for a minute and start paying closer attention to election issues where they do coherently understand what is about to happen - American Democracy is doomed.\n\nA 2nd American civil war is coming...\n\nSee Ken Burns documentaries about The Great West and The American Civil War if you want to see history about to repeat itself in the very ugliest way.\n\nPosted by: danglingwrangler | June 8, 2010 4:04 PM | Report abuse\n\nSCOTUS froze public funding in Arizona, thus giving a tremendous advantage to the millionaires and other privately funded candidates. No matter how it eventually rules, it has "fixed" this year\'s elections in Arizona. Releasing the funds in 4-6 months is too late. Doesn\'t anyone think there\'s something wrong with that?\n\nArizona\'s laws are presumed valid and constitutional until found otherwise. Yet, the injunction against public matching funds stops funding now. BEFORE THE LAW IS FOUND UNCONSTITUTIONAL.\n\nPosted by: Reesh | June 8, 2010 4:00 PM | Report abuse\n\nSo much for "States Rights" and the will of the American people. The Clean Election law was a referendum passed by Arizona voters 10 years ago and now over-turned by activist non-elected judges in Washington. Why initiate a "stay" in the middle of an election? This is changing the rules in the middle of a game.\n\nI guess the Supreme Court is re-affirming the golden rule. 
They rule in favor of only those with the gold.\n\nPosted by: DesertLeap | June 8, 2010 3:24 PM _____________________________________ First off, the law was ruled unconstitutional by a Federal District Court judge in Arizona. The 9th District Court of Appeals, based in San Francisco, overruled that decision. The SCOTUS now is deciding whether to review that decision.\n\nSecondly, "States\' Rights" and the "will of the American (Arizonan?) people" don\'t allow a state to violate the 1st Amendment. Based on what I\'ve read, I don\'t think that this law does that, but it is within the jurisdiction of the SCOTUS to decide that question, if they decide to review the 9th Circuit decision. And if they don\'t, then the public financing money gets released.\n\nPosted by: luridone | June 8, 2010 3:50 PM | Report abuse\n\nSo much for "States Rights" and the will of the American people. The Clean Election law was a referendum passed by Arizona voters 10 years ago and now over-turned by activist non-elected judges in Washington. Why initiate a "stay" in the middle of an election? This is changing the rules in the middle of a game.\n\nI guess the Supreme Court is re-affirming the golden rule. They rule in favor of only those with the gold. _____________________ you have the facts backwards. An Arizona judge held the law unconstitutional under the U.S. Constitution (which is why states rights are irrelevant, by the way). The federal appellate court reversed, reinstating the law. The SCOTUS has stayed that decision, which signals that they may agree with the Arizona judge that the law is unconstitutional (which would be consistent with their recent decisions taking apart the McCain-Feingold campaign reforms on first amendment grounds as well.\n\nI\'m with George Will on this one. McCain-Feingold is garbage. This law may not be much better, but perhaps a closer call because the privately funded can still spend what they want. The answer to money in politics is transparency, better candidates, and brighter voters, not spending limits.\n\nPosted by: JoeT1 | June 8, 2010 3:47 PM | Report abuse\n\nSo much for "States Rights" and the will of the American people. The Clean Election law was a referendum passed by Arizona voters 10 years ago and now over-turned by activist non-elected judges in Washington. Why initiate a "stay" in the middle of an election? This is changing the rules in the middle of a game.\n\nI guess the Supreme Court is re-affirming the golden rule. They rule in favor of only those with the gold.\n\nPosted by: DesertLeap | June 8, 2010 3:24 PM _____________________________________ First off, the law was ruled unconstitutional by a Federal District Court judge in Arizona. The 9th District Court of Appeals, based in San Francisco, overruled that decision. The SCOTUS now is deciding whether to review that decision.\n\nSecondly, "States\' Rights" and the "will of the American (Arizonan?) people" don\'t allow a state to violate the 1st Amendment. Based on what I\'ve read, I don\'t think that this law does that, but it is within the jurisdiction of the SCOTUS to decide that question, if they decide to review the 9th Circuit decision. And if they don\'t, then the public financing money gets released.\n\nPosted by: luridone | June 8, 2010 3:46 PM | Report abuse\n\nElections are held to be bought.\n\nPosted by: Garak | June 8, 2010 3:41 PM | Report abuse\n\nSo much for "States Rights" and the will of the American people. 
The Clean Election law was a referendum passed by Arizona voters 10 years ago and now over-turned by activist non-elected judges in Washington. Why initiate a "stay" in the middle of an election? This is changing the rules in the middle of a game.\n\nI guess the Supreme Court is re-affirming the golden rule. They rule in favor of only those with the gold.\n\nPosted by: DesertLeap | June 8, 2010 3:24 PM | Report abuse\n\nis this the same suprem court that said lobbying is the same as petitioning? we should check their financials.stay out of the states business,do what you are suppose to do interpit the US CONSTITUTION.YOU DON`T MAKE LAWS.\n\nPosted by: SISSD1 | June 8, 2010 3:06 PM _______________________________ And the claim here is that the state law violates the plaintiff\'s 1st Amendment rights. Whether that argument is right or wrong (I happen to believe it\'s wrong), it\'s a question of Constitutional intepretation, and that puts it squarely within the jurisdiction of the SCOTUS.\n\nPosted by: luridone | June 8, 2010 3:24 PM | Report abuse\n\nHere we have the Soprano Court again meddling in State\'s rights. The State of Arizona decided to provided matching funds to candidates. SCOTUS issued an injunction against matching funds until they decide the issue. That, for this election cycle, means no matching funds in Arizona. How dare this Supreme Court manipulate elections? Of course, the privately financed fat cat wins.\n\nThose fascists who stole the 2000 presidential election should be impeached. And, I mean they should be impeached TODAY. We have enough problems without losing the right to vote. No American can possibly believe that corporations are people. Previous Supreme Court cases rejected the idea. How now did corporations come to life. Only God can do that! The idea is even more bizarre because it means ANY corporation stands on equal footing with American citizens. Iraq, Iran, Pakistan, Libya, Turkey, Syria, Saudi Arabia, Kuwait, Palestine, N Korea, Myanmar, any of them can spend as much money as they want secretly (through corporations) to influence our elections.\n\nElections are subject to state laws. That\'s what the constitution says. Get the federal government out of the states.\n\nPosted by: Reesh | June 8, 2010 3:23 PM | Report abuse\n\nis this the same suprem court that said lobbying is the same as petitioning? we should check their financials.stay out of the states business,do what you are suppose to do interpit the US CONSTITUTION.YOU DON`T MAKE LAWS.\n\nPosted by: SISSD1 | June 8, 2010 3:06 PM | Report abuse\n\nis this the same suprem court that said lobbying is the same as petitioning? we should check their financials.stay out of the states business,do what you are suppose to do interpit the US CONSTITUTION.YOU DON`T MAKE LAWS.\n\nPosted by: SISSD1 | June 8, 2010 3:06 PM | Report abuse\n\nliking dogs, for public relations is the choice for obamaa, rather than the traditional kissing of babies ,for obvious reasons ! little is known about why so many dogs are found in shelters from being abandaned by african owners with tags, some guesses arise from a corelation between conpetitive attributes in nature .\n\nPosted by: sideboom | June 8, 2010 2:58 PM | Report abuse\n\nafterall, companies, foreign or domestic, are U.S. 
citizens accordig to those old fools.\n\nI wonder, are unions people too?\n\nMaybe we should get both out of the equation?\n\nPosted by: VirginiaConservative | June 8, 2010 2:03 PM | Report abuse\n\nThis is a matter of a STATE\'S election funding rules!!!!! What the bleep does it have to do with Obama????? You Obama-haters need to take some Valium. Better yet, arsenic.\n\nPosted by: luridone | June 8, 2010 1:35 PM | Report abuse\n\nYou wrote, "If Obama is for it, I\'m against it." I hate to tell you, but this is a state of Arizona issue and Obama is not involved in it at all. So, basically, you people are idiots.\n\nBy the way, I hear Obama likes dogs. Even bought a puppy for his kids. So I suppose this means you\'re against puppies?\n\nPosted by: Len_RI1 | June 8, 2010 1:29 PM | Report abuse\n\n@farmsnorton: why do you just accept talking points, as opposed to just doing a few moments of research online, to actually find facts... and then making an informed opinion?? Reagan, both Bushes, and basically all Pols do as Obama did. It\'s called Politics. If you want to say Obama said he\'d be above normal Politics, and he hasn\'t been... that can be argued. But, that he\'s bribed, etc, that\'s just hypocrisy. As for the Gulf mess, the Feds don\'t have cameras monitoring every oil rig in the Gulf. The Feds HAD to rely on what BP was saying, until they ordered BP to open the camera feeds. That is when the Feds could start evaluating the extent of the mess. The GOP has blocked most appointments by Obama, thus, the Interior Dept & every other Agency is dangerously shorthanded. What you\'re seeing in the Gulf, is the result of 30 years of *small government, cut taxes, drill-baby-drill, Privatization is great, let Corporations monitor themselves, etc* ideology of the Right. The Feds just don\'t have the resources, to do much else than fight wars, any more. Educate yourself already, and stop letting others think for you.\n\nPosted by: burf | June 8, 2010 1:28 PM | Report abuse\n\nI agree with farmsnorton. If obama\'s for this, count me against it too. He\'s shifty, and things are orchestrated to serve his own agenda.\n\nPosted by: wmpowellfan | June 8, 2010 1:10 PM | Report abuse\n\nThe argument against the law is that allowing the opponent to get matching funds inhibits the free-speech rights of the big spender.\n\nIn reality, it only limits the ability of big-spenders to dominate the airwaves. But having a fair fight is not nearly as important to the big-spenders as winning at any cost. They know they will recoup their investment once they have political power.\n\nPosted by: ad9inaz | June 8, 2010 12:58 PM | Report abuse\n\nIf Obama is for it, I\'m against it. I have seen this man lie and bribe to get his way and this is not the American way. I have seen him turn away from our vets, travel, play golf and campaign for Boxer while he handed off the gulf crisis to BP for over 35 days. James Carville, a democratic adviser on CNN was screaming at Obama to get down there and lead. Nothing was really done for the first 35 days.\n\nPosted by: farmsnorton | June 8, 2010 12:52 PM | Report abuse\n\nWell we know who the Supreme Court is going to side with; afterall, companies, foreign or domestic, are U.S. citizens accordig to those old fools.\n\nPosted by: davidlhanegraaf | June 8, 2010 12:31 PM | Report abuse\n\nThe comments to this entry are closed.'
>>> tokenizer(s, max_length=1024, truncation=True)
thread '<unnamed>' panicked at 'index out of bounds: the len is 21830 but the index is 21830', /__w/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zchelllo/anaconda3/envs/ex2/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2371, in __call__
    **kwargs,
  File "/home/zchelllo/anaconda3/envs/ex2/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2441, in encode_plus
    **kwargs,
  File "/home/zchelllo/anaconda3/envs/ex2/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 467, in _encode_plus
    **kwargs,
  File "/home/zchelllo/anaconda3/envs/ex2/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 380, in _batch_encode_plus
    is_pretokenized=is_split_into_words,
pyo3_runtime.PanicException: index out of bounds: the len is 21830 but the index is 21830
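From an article this long, one could shrink toward a minimal trigger by repeatedly discarding halves while the crash persists; a rough sketch, again assuming the panic is catchable from Python:

def crashes(tokenizer, text):
    """Return True if tokenizing text raises (including the Rust panic)."""
    try:
        tokenizer(text, max_length=1024, truncation=True)
        return False
    except BaseException:
        return True

def shrink(tokenizer, text):
    """Greedily keep whichever half of text still crashes the tokenizer."""
    while len(text) > 1:
        half = len(text) // 2
        if crashes(tokenizer, text[:half]):
            text = text[:half]
        elif crashes(tokenizer, text[half:]):
            text = text[half:]
        else:
            break  # neither half alone reproduces; stop at this size
    return text

# usage sketch: shrink(AutoTokenizer.from_pretrained("t5-small"), s)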
Peter-Devine commented 3 years ago

Can reproduce with

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small")
s = '_ï¼\x8fâ\x94'
tokenizer(s, max_length=1024, truncation=True)

giving me:

---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
<ipython-input-17-672efbeb11a2> in <module>
      2 tokenizer = AutoTokenizer.from_pretrained("t5-small")
      3 s = '_ï¼\x8fâ\x94'
----> 4 tokenizer(s, max_length=1024, truncation=True)

c:\temp\notebook-temp\lib\site-packages\transformers\tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2366                 return_length=return_length,
   2367                 verbose=verbose,
-> 2368                 **kwargs,
   2369             )
   2370 

c:\temp\notebook-temp\lib\site-packages\transformers\tokenization_utils_base.py in encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2436             return_length=return_length,
   2437             verbose=verbose,
-> 2438             **kwargs,
   2439         )
   2440 

c:\temp\notebook-temp\lib\site-packages\transformers\tokenization_utils_fast.py in _encode_plus(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
    470             return_length=return_length,
    471             verbose=verbose,
--> 472             **kwargs,
    473         )
    474 

c:\temp\notebook-temp\lib\site-packages\transformers\tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
    383             batch_text_or_text_pairs,
    384             add_special_tokens=add_special_tokens,
--> 385             is_pretokenized=is_split_into_words,
    386         )
    387 

PanicException: index out of bounds: the len is 16 but the index is 16
AlexDut commented 3 years ago

Hi,

I also encountered the same issue with XLMRobertaTokenizerFast:

from transformers import XLMRobertaTokenizerFast
tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")
text = "™ngº›é~¦‡™\x1egzšéy¦—™^gz"
tokenizer(text)

throws the following:

thread '<unnamed>' panicked at 'index out of bounds: the len is 40 but the index is 40', /__w/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
<ipython-input-1-cee12263f0ae> in <module>
      2 tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")
      3 text = "™ngº›é~¦‡™\x1egzšéy¦—™^gz"
----> 4 tokenizer(text)

~/my-project/env/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2366                 return_length=return_length,
   2367                 verbose=verbose,
-> 2368                 **kwargs,
   2369             )
   2370 

~/my-project/env/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2436             return_length=return_length,
   2437             verbose=verbose,
-> 2438             **kwargs,
   2439         )
   2440 

~/my-project/env/lib/python3.6/site-packages/transformers/tokenization_utils_fast.py in _encode_plus(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
    470             return_length=return_length,
    471             verbose=verbose,
--> 472             **kwargs,
    473         )
    474 

~/my-project/env/lib/python3.6/site-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
    383             batch_text_or_text_pairs,
    384             add_special_tokens=add_special_tokens,
--> 385             is_pretokenized=is_split_into_words,
    386         )
    387 

PanicException: index out of bounds: the len is 40 but the index is 40

I tested with tokenizers==0.10.0 and it works well, so I suppose a regression was introduced between versions 0.10.0 and 0.10.1.
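A quick hedged check of that hypothesis is to pin the older release (pip install tokenizers==0.10.0) and compare the fast tokenizer against the slow sentencepiece one on the same input; minor slow/fast differences are possible, so compare by eye rather than asserting equality:

from transformers import XLMRobertaTokenizer, XLMRobertaTokenizerFast

slow = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
fast = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")

text = "™ngº›é~¦‡™\x1egzšéy¦—™^gz"
print(slow(text)["input_ids"])
print(fast(text)["input_ids"])  # should no longer panic on tokenizers==0.10.0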

dorost1234 commented 3 years ago

Hi, sorry, I did not realize at first that this was a similar bug. I encountered a similar issue and reported my case in https://github.com/huggingface/tokenizers/issues/654 with a minimal example to reproduce the bug. I really appreciate your help.