guillaume-be / rust-bert

Rust native ready-to-use NLP pipelines and transformer-based models (BERT, DistilBERT, GPT2,...)
https://docs.rs/crate/rust-bert
Apache License 2.0

NER with BERT-based Model: Unexpected Panic During Prediction #455

Open mmich-pl opened 5 months ago

mmich-pl commented 5 months ago

Description

I am working on a university assignment that involves extracting Named Entities (NE) from Polish text using a BERT-based model. I chose the FastPDN model from Hugging Face (clarin-pl/FastPDN) and prepared it using the utils/convert_model.py script.

I created a TokenClassificationConfig based on one of the examples. The config and special_tokens_map files were downloaded from Hugging Face, and so was vocab.json, except that I extracted all of its keys and saved them to a .txt file, one per line (see the sketch after the config snippet below).

  let input = ["Nazywam się Jan Kowalski i mieszkam we Wrocławiu."];

  let config = TokenClassificationConfig::new(
          ModelType::Bert,
          ModelResource::Torch(Box::new(LocalResource::from(PathBuf::from(model_path)))),
          LocalResource::from(PathBuf::from(model_config_path)),
          LocalResource::from(PathBuf::from(vocab_path)),
          Some(LocalResource::from(PathBuf::from(merge_path))),  //merges resource only relevant with ModelType::Roberta
          false, //lowercase
          false,
          None,
          LabelAggregationOption::Mode,
      );
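
For reference, the vocab.json → vocab.txt conversion mentioned above was roughly the following (a minimal sketch, not my exact code; it assumes vocab.json is a flat token-to-id map, and sorts by id because the line number in vocab.txt is interpreted as the token id):

    use std::collections::HashMap;
    use std::fs;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let raw = fs::read_to_string("vocab.json")?;
        // Flat token -> id map, as in a standard BERT vocab.json (assumption).
        let vocab: HashMap<String, u64> = serde_json::from_str(&raw)?;
        let mut entries: Vec<(String, u64)> = vocab.into_iter().collect();
        // Sort by id so each token's line number in vocab.txt equals its id.
        entries.sort_by_key(|(_, id)| *id);
        let tokens: Vec<String> = entries.into_iter().map(|(tok, _)| tok).collect();
        fs::write("vocab.txt", tokens.join("\n"))?;
        Ok(())
    }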

Initially, I encountered issues with tokenization when using the BertTokenizer. The output tokens did not match the expected format, leading to incorrect predictions when using the predict_full_entities method.

    let tokenizer = BertTokenizer::from_file_with_special_token_mapping(
        vocab_path,
        false, // lower_case
        false, // strip_accents
        special_tokens,
    )?;
    println!("{:?}", tokenizer.tokenize(input[0]));

    let ner_model = NERModel::new_with_tokenizer(config, TokenizerOption::Bert(tokenizer))?;
    let output = ner_model.predict_full_entities(&input);
    for entity in output {
        println!("{entity:?}");
    }

As output I got the tokens below and an empty entity list; presumably most words fall back to <unk> because the model's vocabulary entries do not match what BERT's WordPiece algorithm expects:

["<unk>", "się", "Jan", "<unk>", "i", "<unk>", "we", "<unk>", "."]
[]

Upon switching to a tokenizer created from a tokenizer.json file (using TokenizerOption::from_hf_tokenizer_file), the tokenization improved significantly. The tokens now correctly represent the words and punctuation in the input text.

    let tok_opt = TokenizerOption::from_hf_tokenizer_file(tokenizer_path, special_tokens).unwrap();
    println!("{:?}", tok_opt.tokenize(input[0]));
    let ner_model = NERModel::new_with_tokenizer(config, tok_opt)?;

This prints:

["Nazy", "wam</w>", "się</w>", "Jan</w>", "Kowalski</w>", "i</w>", "mieszkam</w>", "we</w>", "Wrocławiu</w>", ".</w>"]

But now I encountered a runtime panic during the prediction phase:

thread 'main' panicked at <path>/rust-bert/src/pipelines/token_classification.rs:1113:51:
slice index starts at 50 but ends at 49
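
As far as I can tell, that message is what the standard library emits when a range slice has start > end, so something in the pipeline appears to compute an inverted token span. A minimal illustration of the panic class (not the library's actual code):

    fn main() {
        let tokens = vec![0u8; 60];
        let (start, end) = (50usize, 49usize); // inverted range: start > end
        let _ = &tokens[start..end]; // panics: "slice index starts at 50 but ends at 49"
    }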

Environment:

I would be grateful if you could help.

EDIT: trying to use BertTokenizer was a complete mistake on my part; the model apparently uses a customized tokenizer that is slightly different from the base BERT one.
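
In hindsight this is easy to check up front: the </w> suffixes in the working output already hint at a BPE-style model rather than BERT's WordPiece. A quick way to inspect the tokenizer.json shipped with a model (a sketch assuming the standard Hugging Face tokenizers schema):

    use std::fs;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let raw = fs::read_to_string("tokenizer.json")?;
        let json: serde_json::Value = serde_json::from_str(&raw)?;
        // Base BERT reports "WordPiece" here; anything else explains why
        // BertTokenizer mangled the input.
        println!("tokenizer model type: {}", json["model"]["type"]);
        Ok(())
    }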