I am working on a university assignment that involves extracting Named Entities (NE) from Polish text with a BERT-based model. I chose the FastPDN model from Hugging Face (clarin-pl/FastPDN) and prepared it using the utils/convert_model.py script.
I created a TokenClassificationConfig based on one of the examples (the config and special_tokens_map files are downloaded from Hugging Face as-is; vocab.json is too, except I extracted all of its keys and saved them to a txt file, one per line).
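Roughly how I produced the txt vocab (a minimal sketch, not my exact script; it assumes vocab.json is a flat token-to-id map and depends on serde_json; file names are placeholders). One detail worth double-checking: plain-text vocab loaders treat the line number as the token id, so the keys have to be written in id order:
use std::collections::HashMap;
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // vocab.json is assumed to be a flat {"token": id, ...} map.
    let raw = fs::read_to_string("vocab.json")?;
    let vocab: HashMap<String, u64> = serde_json::from_str(&raw)?;
    // Sort by id so that the line number in vocab.txt equals the token id,
    // which is what line-based wordpiece vocab loaders expect.
    let mut entries: Vec<(String, u64)> = vocab.into_iter().collect();
    entries.sort_by_key(|(_, id)| *id);
    let lines: Vec<String> = entries.into_iter().map(|(token, _)| token).collect();
    fs::write("vocab.txt", lines.join("\n"))?;
    Ok(())
}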
let input = ["Nazywam się Jan Kowalski i mieszkam we Wrocławiu."]; // "My name is Jan Kowalski and I live in Wrocław."
let config = TokenClassificationConfig::new(
ModelType::Bert,
ModelResource::Torch(Box::new(LocalResource::from(PathBuf::from(model_path)))),
LocalResource::from(PathBuf::from(model_config_path)),
LocalResource::from(PathBuf::from(vocab_path)),
Some(LocalResource::from(PathBuf::from(merge_path))), // merges resource, only relevant for ModelType::Roberta
false, // lower_case
false, // strip_accents
None,  // add_prefix_space
LabelAggregationOption::Mode,
);
Initially, I encountered issues with tokenization when using the BertTokenizer. The output tokens did not match the expected format, leading to incorrect predictions when using the predict_full_entities method.
let tokenizer = BertTokenizer::from_file_with_special_token_mapping(vocab_path, false, false, special_tokens)?;
println!("{:?}", tokenizer.tokenize(input[0]));
let ner_model = NERModel::new_with_tokenizer(config, TokenizerOption::Bert(tokenizer))?;
let output = ner_model.predict_full_entities(&input);
for entity in output {
println!("{entity:?}");
}
Upon switching to a tokenizer created from a tokenizer.json file (using TokenizerOption::from_hf_tokenizer_file), the tokenization improved significantly. The tokens now correctly represent the words and punctuation in the input text.
let tok_opt = TokenizerOption::from_hf_tokenizer_file(tokenizer_path, special_tokens).unwrap();
println!("{:?}", tok_opt.tokenize(input[0]));
let ner_model = NERModel::new_with_tokenizer(config, tok_opt)?;
However, I then hit a runtime panic during the prediction phase:
thread 'main' panicked at <path>/rust-bert/src/pipelines/token_classification.rs:1113:51:
slice index starts at 50 but ends at 49
Environment:
Rust version: 1.77.2
PyTorch version: 2.2.0
tch version: v0.15.0
rust-bert: local copy of the repository (current version of the main branch)
I would be grateful if you could help.
EDIT: trying to use BertTokenizer was a mistake on my part: the model apparently uses a customized tokenizer that differs slightly from the base BERT one.
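For anyone hitting the same thing, printing both tokenizations side by side makes the mismatch obvious (a sketch reusing the vocab_path / tokenizer_path / special_tokens variables from above; any divergence between the two lines confirms the custom tokenizer differs from plain BERT wordpiece):
// Build both tokenizers from the same downloaded files.
let bert_tokenizer = BertTokenizer::from_file_with_special_token_mapping(vocab_path, false, false, special_tokens)?;
let hf_tokenizer = TokenizerOption::from_hf_tokenizer_file(tokenizer_path, special_tokens)?;
// Compare their output on the same input sentence.
println!("BertTokenizer: {:?}", bert_tokenizer.tokenize(input[0]));
println!("HF tokenizer:  {:?}", hf_tokenizer.tokenize(input[0]));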