Closed nimanthadilz closed 7 months ago
Sorry for coming back late on this! The BERT tokenizer has a similar process: it uses `WordPiece` with `continuing_subword_prefix="##"`, which is probably what you are looking for, no?
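To make the continuation-prefix idea concrete, here is a minimal pure-Python sketch of the convention applied to camelCase identifiers. It does not use the tokenizers API (WordPiece itself splits by vocabulary lookup, not by camelCase boundaries). The helper names `split_identifier` and `merge_tokens` and the regular expression are assumptions made for illustration only.

```python
import re

# Marker for "this piece continues the previous token". "##" mirrors the
# BERT WordPiece convention; any symbol that is not a valid Java token works.
CONTINUATION_PREFIX = "##"

# Hypothetical camelCase splitter: an acronym run, a (capitalized) word,
# a digit run, or a run of any other characters such as "_" or "$".
_CAMEL_PIECE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+|[^A-Za-z\d]+")


def split_identifier(identifier):
    """Split one identifier and mark every piece after the first."""
    pieces = _CAMEL_PIECE.findall(identifier)
    return [
        piece if i == 0 else CONTINUATION_PREFIX + piece
        for i, piece in enumerate(pieces)
    ]


def merge_tokens(tokens):
    """Undo split_identifier: glue marked pieces back onto the previous token."""
    merged = []
    for token in tokens:
        if token.startswith(CONTINUATION_PREFIX) and merged:
            merged[-1] += token[len(CONTINUATION_PREFIX):]
        else:
            merged.append(token)
    return merged


print(split_identifier("getAge"))
print(merge_tokens(["int", "get", "##Age", "(", ")"]))
```

Because only continuation pieces carry the marker, the first piece of an identifier looks like any other token and the original text can be restored in a single pass over the token list.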
Hi, I am using the tokenizers library to build a tokenizer that can be used to tokenize Java code into valid Java tokens. This tokenizer will be used in a transformer model which can fix bugs in Java code.
So far, I have used the javalang library to identify valid Java tokens. I've created a custom pre-tokenizer which uses javalang to split the input into valid Java tokens. As the `model` of the tokenizer, I've used `WordLevel` since I don't need subword tokenization.
This tokenizer can now tokenize a text of Java code into valid Java tokens. For example:
I need to split camelCase identifiers like `getAge` (method names, variable names) into separate tokens. When I do that, I have to add some symbol (like `"#"`) to mark that the split tokens were originally one token, so that I can concatenate them later.
But I can't find a way to do this in my custom pre-tokenizer, where the input arrives as a `NormalizedString`. I tried to add a symbol when splitting camelCase tokens, but it didn't work. Is there a way to achieve that, or is there a better way to do this than what I've done?
My custom pre-tokenizer is below: