julien-c opened 4 years ago
@OP I’m working on it, will share when done. Thanks
The config.json and tokenizer config are missing.
+1, because I'm really confused by the blog post... In particular, I have no idea how to "combine" the tokenizer and dataset implemented in Python with the run_language_modeling.py script used for training, which seems to be intended to be run from a command line rather than from code... I'm admittedly a noob, but seeing how that is done would be extremely helpful.
Check this out, a small example I have created: https://gist.github.com/aditya-malte/2d4f896f471be9c38eb4d723a710768b#file-smallberta_pretraining-ipynb
@julien-c, I have pruned the dataset to the first 200,000 samples so that the notebook may run quickly on Colab, as this is meant to be a quick tutorial on gluing several things together rather than getting SOTA performance. During actual training one could use the full data. Do share it with your network and STAR if found useful 🤓.
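For anyone wanting to do the same pruning on their own corpus, a minimal sketch (`take_first_lines` is a hypothetical helper, not code from the notebook):

```python
def take_first_lines(src_path, dst_path, n):
    """Copy the first n lines of src_path into dst_path.

    Hypothetical helper for pruning a large corpus so a notebook
    runs quickly; the actual notebook may prune differently.
    """
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for i, line in enumerate(src):
            if i >= n:
                break
            dst.write(line)
```

This streams the file instead of loading it, so it also works on corpora that do not fit in memory.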
@aditya-malte Thanks a lot :) I'm still confused though.
Both the original blog post and your notebook use `ByteLevelBPETokenizer`. If I save one of those (and rename the output files like your notebook does), I get two files, `merges.txt` and `vocab.json` (which in my case live in the folder `./tokenizer`). But if I point `model_class.from_pretrained` to the directory containing them (as your notebook does via the `tokenizer_name` flag), I get:

```
OSError: Model name './tokenizer' was not found in tokenizers model name list (<long list of names>). We assumed './tokenizer' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.
```

I originally thought that meant that the `PreTrainedTokenizer` class just isn't compatible with the way `ByteLevelBPETokenizer`s are saved, but apparently it works in your notebook, so... what am I doing wrong? :(
Hi, the easiest solution (and I have also used the same in my Colab notebook) is just to rename the files using `!mv`. I know this is a hack, but it currently seems to work.
@julien-c, this is another issue that I wanted to point out. While renaming does work, it is a bit confusing for the programmer and takes some time to figure out.
Maybe the next release could also check for a tokenizer file format in
I did rename them, as you did in the notebook, but I still get the error... If I interpret the error message correctly, it expects a `vocab.txt`, but your notebook uses `vocab.json` and `merges.txt`, and I don't think either of the two files corresponds to the `vocab.txt` it is looking for...?
I’m not sure, I’ll have to see your code for that. Perhaps it is just an incorrect path.
It's the path to the folder containing the two files `vocab.json` and `merges.txt`, seemingly the same thing your notebook does, so I'm almost positive that's not it...
Do different models use different tokenizers? It's currently set to "bert", not "roberta" as in your notebook, but I'd be very surprised if that made a difference regarding tokenizer file structure? :D
Did you call `from_pretrained` using a `BertTokenizer` object or a `PreTrainedTokenizer` object?
@aditya-malte I'm doing it exactly like the script does... i.e. match on the model name and use
```python
MODEL_CLASSES = {
    "gpt2": (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
    "openai-gpt": (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
    "bert": (BertConfig, BertForMaskedLM, BertTokenizer),
    "roberta": (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer),
    "distilbert": (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer),
    "camembert": (CamembertConfig, CamembertForMaskedLM, CamembertTokenizer),
}
```
...so in my case, that would be `BertTokenizer`. The relevant part of my code is this:
```python
def trainTokenizer(self, output_dir: str, file: str, tokenizer_class, vocab_size: int = 7000, min_frequency: int = 5):
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(files=[file], vocab_size=vocab_size, min_frequency=min_frequency, special_tokens=[
        "<s>",
        "<pad>",
        "</s>",
        "<unk>",
        "<mask>"
    ])
    tokenizer._tokenizer.post_processor = BertProcessing(
        ("</s>", tokenizer.token_to_id("</s>")),
        ("<s>", tokenizer.token_to_id("<s>")),
    )
    tokenizer.enable_truncation(max_length=512)
    if not os.path.exists(output_dir + "/tokenizer"):
        os.makedirs(output_dir + "/tokenizer")
    tokenizer.save(output_dir + "/tokenizer", "")
    os.rename(output_dir + "/tokenizer/-merges.txt", output_dir + "/tokenizer/merges.txt")
    os.rename(output_dir + "/tokenizer/-vocab.json", output_dir + "/tokenizer/vocab.json")
    return tokenizer_class.from_pretrained(output_dir + "/tokenizer", cache_dir=output_dir + "/cache")
```
Hmm, that’s strange. What are your versions of Transformers and Tokenizers? Why use a `cache_dir`, btw, if you’re not downloading from S3?
Freshly installed from a freshly upgraded version of pip on Thursday ;)
Regarding `cache_dir`: no idea, I just copied that from the script to see what ends up in there :D
Wait, so you’re not running it on Colab, with all other things remaining the same? Then I think it might be an issue with your environment. Also, just yesterday (or the day before) I think there was a major change in the Transformers library.
~Heh, so uninstalling with pip and pulling the git repo directly seems to have solved it. Thanks :)~ Huh, no, it actually didn't, I just overlooked that I commented out the offending line earlier :( Problem still stands...
Okay, update: the problem really was the model type.

tokenization_roberta.py:

```python
VOCAB_FILES_NAMES = {
    "vocab_file": "vocab.json",
    "merges_file": "merges.txt",
}
```

tokenization_bert.py:

```python
VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}
```

...I have no idea why they use different file conventions, and in particular why bert doesn't allow one to use the files from `ByteLevelBPETokenizer`... :/
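Given those two dicts, a missing-file check is straightforward. A minimal sketch (the expected-file tables are copied from the two modules quoted above; `check_tokenizer_dir` is a hypothetical helper, not part of transformers):

```python
import os

# Expected vocabulary files, copied from tokenization_roberta.py
# and tokenization_bert.py as quoted above.
VOCAB_FILES_BY_MODEL = {
    "roberta": {"vocab_file": "vocab.json", "merges_file": "merges.txt"},
    "bert": {"vocab_file": "vocab.txt"},
}

def check_tokenizer_dir(path, model_type):
    """Return the files the given model type expects but which are
    missing from `path` (hypothetical helper, not a transformers API)."""
    expected = VOCAB_FILES_BY_MODEL[model_type].values()
    return [f for f in expected if not os.path.exists(os.path.join(path, f))]
```

A directory holding `vocab.json` and `merges.txt` therefore passes for `"roberta"` but fails for `"bert"`, which is exactly the `OSError` above.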
Great!
I'm currently writing class abstractions to assemble a model in code that uses the right tokenizer class depending on the model_type. My goal is to have something like this (yes, I'm heavily biased towards object orientation):
```python
# for a new model
tokenizer = trainTokenizer(data_file, model_type="<somemodel>")
dataset = <something>
model = ModularModel(tokenizer, out_file="some/path", model_type="<somemodel>", non_default_parameters={})
model.train(train_dataset=dataset, eval_dataset=None)
...

# for a pre-saved model
loaded_model = ModularModel(out_file="some/path")
...
```
Would that be useful to anyone other than me?
I so strongly agree with you, and I too feel that the community should go in an OOP direction (rather than the CLI way we’re all using now). Do share your code.
@aditya-malte Here it is:
Classes: https://pastebin.com/71N3gp7C - mostly copy-pasted from the run_language_modeling.py script, but with all shell parameters replaced by a dict with all entries optional.
Example usage: https://pastebin.com/SQUf61aD
The default parameters used are questionable, for sure. For camembert, I couldn't find out what kind of tokenizer can generate the files the `PreTrainedTokenizer` subclass expects, so that one won't work, but afaik all the other ones work out of the box.
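The dict-of-optional-parameters idea can be sketched roughly like this (`ModularModel` and the default values here are illustrative stand-ins, not the actual pastebin code):

```python
# Illustrative defaults standing in for the CLI flags of
# run_language_modeling.py; the real script has many more.
DEFAULTS = {
    "no_cuda": True,
    "block_size": 512,
    "num_train_epochs": 1,
}

class ModularModel:
    """Toy sketch: every shell parameter becomes an optional dict entry,
    merged over the defaults at construction time."""
    def __init__(self, out_file, non_default_parameters=None):
        self.out_file = out_file
        # Later dict wins, so user-supplied entries override defaults.
        self.args = {**DEFAULTS, **(non_default_parameters or {})}
```

The merge via `{**DEFAULTS, **overrides}` is what makes every entry optional: callers only spell out what differs from the defaults.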
@aditya-malte sorry, found a slight error: Line 686 needs to be

```python
sorted_checkpoints = self._sorted_checkpoints(output_dir + "/out")
```

in order for continuing from a checkpoint to work.
@Jazzpirate This is awesome! The only thing that's a bit unfortunate (though probably operator error) is that it only seems to run on CPU for me. Is there a way to specify GPU?
Gulp... spoke too soon: `no_cuda = False`... duh
@jbmaxwell you can pass a dictionary with the CLI parameters you wish to use to the `ModularModel` constructor. By default, it sets `no_cuda` to `True`, because I couldn't get CUDA to run on my machine without compiling "old" Linux kernels myself :/
One more "bug" I found: if you want to use evaluation during training, make sure to replace the calls to `evaluate` in the `train` and `_train` methods with `_evaluate` instead.
I honestly only did this to the point where I could use it for my own purposes. If I can get something out of this, I might be inclined to make a fork and pull-request with a more stable script at some point, but I honestly have no real idea what I'm doing (yet?) :D
The huggingface library in general would massively benefit from keeping things in the code rather than an unholy, messy blend of CLI (a bit like how fast-BERT does it: https://github.com/kaushaltrivedi/fast-bert).
-- Dan Ofer
As a new user to the Transformers/Tokenizers library, I had trouble following the blogpost, too. Following this thread for a clean notebook which I can follow.
What I want to do is train from scratch a language model with a custom architecture, e.g., I want to play around with the BERT layers.
Suggestion: use the wikitext-2 dataset?
Hi, just change the config variable in this Colab notebook to adjust the number of layers: https://gist.github.com/aditya-malte/2d4f896f471be9c38eb4d723a710768b#file-smallberta_pretraining-ipynb Thanks
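For context, the number of layers is one field in the model's `config.json`; a RoBERTa-style fragment might look like this (the values are illustrative, not necessarily the notebook's):

```json
{
  "model_type": "roberta",
  "num_hidden_layers": 6,
  "num_attention_heads": 12,
  "hidden_size": 768,
  "intermediate_size": 3072,
  "vocab_size": 52000,
  "max_position_embeddings": 514
}
```

Shrinking `num_hidden_layers` (and, if desired, `hidden_size`) is the usual way to make a small, fast-to-train variant.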
@Jazzpirate RoBERTa uses a Byte-level BPE tokenizer (similar to what GPT-2 uses) whereas BERT uses a Wordpiece tokenizer.
A WordPiece tokenizer is based only on a set of tokens, starting from whole words and decomposing them into smaller subword units, whereas the BPE algorithm merges tokens together according to merge pairs which are stored in a separate file; hence the serialization formats are different.
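A toy pure-Python sketch of that difference (illustrative only, not the actual algorithms in the tokenizers library): BPE needs the ordered merge rules as extra state (what `merges.txt` stores), while WordPiece only needs a token inventory (what `vocab.txt` stores).

```python
def bpe_encode(word, merges):
    """Apply ordered BPE merge rules to a word's characters.
    `merges` is the extra state that merges.txt stores."""
    symbols = list(word)
    for a, b in merges:  # merge order matters
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

def wordpiece_encode(word, vocab):
    """Greedy longest-match-first lookup; only needs a vocab set,
    no merge rules. '##' marks a word-internal piece."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces
```

For example, `bpe_encode("hug", [("h", "u"), ("hu", "g")])` reaches `["hug"]` only because the two merge rules exist; with just a vocabulary, `wordpiece_encode("hugging", {"hug", "##ging"})` instead greedily matches `["hug", "##ging"]`.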
@aditya-malte Thanks for sharing, this looks good and contains great ideas.
A few comments:
@julien-c Would be awesome to have the blogpost as a notebook, esp. for someone new to the HuggingFace ecosystem.
First version is up here, would love to get your feedback: https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
How do I get the hidden states of the last layer from this custom model? Would the input be just the words, without any mask?
> First version is up here, would love to get your feedback: https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
Thanks for the tutorial, although I am unable to load the dataset into RAM. Is there any other way to load the data?
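One workaround, assuming a corpus with one example per line, is to stream the file lazily instead of reading it whole; a minimal sketch (`iter_examples` is a hypothetical helper, not part of the script):

```python
def iter_examples(path, batch_size):
    """Yield lists of `batch_size` stripped lines without ever
    loading the whole file into memory (lazy generator sketch)."""
    batch = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # final, possibly short batch
        yield batch
```

Only one batch of lines is ever held in memory at a time, so the corpus size no longer has to fit in RAM.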
Hello,
thanks for your tutorial. Could you please advise on the minimum VRAM required for running this fine-tuning? (I would like to do it on my local machine, which has 6GB of VRAM; I plan to buy a bigger card but would like to know the minimum requirements for the task.)
cc @srush!