julien-c opened 4 years ago
@OP I’m working on it, will share when done. Thanks
The config.json and tokenizer config are missing.
+1, because I'm really confused by the blog post... In particular, I have no idea how to "combine" the tokenizer and dataset implemented in Python with the run_language_modeling.py script used for training, which seems to be intended to be run from a command line rather than from code... I'm admittedly a noob, but seeing how that is done would be extremely helpful.
Check this out, a small example I have created: https://gist.github.com/aditya-malte/2d4f896f471be9c38eb4d723a710768b#file-smallberta_pretraining-ipynb
@julien-c, I have pruned the dataset to the first 200,000 samples so that the notebook may run quickly on Colab, as this is meant to be a quick tutorial on gluing several things together rather than getting SOTA performance. During actual training one could use the full data. Do share it with your network and STAR if found useful 🤓.
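For anyone wanting to do the same pruning on their own corpus, a minimal sketch (`take_first_lines` is a hypothetical helper, not code from the notebook):

```python
def take_first_lines(src_path, dst_path, n):
    """Copy the first n lines of src_path into dst_path.

    Hypothetical helper for pruning a large corpus so a notebook
    runs quickly; the actual notebook may prune differently.
    """
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for i, line in enumerate(src):
            if i >= n:
                break
            dst.write(line)
```

This streams the file instead of loading it, so it also works on corpora that do not fit in memory.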
@aditya-malte Thanks a lot :) I'm still confused though.
Both the original blog post and your notebook use `ByteLevelBPETokenizer`. If I save one of those (and rename the output files like your notebook does), I get two files, `merges.txt` and `vocab.json` (which in my case live in the folder `./tokenizer`). But if I point `model_class.from_pretrained` to the directory containing them (as your notebook does via the `tokenizer_name` flag), I get:

```
OSError: Model name './tokenizer' was not found in tokenizers model name list (<long list of names>). We assumed './tokenizer' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.
```

I originally thought that meant that the `PreTrainedTokenizer` class just isn't compatible with the way `ByteLevelBPETokenizer`s are saved, but apparently it works in your notebook, so... what am I doing wrong? :(
Hi, the easiest solution (and I have also used the same in my Colab notebook) is just to rename the files using `!mv`. I know this is a hack, but it currently seems to work.
@julien-c, this is another issue that I wanted to point out. While renaming does work, it is a bit confusing for the programmer and takes some time to figure out.
Maybe the next release could also check for a tokenizer file format in
I did rename them, as you did in the notebook, but I still get the error... If I interpret the error message correctly, it expects a `vocab.txt`, but your notebook uses `vocab.json` and `merges.txt`, and I don't think either of the two files corresponds to the `vocab.txt` it is looking for...?
I’m not sure, I’ll have to see your code for that. Perhaps it is just an incorrect path.
It's the path to the folder containing the two files `vocab.json` and `merges.txt`, seemingly the same thing your notebook does, so I'm almost positive that's not it...
Do different models use different tokenizers? It's currently set to "bert", not "roberta" as in your notebook, but I'd be very surprised if that made a difference regarding tokenizer file structure? :D
Did you call `from_pretrained` using a `BertTokenizer` object or a `PreTrainedTokenizer` object?
@aditya-malte I'm doing it exactly like the script does... i.e. match on the model name and use
```python
MODEL_CLASSES = {
    "gpt2": (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),
    "openai-gpt": (OpenAIGPTConfig, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
    "bert": (BertConfig, BertForMaskedLM, BertTokenizer),
    "roberta": (RobertaConfig, RobertaForMaskedLM, RobertaTokenizer),
    "distilbert": (DistilBertConfig, DistilBertForMaskedLM, DistilBertTokenizer),
    "camembert": (CamembertConfig, CamembertForMaskedLM, CamembertTokenizer),
}
```
...so in my case, that would be `BertTokenizer`. The relevant part of my code is this:
```python
def trainTokenizer(self, output_dir: str, file: str, tokenizer_class, vocab_size: int = 7000, min_frequency: int = 5):
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(files=[file], vocab_size=vocab_size, min_frequency=min_frequency, special_tokens=[
        "<s>",
        "<pad>",
        "</s>",
        "<unk>",
        "<mask>"
    ])
    tokenizer._tokenizer.post_processor = BertProcessing(
        ("</s>", tokenizer.token_to_id("</s>")),
        ("<s>", tokenizer.token_to_id("<s>")),
    )
    tokenizer.enable_truncation(max_length=512)
    if not os.path.exists(output_dir + "/tokenizer"):
        os.makedirs(output_dir + "/tokenizer")
    tokenizer.save(output_dir + "/tokenizer", "")
    os.rename(output_dir + "/tokenizer/-merges.txt", output_dir + "/tokenizer/merges.txt")
    os.rename(output_dir + "/tokenizer/-vocab.json", output_dir + "/tokenizer/vocab.json")
    return tokenizer_class.from_pretrained(output_dir + "/tokenizer", cache_dir=output_dir + "/cache")
```
Hmm, that’s strange. What are your versions of Transformers and Tokenizers? Why use a `cache_dir`, btw, if you’re not downloading from S3?
Freshly installed from a freshly upgraded version of pip on Thursday ;)
Regarding `cache_dir`: no idea, I just copied that from the script to see what ends up in there :D
Wait, so you’re not running it on Colab, with all other things remaining the same? Then I think it might be an issue with your environment. Also, just yesterday (or the day before) I think there was a major change in the Transformers library.
~Heh, so uninstalling with pip and pulling the git repo directly seems to have solved it. Thanks :)~ Huh, no, it actually didn't, I just overlooked that I commented out the offending line earlier :( Problem still stands...
Okay, update: the problem really was the model type.

tokenization_roberta.py:

```python
VOCAB_FILES_NAMES = {
    "vocab_file": "vocab.json",
    "merges_file": "merges.txt",
}
```

tokenization_bert.py:

```python
VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}
```

...I have no idea why they use different file conventions, and in particular why bert doesn't allow one to use the files from `ByteLevelBPETokenizer`... :/
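Given those two dicts, a missing-file check is straightforward. A minimal sketch (the expected-file tables are copied from the two modules quoted above; `check_tokenizer_dir` is a hypothetical helper, not part of transformers):

```python
import os

# Expected vocabulary files, copied from tokenization_roberta.py
# and tokenization_bert.py as quoted above.
VOCAB_FILES_BY_MODEL = {
    "roberta": {"vocab_file": "vocab.json", "merges_file": "merges.txt"},
    "bert": {"vocab_file": "vocab.txt"},
}

def check_tokenizer_dir(path, model_type):
    """Return the files the given model type expects but which are
    missing from `path` (hypothetical helper, not a transformers API)."""
    expected = VOCAB_FILES_BY_MODEL[model_type].values()
    return [f for f in expected if not os.path.exists(os.path.join(path, f))]
```

A directory holding `vocab.json` and `merges.txt` therefore passes for `"roberta"` but fails for `"bert"`, which is exactly the `OSError` above.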
Great!
I'm currently writing class abstractions to assemble a model in code that uses the right tokenizer class depending on the model_type. My goal is to have something like this (yes, I'm heavily biased towards object orientation):
```python
# for a new model
tokenizer = trainTokenizer(data_file, model_type="<somemodel>")
dataset = <something>
model = ModularModel(tokenizer, out_file="some/path", model_type="<somemodel>", non_default_parameters={})
model.train(train_dataset=dataset, eval_dataset=None)
...

# for a pre-saved model
loaded_model = ModularModel(out_file="some/path")
...
```
Would that be useful to anyone other than me?
I so strongly agree with you, and I too feel that the community should go in an OOP direction (rather than the CLI way we’re all using now). Do share your code.
@aditya-malte Here it is:
Classes: https://pastebin.com/71N3gp7C - mostly copy-pasted from the run_language_modeling.py script, but with all shell parameters replaced by a dict with all entries optional.
Example usage: https://pastebin.com/SQUf61aD
The default parameters used are questionable, for sure. For camembert, I couldn't find out what kind of tokenizer can generate the files the `PreTrainedTokenizer` subclass expects, so that one won't work, but afaik all the other ones work out of the box.
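The dict-of-optional-parameters idea can be sketched roughly like this (`ModularModel` and the default values here are illustrative stand-ins, not the actual pastebin code):

```python
# Illustrative defaults standing in for the CLI flags of
# run_language_modeling.py; the real script has many more.
DEFAULTS = {
    "no_cuda": True,
    "block_size": 512,
    "num_train_epochs": 1,
}

class ModularModel:
    """Toy sketch: every shell parameter becomes an optional dict entry,
    merged over the defaults at construction time."""
    def __init__(self, out_file, non_default_parameters=None):
        self.out_file = out_file
        # Later dict wins, so user-supplied entries override defaults.
        self.args = {**DEFAULTS, **(non_default_parameters or {})}
```

The merge via `{**DEFAULTS, **overrides}` is what makes every entry optional: callers only spell out what differs from the defaults.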
@aditya-malte sorry, found a slight error: Line 686 needs to be

```python
sorted_checkpoints = self._sorted_checkpoints(output_dir + "/out")
```

in order for continuing from a checkpoint to work.
@Jazzpirate This is awesome! The only thing that's a bit unfortunate (though probably operator error) is that it only seems to run on CPU for me. Is there a way to specify GPU?
Gulp... spoke too soon: `no_cuda = False`... duh
@jbmaxwell you can pass a dictionary with the CLI parameters you wish to use to the `ModularModel` constructor. By default, it sets `no_cuda` to `True`, because I couldn't get CUDA to run on my machine without compiling "old" Linux kernels myself :/
One more "bug" I found: if you want to use evaluation during training, make sure to replace the calls to `evaluate` in the `train` and `_train` methods with `_evaluate` instead.
I honestly only did this to the point where I could use it for my own purposes. If I can get something out of this, I might be inclined to make a fork and pull-request with a more stable script at some point, but I honestly have no real idea what I'm doing (yet?) :D
The huggingface library in general would massively benefit from keeping things in the code rather than an unholy, messy blend of CLI (a bit like how fast-BERT does it: https://github.com/kaushaltrivedi/fast-bert).
-- Dan Ofer
As a new user to the Transformers/Tokenizers library, I had trouble following the blogpost, too. Following this thread for a clean notebook which I can follow.
What I want to do is train from scratch a language model with a custom architecture, e.g., I want to play around with the BERT layers.
Suggestion: use the wikitext-2 dataset?
Hi, just change the config variable in this Colab notebook to adjust the number of layers: https://gist.github.com/aditya-malte/2d4f896f471be9c38eb4d723a710768b#file-smallberta_pretraining-ipynb Thanks
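For context, the number of layers is one field in the model's `config.json`; a RoBERTa-style fragment might look like this (the values are illustrative, not necessarily the notebook's):

```json
{
  "model_type": "roberta",
  "num_hidden_layers": 6,
  "num_attention_heads": 12,
  "hidden_size": 768,
  "intermediate_size": 3072,
  "vocab_size": 52000,
  "max_position_embeddings": 514
}
```

Shrinking `num_hidden_layers` (and, if desired, `hidden_size`) is the usual way to make a small, fast-to-train variant.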
@Jazzpirate RoBERTa uses a Byte-level BPE tokenizer (similar to what GPT-2 uses) whereas BERT uses a Wordpiece tokenizer.
A WordPiece tokenizer is based only on a set of tokens, starting from whole words and decomposing them into smaller subword units, whereas the BPE algorithm merges tokens together according to merge pairs which are stored in a separate file; hence the serialization formats are different.
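A toy pure-Python sketch of that difference (illustrative only, not the actual algorithms in the tokenizers library): BPE needs the ordered merge rules as extra state (what `merges.txt` stores), while WordPiece only needs a token inventory (what `vocab.txt` stores).

```python
def bpe_encode(word, merges):
    """Apply ordered BPE merge rules to a word's characters.
    `merges` is the extra state that merges.txt stores."""
    symbols = list(word)
    for a, b in merges:  # merge order matters
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

def wordpiece_encode(word, vocab):
    """Greedy longest-match-first lookup; only needs a vocab set,
    no merge rules. '##' marks a word-internal piece."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces
```

For example, `bpe_encode("hug", [("h", "u"), ("hu", "g")])` reaches `["hug"]` only because the two merge rules exist; with just a vocabulary, `wordpiece_encode("hugging", {"hug", "##ging"})` instead greedily matches `["hug", "##ging"]`.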
@aditya-malte Thanks for sharing, this looks good and contains great ideas.
A few comments:
@julien-c Would be awesome to have the blogpost as a notebook, esp. for someone new to the HuggingFace ecosystem.
First version is up here, would love to get your feedback: https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
How do I get the hidden states of the last layer from this custom model? Would the input be just the words, without any mask?
> First version is up here, would love to get your feedback: https://github.com/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb
Thanks for the tutorial, although I am unable to load the dataset into RAM. Is there any other way to load the data?
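One workaround, assuming a corpus with one example per line, is to stream the file lazily instead of reading it whole; a minimal sketch (`iter_examples` is a hypothetical helper, not part of the script):

```python
def iter_examples(path, batch_size):
    """Yield lists of `batch_size` stripped lines without ever
    loading the whole file into memory (lazy generator sketch)."""
    batch = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # final, possibly short batch
        yield batch
```

Only one batch of lines is ever held in memory at a time, so the corpus size no longer has to fit in RAM.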
Hello,
thanks for your tutorial. Could you please advise on the minimum VRAM required for running this fine-tuning? (I would like to do it on my local machine, which has 6GB of VRAM; I plan to buy a bigger card but would like to know the minimum requirements for the task.)
cc @srush!