
SentencePiece Models for Chinese Models Missing? #58

Open simonc256 opened 4 years ago

simonc256 commented 4 years ago

It seems that the SPM model files are missing from the tar files for Chinese models.

Danny-Google commented 4 years ago

For the Chinese models, we use the WordPiece model provided by Jacob, since SentencePiece gave worse performance on reading comprehension tasks for Chinese.

beamind commented 4 years ago

> For the Chinese models, we use the WordPiece model provided by Jacob, since SentencePiece gave worse performance on reading comprehension tasks for Chinese.

Hi, could you share the WordPiece model used in the ALBERT Chinese model? Thanks!

Danny-Google commented 4 years ago

The vocab file is in the same folder as the model. For WordPiece, you only need the vocab file, not the SentencePiece model. You can skip the model part of the input.
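
For reference, a minimal sketch of building a tokenizer from the vocab file alone, assuming the repo's tokenization.FullTokenizer(vocab_file, do_lower_case, spm_model_file) signature; the vocab path below is a placeholder for wherever the Chinese model was extracted:

```python
# Minimal sketch, assuming ALBERT's tokenization.py: FullTokenizer falls
# back to WordPiece when no SentencePiece model is supplied.
import tokenization  # or `from albert import tokenization`, depending on layout

tokenizer = tokenization.FullTokenizer(
    vocab_file="albert_base_zh/vocab_chinese.txt",  # placeholder path
    do_lower_case=True,
    spm_model_file=None)  # no SPM model: WordPiece tokenization is used

tokens = tokenizer.tokenize("机器学习很有趣")
ids = tokenizer.convert_tokens_to_ids(tokens)
```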

penut85420 commented 4 years ago

@beamind I think it means you have to use --vocab_file instead of --spm_model_file.

But @Danny-Google, I encounter a problem when I make this change in run_squad_v1.py: when it reaches squad_utils.convert_examples_to_features, an SPM model must be used in tokenization.encode_pieces, but a tokenizer built from a vocab file doesn't have an SPM model.

I'm not familiar with SPM, so I have no idea how to modify it at the moment.
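
One possible direction, sketched under the assumption that FullTokenizer only sets its sp_model attribute when an SPM file is given (as in the repo's tokenization.py); encode_text is a hypothetical helper, not part of squad_utils:

```python
# Hypothetical helper: dispatch between SentencePiece and WordPiece
# depending on how the tokenizer was built, so the SQuAD feature code
# does not require an SPM model.
import tokenization

def encode_text(tokenizer, text):
    sp_model = getattr(tokenizer, "sp_model", None)  # set only when an SPM file was given
    if sp_model is not None:
        # SentencePiece path used by the stock squad_utils code.
        return tokenization.encode_pieces(sp_model, text)
    # Vocab-file-only tokenizer: use plain WordPiece tokenization instead.
    return tokenizer.tokenize(text)
```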

008karan commented 4 years ago

@Danny-Google @beamind @penut85420 @0x0539 Were you able to solve it? I want to use ALBERT Chinese, and I am using the Hugging Face pipeline for sequence classification, which gives an error because spiece.model is missing:

```
We assumed '/home/transformers/albert_base_zh/' was a path or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.
```

Danny-Google commented 4 years ago

@beamind Currently, squad_utils is meant to be used only for the SQuAD dataset. If you use the Chinese models, you may want to take a look at the CLUE code (https://github.com/CLUEbenchmark/CLUE/tree/master/baselines/models/albert).

@008karan The Chinese models use WordPiece, so you want to disable the SentencePiece part of the code.
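
For illustration, a hedged sketch of a common community workaround: since the Chinese checkpoints ship a WordPiece vocab.txt instead of spiece.model, load the tokenizer with transformers' BertTokenizer while keeping the ALBERT weights. Whether this works depends on your transformers version and the files present in the directory:

```python
# Sketch: use a WordPiece tokenizer (BertTokenizer) in place of the
# SentencePiece-based AlbertTokenizer, which needs the missing spiece.model.
from transformers import BertTokenizer, AlbertForSequenceClassification

model_dir = "/home/transformers/albert_base_zh/"  # path from the error above
tokenizer = BertTokenizer.from_pretrained(model_dir)  # reads vocab.txt
model = AlbertForSequenceClassification.from_pretrained(model_dir)

inputs = tokenizer("这是一个例子", return_tensors="pt")
outputs = model(**inputs)
```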

penut85420 commented 4 years ago

@Danny-Google Thanks! I will keep researching.