Open: simonc256 opened this issue 4 years ago
For Chinese models, we use the word piece model provided by Jacob, as sentence piece gets worse performance on reading comprehension tasks for Chinese.
Hi, could you share the word piece model used in the ALBERT Chinese model? Thanks!
The vocab file is in the same folder as the model. For word piece, you only need the vocab file, not an SPM model, so you can skip the model argument in the input.
@beamind I think it means that you have to use `--vocab_file` instead of `--spm_model_file`.
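At the Python level that corresponds to something like the following rough sketch (I haven't verified the exact signature of `FullTokenizer` in this repo's `tokenization.py`, and the vocab file path is just a placeholder for whatever ships with the checkpoint):

```python
import tokenization  # tokenization.py from this repo

# Build the tokenizer from the vocab file only; with no SPM model given,
# FullTokenizer should fall back to the WordPiece path used by the Chinese models.
tokenizer = tokenization.FullTokenizer(
    vocab_file="albert_base_zh/vocab.txt",  # placeholder path, adjust to your download
    do_lower_case=True,
    spm_model_file=None)

print(tokenizer.tokenize("机器阅读理解"))
```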
But @Danny-Google I encounter a problem when I make this change in `run_squad_v1.py`: when it reaches `squad_utils.convert_examples_to_features`, an SPM model must be used in `tokenization.encode_pieces`, but a tokenizer built from a vocab file doesn't have an SPM model. I'm not familiar with SPM, so I'm not sure how to modify it properly.
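The rough direction I'm considering (unverified, and the `sp_model` attribute name is my assumption about `FullTokenizer`) is to branch on whether the tokenizer actually has a SentencePiece model loaded and fall back to plain WordPiece tokenization otherwise:

```python
import tokenization  # tokenization.py from this repo

def tokenize_text(tokenizer, text):
  """Use SentencePiece when it is loaded, otherwise fall back to WordPiece."""
  if getattr(tokenizer, "sp_model", None) is not None:
    # Path taken by squad_utils for the SPM-based (English) models.
    return tokenization.encode_pieces(tokenizer.sp_model, text)
  # WordPiece path for a tokenizer built from a vocab file only.
  return tokenizer.tokenize(text)
```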
@Danny-Google @beamind @penut85420 @0x0539 Were you able to solve it? I want to use ALBERT Chinese, and I am using the HuggingFace pipeline for sequence classification, which gives an error because spiece.model is missing:
We assumed '/home/transformers/albert_base_zh/' was a path or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.
@beamind Currently, squad_utils is meant to be used only for the SQuAD dataset. If you use the Chinese models, you may want to take a look at the CLUE code (https://github.com/CLUEbenchmark/CLUE/tree/master/baselines/models/albert).
@008karan The Chinese models use WordPiece, so you want to disable the SentencePiece part of the code.
@Danny-Google thanks! I will keep on researching.
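In case it helps anyone else, the direction I plan to try (unverified) is to load the WordPiece vocab with `BertTokenizer` instead of `AlbertTokenizer`, since the Chinese checkpoints ship a vocab file rather than spiece.model. The path is just the one from my error message, and this assumes the checkpoint has already been converted to the transformers format:

```python
from transformers import BertTokenizer, AlbertForSequenceClassification

# AlbertTokenizer expects spiece.model (SentencePiece), which the Chinese
# checkpoint does not include, so load the WordPiece vocab with BertTokenizer.
tokenizer = BertTokenizer.from_pretrained("/home/transformers/albert_base_zh/")
model = AlbertForSequenceClassification.from_pretrained(
    "/home/transformers/albert_base_zh/")

input_ids = tokenizer.encode("这是一个测试", return_tensors="pt")
outputs = model(input_ids)
```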
It seems that the SPM model files are missing from the tar files for Chinese models.