Closed hazalturkmen closed 2 years ago
If you first serialize your dataset to local files and then use that path as the dataset name, does it work then?
Thank you for the answer. I am new to using the Hugging Face datasets library. My corpus only contains sentences, with each line being a sentence. How can I serialize my dataset?
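For context, the line-by-line corpus format described here can be produced with nothing but the standard library. This is a minimal sketch with hypothetical paths and placeholder sentences, showing the format that is later consumed when the file is passed via `--train_file` with `--line_by_line=True`:

```python
# Minimal sketch (hypothetical path, placeholder sentences): write a corpus
# to a plain-text file with one sentence per line.
from pathlib import Path

sentences = [
    "Bu bir örnek cümledir.",          # placeholder, not real corpus data
    "Her satır tek bir cümle içerir.",  # placeholder, not real corpus data
]

corpus_path = Path("corpus.txt")  # hypothetical output path
corpus_path.write_text("\n".join(sentences) + "\n", encoding="utf-8")

# Each line of the file is now one training example.
print(corpus_path.read_text(encoding="utf-8").splitlines())
```

With the corpus in this shape, no further serialization step is needed before training.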
Let me cc @albertvillanova, who might have seen this error before :)
Hi @hazalturkmen, thanks for reporting.
When looking at your script call, I see you pass "default-b06526c46e9384b1" as dataset_config_name, which points to the cache (containing the Arrow file instead of the text file). I guess this is the cause of the issue.
When running the script run_mlm_flax.py with local data files, you should instead pass the parameter train_file (and validation_file, if applicable).

In summary, when calling run_mlm_flax.py:

- do not pass dataset_name nor dataset_config_name
- pass train_file instead
```
!python run_mlm_flax.py \
--output_dir="/content/bert" \
--model_type="bert" \
--config_name="/content/bert" \
--tokenizer_name="/content/bert" \
--line_by_line=True \
--train_file="/content/drive/MyDrive/Scorpus.txt" \
--max_seq_length="512" \
--weight_decay="0.01" \
--per_device_train_batch_size="128" \
--learning_rate="3e-4" \
--overwrite_output_dir \
--num_train_epochs="16" \
--adam_beta1="0.9"
```
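To illustrate why passing train_file works here: run_mlm_flax.py infers which datasets builder to use from the file extension of train_file, and a ".txt" file is loaded with the "text" builder, so each line becomes one example. Below is a simplified sketch of that dispatch; the helper name is hypothetical (the real script does this inline):

```python
# Simplified sketch (helper name is hypothetical) of the extension-based
# dispatch run_mlm_flax.py applies to local data files: ".txt" is mapped
# to the "text" builder, other extensions (e.g. "csv", "json") are used
# as the builder name directly.
def infer_builder(train_file: str) -> str:
    extension = train_file.split(".")[-1]
    if extension == "txt":
        extension = "text"
    return extension

print(infer_builder("/content/drive/MyDrive/Scorpus.txt"))  # → text
```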
Please, let me know if this fixes your issue.
Thank you @albertvillanova ! It fixed the issue :+1:
Hi, I want to develop a BERT model from scratch using a Turkish text corpus. First I created a tokenizer and loaded text data from my local drive, as seen below.
Then, I run run_mlm_flax.py and I get an error.
Note: I use Google Colab.