google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Apache License 2.0

ensure str for the case of bytes type. #223

Open jnory opened 4 years ago

jnory commented 4 years ago

Hi,

I noticed that create_pretraining_data.py aborts with the following error:

  File "albert/create_pretraining_data.py", line 405, in <listcomp>
    for i in piece])):
AttributeError: 'int' object has no attribute 'lower'

The error occurs because the variable piece may be of type bytes in Python 3. Iterating over a bytes object yields int values, and an int has no lower method.
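
The failure can be reproduced in isolation. The snippet below is only a sketch, not the code from create_pretraining_data.py: `lowercase_chars` is a hypothetical stand-in for the list comprehension shown in the traceback, and the sample pieces are made up for illustration.

```python
def lowercase_chars(piece):
    # Hypothetical stand-in for the comprehension in the traceback:
    # it calls .lower() on every element of `piece`.
    return [i.lower() for i in piece]


str_piece = "Ab"
bytes_piece = b"Ab"

# Iterating over a str yields 1-character str values, so .lower() works.
print(lowercase_chars(str_piece))

# Iterating over bytes in Python 3 yields int values instead.
print(list(bytes_piece))

# Calling .lower() on those ints raises the AttributeError from the report.
try:
    lowercase_chars(bytes_piece)
except AttributeError as err:
    error_message = str(err)
    print(error_message)
```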

I'm using the SentencePiece tokenizer, and a minimal input text that triggers the error is the following (taken from Wikipedia):

カムデンは(39.937195, -75.106186)に位置する。

This small PR fixes the problem by ensuring that piece is a str. Please let me know if you notice anything.
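
As a rough illustration of the idea (not the actual diff), the conversion could look like the sketch below. `ensure_str` and `all_chars_allowed` are hypothetical names introduced here, and UTF-8 is assumed as the encoding of the pieces. The six library offers a comparable `six.ensure_str` helper, which may be preferable in a codebase that already depends on six.

```python
def ensure_str(piece, encoding="utf-8"):
    """Hypothetical helper: return `piece` as str, decoding bytes if needed."""
    if isinstance(piece, bytes):
        # Assumption: pieces are UTF-8 encoded.
        return piece.decode(encoding)
    return piece


def all_chars_allowed(piece, allowed):
    """Check that every lowercased character of `piece` is in `allowed`."""
    # Normalizing first means the loop always sees str characters,
    # whether the tokenizer returned str or bytes.
    return all(ch.lower() in allowed for ch in ensure_str(piece))


english_chars = set("abcdefghijklmnopqrstuvwxyz")

print(all_chars_allowed(b"Camden", english_chars))
print(all_chars_allowed("カムデン".encode("utf-8"), english_chars))
```

With the conversion in place, a bytes piece and its str equivalent take the same code path, so the check no longer depends on which type the tokenizer returns.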

Sincerely,

googlebot commented 4 years ago

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

:memo: Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


jnory commented 4 years ago

@googlebot I signed it!

googlebot commented 4 years ago

CLAs look good, thanks!
