tensorflow / text

Making text a first-class citizen in TensorFlow.
https://www.tensorflow.org/beta/tutorials/tensorflow_text/intro
Apache License 2.0

Running SentencePieceModel.tokenize in a map with num_parallel_calls=tf.data.experimental.AUTOTUNE freezes #374

Open craffel opened 4 years ago

craffel commented 4 years ago

On my local machine, the following code snippet hangs after printing out a few examples (and cannot be killed via a keyboard interrupt; it must be sigkilled):

import tensorflow as tf
import tensorflow_text
with tf.io.gfile.GFile("gs://t5-data/vocabs/cc_all.32000/sentencepiece.model", "rb") as f:
  tokenizer = tensorflow_text.SentencepieceTokenizer(model=f.read())
ds = tf.data.Dataset.from_tensor_slices({"a": ["b"]*10})
ds = ds.map(lambda ex: tokenizer.tokenize(ex["a"]), num_parallel_calls=tf.data.experimental.AUTOTUNE)
for ex in ds.as_numpy_iterator():
  print(ex)

This does not freeze:

import tensorflow as tf
import tensorflow_text
with tf.io.gfile.GFile("gs://t5-data/vocabs/cc_all.32000/sentencepiece.model", "rb") as f:
  tokenizer = tensorflow_text.SentencepieceTokenizer(model=f.read())
ds = tf.data.Dataset.from_tensor_slices({"a": ["b"]*10})
ds = ds.map(lambda ex: tokenizer.tokenize(ex["a"]))
for ex in ds.as_numpy_iterator():
  print(ex)

nor does this:

import tensorflow as tf
ds = tf.data.Dataset.from_tensor_slices(tf.range(10))
ds = ds.map(lambda a: a + 2, num_parallel_calls=tf.data.experimental.AUTOTUNE)
for ex in ds.as_numpy_iterator():
    print(ex)

It appears that there is some problematic interaction between setting num_parallel_calls=tf.data.experimental.AUTOTUNE in a map and tensorflow_text.SentencepieceTokenizer.tokenize. Note that this happens only on my local machine; it does not appear to occur in, e.g., a public Colab kernel. The Python environment I am using was created via pyenv, with Python 3.8.5 and tensorflow/tensorflow-text==2.3.0. Here is the output of pip freeze:

absl-py==0.10.0
appnope==0.1.0
argon2-cffi==20.1.0
astunparse==1.6.3
attrs==20.1.0
Babel==2.8.0
backcall==0.2.0
bleach==3.1.5
boto==2.49.0
cachetools==4.1.1
certifi==2020.6.20
cffi==1.14.2
chardet==3.0.4
click==7.1.2
decorator==4.4.2
defusedxml==0.6.0
dill==0.3.2
distro==1.5.0
dm-tree==0.1.5
entrypoints==0.3
filelock==3.0.12
flake8==3.8.3
future==0.18.2
gast==0.3.3
gevent==20.6.2
gin-config==0.3.0
google-api-core==1.22.1
google-api-python-client==1.10.0
google-auth==1.20.1
google-auth-httplib2==0.0.4
google-auth-oauthlib==0.4.1
google-cloud-core==1.4.1
google-cloud-storage==1.30.0
google-compute-engine==2.8.13
google-crc32c==0.1.0
google-pasta==0.2.0
google-resumable-media==0.7.1
googleapis-common-protos==1.52.0
greenlet==0.4.16
grpcio==1.31.0
h5py==2.10.0
httplib2==0.18.1
idna==2.10
importlib-resources==3.0.0
ipykernel==5.3.4
ipython==7.17.0
ipython-genutils==0.2.0
jedi==0.17.2
Jinja2==2.11.2
joblib==0.16.0
json5==0.9.5
jsonschema==3.2.0
jupyter-client==6.1.6
jupyter-core==4.6.3
jupyterlab==2.2.5
jupyterlab-server==1.2.0
Keras-Preprocessing==1.1.2
Markdown==3.2.2
MarkupSafe==1.1.1
mccabe==0.6.1
mesh-tensorflow==0.1.16
mistune==0.8.4
nbconvert==5.6.1
nbformat==5.0.7
nltk==3.5
notebook==6.1.3
numpy==1.19.1
oauth2client==4.1.3
oauthlib==3.1.0
opt-einsum==3.3.0
packaging==20.4
pandas==1.1.0
pandocfilters==1.4.2
parso==0.7.1
pexpect==4.8.0
pickleshare==0.7.5
portalocker==2.0.0
prometheus-client==0.8.0
promise==2.3
prompt-toolkit==3.0.6
protobuf==3.13.0
ptyprocess==0.6.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycodestyle==2.6.0
pycparser==2.20
pyflakes==2.2.0
Pygments==2.6.1
pyparsing==2.4.7
pyrsistent==0.16.0
python-dateutil==2.8.1
pytz==2020.1
pyzmq==19.0.2
regex==2020.7.14
requests==2.24.0
requests-oauthlib==1.3.0
rouge-score==0.0.4
rsa==4.6
sacrebleu==1.4.13
sacremoses==0.0.43
scikit-learn==0.23.2
scipy==1.5.2
Send2Trash==1.5.0
sentencepiece==0.1.91
six==1.15.0
t5==0.6.4
tensorboard==2.3.0
tensorboard-plugin-wit==1.7.0
tensorflow==2.3.0
tensorflow-datasets==3.2.1
tensorflow-estimator==2.3.0
tensorflow-metadata==0.23.0
tensorflow-text==2.3.0
termcolor==1.1.0
terminado==0.8.3
testpath==0.4.4
tfds-nightly==3.2.1.dev202008200105
threadpoolctl==2.1.0
tokenizers==0.8.1rc1
torch==1.6.0
tornado==6.0.4
tqdm==4.48.2
traitlets==4.3.3
transformers==3.0.2
uritemplate==3.0.1
urllib3==1.25.10
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==1.0.1
wrapt==1.12.1
zope.event==4.4
zope.interface==5.1.0
broken commented 4 years ago

Are you on MacOS or Linux?

craffel commented 4 years ago

This laptop is MacOS.

craffel commented 4 years ago

To follow up, setting num_parallel_calls to anything 2 or above also causes the same hang, though intermittently (i.e. sometimes it iterates over the dataset without hanging, sometimes it doesn't):

import tensorflow as tf
import tensorflow_text
with tf.io.gfile.GFile("gs://t5-data/vocabs/cc_all.32000/sentencepiece.model", "rb") as f:
  tokenizer = tensorflow_text.SentencepieceTokenizer(model=f.read())
ds = tf.data.Dataset.from_tensor_slices({"a": ["b"]*10})
ds = ds.map(lambda ex: tokenizer.tokenize(ex["a"]), num_parallel_calls=2)
for ex in ds.as_numpy_iterator():
  print(ex)

So it does not have anything to do with tf.data.experimental.AUTOTUNE specifically; it seems to be an interaction between a parallel map and tokenize.
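
For anyone who needs a stopgap in the meantime, a possible workaround (just a sketch based on the observation above, not a confirmed fix) is to keep the tokenize step in a sequential map and reserve parallelism for later transformations that don't involve the tokenizer:

import tensorflow as tf
import tensorflow_text
with tf.io.gfile.GFile("gs://t5-data/vocabs/cc_all.32000/sentencepiece.model", "rb") as f:
  tokenizer = tensorflow_text.SentencepieceTokenizer(model=f.read())
ds = tf.data.Dataset.from_tensor_slices({"a": ["b"] * 10})
# Sequential map for the op that triggers the hang.
ds = ds.map(lambda ex: tokenizer.tokenize(ex["a"]))
# Parallelism can still be used for downstream ops that don't touch the tokenizer.
ds = ds.map(lambda ids: ids[:128], num_parallel_calls=tf.data.experimental.AUTOTUNE)
for ex in ds.as_numpy_iterator():
  print(ex)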

s4sarath commented 3 years ago

I am having the same issue. Any update?

broken commented 3 years ago

I believe this is fixed with https://github.com/tensorflow/text/commit/52f9004b4c34d0aea3c55dc3eb41a199946b8550

However, we haven't been able to get TF Text to build with this on Windows yet. (The virtual function table is missing for AttrValue in the TF lib). I've been actively working with the TF Infra team to get TF exporting this correctly, but it's been difficult.

s4sarath commented 3 years ago

Hi, I will explain what I am facing now.

In the Cloud TPU VM, the provided version of TF is 2.6.0. If I do pip install tf-text, it tries to overwrite that.

So I built it from scratch, but ran into many errors, especially "local_config_tf" not found. My fix was to replace

"@._config_tf//:libtensorflow_framework" -> "@._tensorflow//tensorflow/core:framework"
"@._config_tf//:tf_header_lib" -> "@._tensorflow//tensorflow/core:lib"

in two files:

a. tftext.bzl
b. third_party/sentencepiece/processor.patch

The build succeeds and all is good. But when I use the Sentencepiece tokenizer or BertTokenizer inside a tf.data.Dataset.map function, I get "Segmentation fault (core dumped)".

I checked "oss_configure/.run_tests.sh". 6 tests related to tokenizer failed. Remaining 274 tests were successful.

Any suggestion or help would be appreciated.

Thanks


s4sarath commented 3 years ago

Having said that, outside of map it works fine. But I need it inside map to do preprocessing on the fly.
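
For reference, the "outside map" pattern that does work looks roughly like this (a sketch; the spiece.model path is a placeholder, and padding to a dense tensor is just one way to get the tokenized ids into tf.data):

import tensorflow as tf
import tensorflow_text as tf_text

# Placeholder path; substitute your own SentencePiece model.
tokenizer = tf_text.SentencepieceTokenizer(
    model=tf.io.gfile.GFile("spiece.model", "rb").read(), out_type=tf.int32)

texts = ["This is text 1", "This is text 2"]
# Tokenize eagerly, outside tf.data, then pad the ragged result to a dense tensor.
token_ids = tokenizer.tokenize(texts).to_tensor()
ds = tf.data.Dataset.from_tensor_slices(token_ids)
for ex in ds:
  print(ex)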


broken commented 3 years ago

tf 2.6 isn't released yet; is this tf-nightly? Try installing tensorflow-text-nightly instead.

s4sarath commented 3 years ago

Yes, I had the same assumption, since 2.6.0 isn't released. :-) But the version of TensorFlow in the TPU VM is 2.6.0.

If I install tensorflow-text-nightly, it replaces the existing TF version (2.6.0) with 2.6.0rc0, and then the TPU devices aren't recognized. So I was restricted to the provided TF version, and I built all the .so files for tf-text locally. Only SentencepieceTokenizer and BertTokenizer are not working inside tf.data.Dataset.map.


broken commented 3 years ago

The tensorflow-text-nightly package does not have safety restrictions on the version of TF you have installed, so it will not replace the existing TF version.
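
A quick sanity check after installing it (a sketch; assumes tensorflow_text exposes __version__, which it does in recent releases):

import tensorflow as tf
import tensorflow_text as tf_text
print(tf.__version__)       # should still report the pre-installed TF
print(tf_text.__version__)  # the nightly tensorflow-text build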

s4sarath commented 3 years ago

Thanks. I tried tensorflow-text-nightly as you suggested, but got the following error:

NotFoundError                             Traceback (most recent call last)
<ipython-input-1-7af2306084ed> in <module>
----> 1 import tensorflow_text as tf_text

~/.local/lib/python3.8/site-packages/tensorflow_text/__init__.py in <module>
     19 # pylint: disable=wildcard-import
     20 from tensorflow_text.python import keras
---> 21 from tensorflow_text.python import metrics
     22 from tensorflow_text.python.ops import *
     23 

~/.local/lib/python3.8/site-packages/tensorflow_text/python/metrics/__init__.py in <module>
     18 
     19 # pylint: disable=wildcard-import
---> 20 from tensorflow_text.python.metrics.text_similarity_metric_ops import *
     21 
     22 # Public symbols in the "tensorflow_text.metrics" package.

~/.local/lib/python3.8/site-packages/tensorflow_text/python/metrics/text_similarity_metric_ops.py in <module>
     26 from tensorflow.python.framework import load_library
     27 from tensorflow.python.platform import resource_loader
---> 28 gen_text_similarity_metric_ops = load_library.load_op_library(resource_loader.get_path_to_datafile('_text_similarity_metric_ops.so'))
     29 
     30 

/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/load_library.py in load_op_library(library_filename)
     56     RuntimeError: when unable to load the library or get the python wrappers.
     57   """
---> 58   lib_handle = py_tf.TF_LoadLibrary(library_filename)
     59   try:
     60     wrappers = _pywrap_python_op_gen.GetPythonWrappers(

NotFoundError: libtensorflow_framework.so.2: cannot open shared object file: No such file or directory

If this tokenizer works, then I can train my model on TPU without creating TFRecords. That's a significant breakthrough, to be frank. :)

broken commented 3 years ago

@s4sarath Ugh! Apologies; there's a lot going on and I just realized you are trying to install on a Cloud TPU VM. TF Text should be available on Cloud TPU by default. Can you test without trying to install a new version of tensorflow_text, and if it isn't available, create a new issue?

As an added benefit, the fix I referred to above should actually be available in the cloud instances already, as it's just our pip packages which are behind due to the Windows build issues.

s4sarath commented 3 years ago

No worries, Robert. Maybe I wasn't clear enough.

By default, tf-text is not available in the TPU VM. I checked on v3-8 alphav2 machines in the europe-west-4 region. I mailed the TRC team regarding support a few days ago and still haven't heard back from them.

That's the reason I built it using Bazel. Except for the tokenizers, all the other ops work fine inside tf.data.


s4sarath commented 3 years ago

A sample code to reproduce.

import tensorflow as tf
import tensorflow_text as tf_text
model_file_path = 'sample/spiece.model'
dtype = tf.int32
nbest_size = 0
alpha = 1.0

def _create_tokenizer(model_serialized_proto, dtype, nbest_size, alpha):
    return tf_text.SentencepieceTokenizer(
        model=model_serialized_proto,
        out_type=dtype,
        nbest_size=nbest_size,
        alpha=alpha)

model_serialized_proto = tf.io.gfile.GFile(model_file_path, "rb").read()

tokenizer_sp = _create_tokenizer(model_serialized_proto,
                                 dtype,
                                 nbest_size,
                                 alpha)

def map_tokenize(text):
    return tokenizer_sp.tokenize(text)

# Read wikipedia data

dataset = tf.data.Dataset.from_tensor_slices(['This is text 1', 'This is text2', 'This is text3', 'This is text4'])
ds = dataset.map(map_tokenize)

The error in the TPU VM, as soon as the map + tokenize executes, is as follows:


https://symbolize.stripped_domain/r/?trace=7f74312c64bf,7f7655bac20f,7f74312c6601,7f74312c7326,7f7352004f3b,7f74312bdf5c,7f74310c7368,7f74310c8aa0,7f74310c97e7,7f7425f63524,7f7412c6a156,7f7412c7147b,5f2fb8,902aff&map=b7c22d7954df6b6961e4435041132cf899ee4a5e:7f7421f01000-7f7435c00270 
*** SIGSEGV (@0x14), see gl__________25#s15 received by PID 72222 (TID 72222) on cpu 56; stack trace: ***
PC: @     0x7f74312c64bf  (unknown)  tensorflow::AttrSlice::Find()
    @     0x7f74213f71e0        976  (unknown)
    @     0x7f7655bac210  (unknown)  (unknown)
    @     0x7f74312c6602         80  tensorflow::AttrSlice::Find()
    @     0x7f74312c7327         64  tensorflow::GetNodeAttr()
    @     0x7f7352004f3c        112  std::_Function_handler<>::_M_invoke()
    @     0x7f74312bdf5d         64  tensorflow::shape_inference::InferenceContext::Run()
    @     0x7f74310c7369        544  tensorflow::ShapeRefiner::RunShapeFn()
    @     0x7f74310c8aa1        352  tensorflow::ShapeRefiner::AddNodeInternal()
    @     0x7f74310c97e8         32  tensorflow::ShapeRefiner::AddNode()
    @     0x7f7425f63525        160  TF_FinishOperation
    @     0x7f7412c6a157        144  pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()
    @     0x7f7412c7147c        720  pybind11::cpp_function::dispatcher()
    @           0x5f2fb9  (unknown)  PyCFunction_Call
    @           0x902b00  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f74312c64bf,7f74213f71df,7f7655bac20f,7f74312c6601,7f74312c7326,7f7352004f3b,7f74312bdf5c,7f74310c7368,7f74310c8aa0,7f74310c97e7,7f7425f63524,7f7412c6a156,7f7412c7147b,5f2fb8,902aff&map=b7c22d7954df6b6961e4435041132cf899ee4a5e:7f7421f01000-7f7435c00270,ca1b7ab241ee28147b3d590cadb5dc1b:7f74146f8000-7f742172ab20 
E0714 03:05:42.738522   72222 coredump_hook.cc:292] RAW: Remote crash data gathering hook invoked.
E0714 03:05:42.738555   72222 coredump_hook.cc:384] RAW: Skipping coredump since rlimit was 0 at process start.
E0714 03:05:42.738568   72222 client.cc:222] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0714 03:05:42.738574   72222 coredump_hook.cc:447] RAW: Sending fingerprint to remote end.
E0714 03:05:42.738586   72222 coredump_socket.cc:124] RAW: Stat failed errno=2 on socket /var/google/services/logmanagerd/remote_coredump.socket
E0714 03:05:42.738598   72222 coredump_hook.cc:451] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] Missing crash reporting socket. Is the listener running?
E0714 03:05:42.738607   72222 coredump_hook.cc:525] RAW: Discarding core.
E0714 03:05:42.954572   72222 process_state.cc:771] RAW: Raising signal 11 with default behavior
Segmentation fault (core dumped)
s4sarath commented 3 years ago

One more update: I tried a hack, wrapping it inside tf.py_function, like this:

def map_tokenize(text):
    #text = text.numpy().decode().strip()
    return tokenizer_sp.tokenize(text).merge_dims(-1, 1).to_tensor()

def map_tokenize_py(text):
    input_ids = tf.py_function(map_tokenize, [text],
                tf.int32)
    return [input_ids]

# Read wikipedia data

dataset = tf.data.Dataset.from_tensor_slices(['This is text 1', 'This is text2 waw wdxce', 'This is text3', 'This is text4'])
dataset = dataset.batch(2)
ds = dataset.map(map_tokenize_py)

Now it works (though it is slow due to the py_function). But the moment we distribute the dataset via the strategy (experimental_distribute), I get the following error. Somehow the SentencePiece resource is not being placed on the TPU devices, I guess (I am no expert).

2021-07-14 04:00:46.163813: W tensorflow/core/framework/op_kernel.cc:1692] OP_REQUIRES failed at sentencepiece_kernels.cc:275 : Not found: Resource localhost/_0_SentencepieceOp/N10tensorflow4text12_GLOBAL__N_121SentencepieceResourceE does not exist.
2021-07-14 04:00:46.164110: W tensorflow/core/framework/op_kernel.cc:1680] Unknown: NotFoundError: Resource localhost/_0_SentencepieceOp/N10tensorflow4text12_GLOBAL__N_121SentencepieceResourceE does not exist. [Op:SentencepieceTokenizeOp]
Traceback (most recent call last):

  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 247, in __call__
    return func(device, token, args)

  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 135, in __call__
    ret = self._func(*args)

  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 645, in wrapper
    return func(*args, **kwargs)

  File "<ipython-input-24-9efe8b362f62>", line 3, in map_tokenize
    return tokenizer_sp.tokenize(text).merge_dims(-1, 1).to_tensor()

  File "/home/sidhu/Libraries/text/tensorflow_text/python/ops/sentencepiece_tokenizer.py", line 151, in tokenize
    gen_sentencepiece_tokenizer.sentencepiece_tokenize_op(

  File "<string>", line 175, in sentencepiece_tokenize_op

  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 6901, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)

  File "<string>", line 3, in raise_from

tensorflow.python.framework.errors_impl.NotFoundError: Resource localhost/_0_SentencepieceOp/N10tensorflow4text12_GLOBAL__N_121SentencepieceResourceE does not exist. [Op:SentencepieceTokenizeOp]
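
One thing that might be worth trying (just a sketch, not verified on a TPU VM; make_dataset_fn is my own name): create the tokenizer inside the dataset function handed to the strategy, so the SentencepieceOp resource is created within each input pipeline rather than captured from the outer context:

import tensorflow as tf
import tensorflow_text as tf_text

def make_dataset_fn(model_serialized_proto):
  def dataset_fn(input_context):
    # Build the tokenizer here so its resource lives in this pipeline's context.
    tokenizer = tf_text.SentencepieceTokenizer(
        model=model_serialized_proto, out_type=tf.int32)
    ds = tf.data.Dataset.from_tensor_slices(
        ["This is text 1", "This is text 2", "This is text 3", "This is text 4"])
    ds = ds.shard(input_context.num_input_pipelines,
                  input_context.input_pipeline_id)
    ds = ds.batch(2)
    # Tokenize a batch of strings and pad the ragged result to a dense tensor.
    return ds.map(lambda t: tokenizer.tokenize(t).to_tensor())
  return dataset_fn

# With an already-created TPU strategy:
# dist_ds = strategy.distribute_datasets_from_function(
#     make_dataset_fn(model_serialized_proto))
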
broken commented 3 years ago

You're right. Cloud TPU has tf.Text by default, but it is not there on the custom TPU VMs. I talked with the Cloud TPU team, and custom op support should be better with 2.6. Currently it's a monolithic build, which is probably why tf.Text couldn't find the libtensorflow_framework.so file.

Regarding your latest error, I'll try to get somebody on the team to look at it if I can't find more time myself.

s4sarath commented 3 years ago

Thanks Robert. Much appreciated. 👍
