protocolbuffers / protobuf

Protocol Buffers - Google's data interchange format
http://protobuf.dev

TypeError: Couldn't build proto file into descriptor pool! Invalid proto descriptor for file "sentencepiece_model.proto": sentencepiece_model.proto: A file with this name is already in the pool. #12882

Closed. KawaiiNotHawaii closed this issue 1 year ago.

KawaiiNotHawaii commented 1 year ago

What version of protobuf and what language are you using?
Version: v3.8.0 (NOTE: please try updating to the latest version of protoc/runtime possible beforehand to attempt to resolve your problem)
Language: Python
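For completeness, the runtime version that Python actually imports can be checked directly; a minimal snippet (the compiler's version is reported separately by `protoc --version`):

```python
# Print the protobuf runtime version that Python imports; useful for spotting
# a mismatch between the runtime and the protoc that generated *_pb2 modules.
import google.protobuf
print(google.protobuf.__version__)
```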

What operating system (Linux, Windows, ...) and version?
Ubuntu 16.04.6 LTS (GNU/Linux 4.4.0-142-generic x86_64)

What runtime / compiler are you using (e.g., python version or gcc version)?
Python 3.8.0 | packaged by conda-forge | (default, Nov 22 2019, 19:11:38) [GCC 7.3.0] :: Anaconda, Inc. on linux

What did you do? Steps to reproduce the behavior:

  1. Load a LanguageModelingTransformer from lightning-transformers.
  2. Load a BIG-bench dataset using load_dataset() imported from datasets.
  3. See the error below. The stack trace differs, but the same TypeError occurs even with the order swapped, i.e. loading the dataset first and the model afterwards. A condensed reproduction follows this list.
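A condensed sketch of the reproduction, assuming lightning-transformers 0.2.5 and datasets 2.12.0 as listed below; `"gpt2"` is a stand-in, since the issue does not name the checkpoint that was used:

```python
# Hypothetical condensed reproduction; "gpt2" stands in for whichever
# checkpoint was actually loaded.
from lightning_transformers.task.nlp.language_modeling import LanguageModelingTransformer
from datasets import load_dataset

model = LanguageModelingTransformer(pretrained_model_name_or_path="gpt2")

# The TypeError fires inside this call's import chain (datasets -> bigbench
# -> t5 -> seqio -> sentencepiece_model_pb2), whichever of the two steps
# runs first.
dataset = load_dataset("bigbench", "modified_arithmetic", cache_dir="data")
```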

What did you expect to see?
The dataset to load successfully.

What did you see instead?

```
Traceback (most recent call last):
  File "<cell line: 2>", line 2, in <module>
    dataset = load_dataset("bigbench", 'modified_arithmetic', cache_dir='data', split='valid…
  File ".../site-packages/datasets/load.py", line 1773, in load_dataset
    builder_instance = load_dataset_builder(
        path=path,
        name=name,
        data_dir=data_dir,
        ...
  File ".../site-packages/datasets/load.py", line 1512, in load_dataset_builder
    builder_cls = import_main_class(dataset_module.module_path)
  File ".../site-packages/datasets/load.py", line 115, in import_main_class
    module = importlib.import_module(module_path)
  File ".../lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "~/.cache/huggingface/modules/datasets_modules/datasets/bigbench/d2757373c3fb6b35a846ee951265c3f8fbf0124fb650b12cef5678cf902914d2/bigbench.py", line 22, in <module>
    import bigbench.api.util as bb_utils
  File ".../site-packages/bigbench/api/util.py", line 25, in <module>
    import bigbench.api.json_task as json_task
  File ".../site-packages/bigbench/api/json_task.py", line 26, in <module>
    import bigbench.api.task_metrics as metrics
  File ".../site-packages/bigbench/api/task_metrics.py", line 24, in <module>
    from t5.evaluation import metrics
  File ".../site-packages/t5/__init__.py", line 17, in <module>
    import t5.data
  File ".../site-packages/t5/data/__init__.py", line 17, in <module>
    from t5.data.dataset_providers import *
  File ".../site-packages/t5/data/dataset_providers.py", line 28, in <module>
    import seqio
  File ".../site-packages/seqio/__init__.py", line 18, in <module>
    from seqio.dataset_providers import *
  File ".../site-packages/seqio/dataset_providers.py", line 38, in <module>
    from seqio import metrics as metrics_lib
  File ".../site-packages/seqio/metrics.py", line 25, in <module>
    from seqio import utils
  File ".../site-packages/seqio/utils.py", line 29, in <module>
    from seqio.vocabularies import Vocabulary
  File ".../site-packages/seqio/vocabularies.py", line 25, in <module>
    from sentencepiece import sentencepiece_model_pb2
  File ".../site-packages/sentencepiece/sentencepiece_model_pb2.py", line 16, in <module>
    DESCRIPTOR = _descriptor.FileDescriptor(
        name='sentencepiece_model.proto',
        package='sentencepiece',
        syntax='proto2',
        ...
  File ".../site-packages/google/protobuf/descriptor.py", line 1024, in __new__
    return _message.default_pool.AddSerializedFile(serialized_pb)
TypeError: Couldn't build proto file into descriptor pool!
Invalid proto descriptor for file "sentencepiece_model.proto":
sentencepiece_model.proto: A file with this name is already in the pool.
```
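For context on the failure mode: the generated sentencepiece_model_pb2.py registers its serialized descriptor in the process-wide default pool, and the pool only rejects a second file whose name matches but whose bytes differ (a byte-identical re-add is returned as-is). That typically points to two copies of sentencepiece_model_pb2.py generated by different protoc versions. A minimal sketch of the mechanism, assuming the C++-backed runtime shown in the traceback; the descriptors here are made up, only the colliding name matters:

```python
# Run in a fresh interpreter. Registering two different serializations under
# the same file name reproduces the TypeError; an identical re-add does not.
from google.protobuf import descriptor_pb2, descriptor_pool

pool = descriptor_pool.Default()

first = descriptor_pb2.FileDescriptorProto(
    name="sentencepiece_model.proto", package="sentencepiece")
pool.AddSerializedFile(first.SerializeToString())   # registers the file
pool.AddSerializedFile(first.SerializeToString())   # identical bytes: accepted

second = descriptor_pb2.FileDescriptorProto(
    name="sentencepiece_model.proto", package="sentencepiece2")
pool.AddSerializedFile(second.SerializeToString())
# TypeError: Couldn't build proto file into descriptor pool! Invalid proto
# descriptor for file "sentencepiece_model.proto": sentencepiece_model.proto:
# A file with this name is already in the pool.
```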

Make sure you include information that can help us debug (full error message, exception listing, stack trace, logs).

Anything else we should know about your project / environment?
Dependencies mentioned:

```
Name: lightning-transformers
Version: 0.2.5
Summary: Lightning Transformers.
Home-page: https://github.com/Lightning-AI/lightning-transformers
Author: Lightning AI et al.
Author-email: pytorch@lightning.ai
License: Apache-2.0
Location: /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/site-packages
Requires: datasets, Pillow, pytorch-lightning, sentencepiece, torchmetrics, transformers
Required-by:
```

```
Name: datasets
Version: 2.12.0
Summary: HuggingFace community-driven open-source library of datasets
Home-page: https://github.com/huggingface/datasets
Author: HuggingFace Inc.
Author-email: thomas@huggingface.co
License: Apache 2.0
Location: /data2/cxsun/anaconda3/envs/llm_new/lib/python3.8/site-packages
Requires: aiohttp, dill, fsspec, huggingface-hub, multiprocess, numpy, packaging, pandas, pyarrow, pyyaml, requests, responses, tqdm, xxhash
Required-by: bigbench, lightning-transformers
```
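Given the dependency chain above (lightning-transformers and the seqio path both pull in sentencepiece), a hypothetical helper like the following can check whether more than one generated copy of the proto module is installed; the helper and approach are illustrative, not from the issue:

```python
# Hypothetical check: list every installed copy of sentencepiece_model_pb2.py.
# Two copies generated by different protoc versions would explain the name
# collision in the default descriptor pool.
import pathlib
import site

for root in site.getsitepackages():
    for hit in pathlib.Path(root).rglob("sentencepiece_model_pb2.py"):
        print(hit)
```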

KawaiiNotHawaii commented 1 year ago

I fixed this bug by following the instructions here.
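For readers hitting the same error: a commonly cited workaround (not necessarily the fix linked above) is to force the pure-Python protobuf implementation before any *_pb2 module is imported, or alternatively to pin protobuf so the runtime matches the version that generated the offending *_pb2 files:

```python
# Commonly reported workaround, not necessarily the linked fix: select the
# pure-Python protobuf runtime, which many users report avoids this descriptor
# pool collision. Must run before any *_pb2 module is imported.
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
```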