nlmatics / nlm-ingestor

This repo provides the server side code for llmsherpa API to connect. It includes parsers for various file formats.
https://www.nlmatics.com
Apache License 2.0

Local only use #15

Open maximedb opened 9 months ago

maximedb commented 9 months ago

Hello,

Thank you for this nice library.

Is there a way to use nlm-ingestor without an internet connection? It seems to download a tokenizer from OpenAI at import time. I get the following error:

  File "/opt/conda/envs/DWS-CPU/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/nlm_ingestor/ingestion_daemon/__main__.py", line 8, in <module>
    from nlm_ingestor.ingestor import ingestor_api
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/nlm_ingestor/ingestor/__init__.py", line 3, in <module>
    from nlm_utils.utils import generate_version
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/nlm_utils/__init__.py", line 4, in <module>
    import nlm_utils.model_client
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/nlm_utils/model_client/__init__.py", line 1, in <module>
    from .classification import ClassificationClient
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/nlm_utils/model_client/classification.py", line 10, in <module>
    from nlm_utils.model_client.flan_t5_client import FlanT5Client
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/nlm_utils/model_client/flan_t5_client.py", line 7, in <module>
    from nlm_utils.utils.answer_type import answer_type_map
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/nlm_utils/utils/__init__.py", line 8, in <module>
    from .utils import ensure_bool
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/nlm_utils/utils/utils.py", line 4, in <module>
    oai_tokenizer = tiktoken.get_encoding(
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/tiktoken/registry.py", line 73, in get_encoding
    enc = Encoding(**constructor())
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/tiktoken_ext/openai_public.py", line 72, in cl100k_base
    mergeable_ranks = load_tiktoken_bpe(
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/tiktoken/load.py", line 147, in load_tiktoken_bpe
    contents = read_file_cached(tiktoken_bpe_file, expected_hash)
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/tiktoken/load.py", line 64, in read_file_cached
    contents = read_file(blobpath)
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/tiktoken/load.py", line 25, in read_file
    resp = requests.get(blobpath)
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/opt/conda/envs/DWS-CPU/lib/python3.9/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='openaipublic.blob.core.windows.net', port=443): Max retries exceeded with url: /encodings/cl100k_base.tiktoken (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fe6ded15d30>: Failed to establish a new connection: [Errno -2] Name or service not known'))

Thank you, Maxime.

JSv4 commented 9 months ago

I think the issue here is that tiktoken needs to download the cl100k_base vocab file from OpenAI the first time it runs, and nlm_utils triggers that at import time. This Stack Overflow thread has some suggestions for pre-fetching the cached data file that look like they'd work, but FYI, I haven't tried them and can't vouch for them. If any of the proposed approaches works, it'd be great to share that with the community. Please report back!
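For reference, the pre-fetch idea could look roughly like the sketch below. It is untested against nlm-ingestor and rests on two assumptions about tiktoken's caching that are not stated in this thread: that it looks in the directory named by the `TIKTOKEN_CACHE_DIR` environment variable, and that the cached file is named after the SHA-1 hex digest of the blob URL. The helper names `cache_file_name` and `install_vocab` are hypothetical. The vocab file itself has to be downloaded on a machine with internet access and copied over first.

```python
import hashlib
import os
import shutil
from pathlib import Path

# URL taken from the ConnectionError in the traceback above.
CL100K_URL = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"


def cache_file_name(blob_url: str) -> str:
    """Return the assumed tiktoken cache key: SHA-1 hex digest of the blob URL."""
    return hashlib.sha1(blob_url.encode()).hexdigest()


def install_vocab(downloaded_file, cache_dir, blob_url: str = CL100K_URL) -> Path:
    """Copy a pre-downloaded vocab file into cache_dir under the assumed cache name.

    Also points TIKTOKEN_CACHE_DIR at cache_dir for the current process
    (assumption: tiktoken reads this variable when it loads an encoding).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / cache_file_name(blob_url)
    shutil.copyfile(downloaded_file, target)
    os.environ["TIKTOKEN_CACHE_DIR"] = str(cache_dir)
    return target
```

Because the traceback shows `tiktoken.get_encoding` being called while `nlm_utils` is imported, the environment variable would need to be set (here, or exported in the shell) before `nlm_ingestor` is imported or the daemon is started.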

roblen001 commented 9 months ago

> I think the issue here is TikToken needs to download some vocab files from OpenAI at runtime (the first time it runs). This SO thread has some suggestions to pre-fetch the cached data file that seem like they'd work, but FYI, I haven't tried them and can't vouch for them. If any of the proposed approaches work, it'd be great to share that with the community. Please report back!

I can confirm this worked for me.