protectai / llm-guard

The Security Toolkit for LLM Interactions
https://llm-guard.com/
MIT License

nltk release 3.8.2 breaking change #177

Status: Closed (nicoADSP closed this issue 3 weeks ago)

nicoADSP commented 1 month ago

Describe the bug: Four days ago, nltk shipped a breaking change in its 3.8.2 release. The issue is described here. It causes any application that depends on llm-guard to crash with the following error:

  File "/home/adsp/venv/lib/python3.11/site-packages/llm_guard/evaluate.py", line 51, in scan_prompt
    sanitized_prompt, is_valid, risk_score = scanner.scan(sanitized_prompt)
                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adsp/venv/lib/python3.11/site-packages/llm_guard/input_scanners/toxicity.py", line 100, in scan
    inputs = self._match_type.get_inputs(prompt)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adsp/venv/lib/python3.11/site-packages/llm_guard/input_scanners/toxicity.py", line 45, in get_inputs
    return split_text_by_sentences(prompt)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adsp/venv/lib/python3.11/site-packages/llm_guard/util.py", line 231, in split_text_by_sentences
    return nltk.sent_tokenize(text.strip())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adsp/venv/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
    tokenizer = PunktTokenizer(language)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adsp/venv/lib/python3.11/site-packages/nltk/tokenize/punkt.py", line 1744, in __init__
    self.load_lang(lang)
  File "/home/adsp/venv/lib/python3.11/site-packages/nltk/tokenize/punkt.py", line 1749, in load_lang
    lang_dir = find(f"tokenizers/punkt_tab/{lang}/")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adsp/venv/lib/python3.11/site-packages/nltk/data.py", line 582, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/home/adsp/nltk_data'
    - '/home/adsp/venv/nltk_data'
    - '/home/adsp/venv/share/nltk_data'
    - '/home/adsp/venv/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

In llm-guard's pyproject.toml file, the nltk dependency is specified as nltk>=3.8,<4, which allows my application to install llm-guard together with nltk 3.8.2.

I believe a quick patch would be to pin nltk to version 3.8.1 until a better solution is implemented.
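To illustrate why the current specifier admits the breaking release, here is a minimal sketch that compares plain dotted version numbers as integer tuples. The helper names `parse_version` and `in_range` are hypothetical and only handle simple releases such as 3.8.2; real installers follow the full PEP 440 rules, which this does not attempt to reproduce.

```python
def parse_version(version: str, width: int = 3) -> tuple[int, ...]:
    """Turn a plain dotted release such as '3.8' into (3, 8, 0).

    Hypothetical helper: pads with zeros so '3.8' and '3.8.0' compare equal.
    """
    parts = [int(part) for part in version.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def in_range(version: str, lower: str, upper: str) -> bool:
    """Return True if lower <= version < upper for plain dotted releases."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# The range from the issue, nltk>=3.8,<4, accepts both releases,
# so a fresh install is free to pick 3.8.2.
for candidate in ("3.8.1", "3.8.2"):
    print(candidate, in_range(candidate, "3.8", "4"))
```

Both 3.8.1 and 3.8.2 fall inside the range, which is why a fresh environment can resolve to the newer release. An exact pin to 3.8.1 would exclude it.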

To Reproduce: Spin up llm-guard and attempt to use scan_prompt.

Expected behavior: llm-guard should handle the breaking change from nltk so that llm-guard itself does not break.
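One way a caller could work around the error in the meantime is to fetch the missing resource before tokenizing. The sketch below is an assumption about how that might look, not the fix that llm-guard adopted. `ensure_resource` and `ensure_punkt_tab` are hypothetical names; the resource path and package name are taken from the traceback, and `nltk.data.find` and `nltk.download` are the NLTK calls shown or suggested there.

```python
from typing import Callable


def ensure_resource(
    resource_path: str,
    package: str,
    find: Callable[[str], object],
    download: Callable[[str], object],
) -> bool:
    """Download `package` only if `find(resource_path)` raises LookupError.

    Returns True when a download was triggered, False when the resource
    was already present.
    """
    try:
        find(resource_path)
        return False
    except LookupError:
        download(package)
        return True


def ensure_punkt_tab() -> bool:
    """Wire the helper to NLTK (hypothetical convenience wrapper).

    nltk is imported lazily so that ensure_resource can be exercised
    without NLTK installed.
    """
    import nltk

    return ensure_resource(
        "tokenizers/punkt_tab/english/",  # path reported in the traceback
        "punkt_tab",                      # package named in the error message
        nltk.data.find,
        nltk.download,
    )
```

Calling `ensure_punkt_tab()` once at application startup, before the first `scan_prompt` call, would need network access on the first run. Whether that is acceptable depends on the deployment.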

Thanks, everyone! Please let me know if you need further information.

asofter commented 3 weeks ago

Hey @nicoADSP, thanks for reporting this in such detail. I upgraded the version.