su77ungr / CASALIOY

♾️ toolkit for air-gapped LLMs on consumer-grade hardware
Apache License 2.0
230 stars, 31 forks

data privacy #115

Closed bp020108 closed 1 year ago

bp020108 commented 1 year ago

Issue you'd like to raise.

Will this project leak any data from local documents?

Suggestion:

No response

su77ungr commented 1 year ago

Not at all. Once the installation is done, it does not require an internet connection at all. It's as air-gapped as it gets.

If needed, I could package an installation that runs from a USB medium, like Tails OS, with no need for a connection to begin with.

su77ungr commented 1 year ago

If any questions remain feel free to respond here and I'll reopen the issue.

bp020108 commented 1 year ago

Thanks for the reply.

Does this project have an issue with ingest.py when embedding more than 3 documents?

I have tried another project on my HP server (22 cores, without a GPU), but ingest.py is not able to complete the process if we add more than 3 documents.

Also, can we run the same project in a conda env on an Ubuntu server instead of Docker?

Do we need to clone the git repo, or is that part of the procedure?

On Thu, Aug 17, 2023, 7:54 AM su77ungr @.***> wrote:

Closed #115 https://github.com/su77ungr/CASALIOY/issues/115 as completed.


su77ungr commented 1 year ago

This project's test folder already contains more than 10 different media types, which can be ingested within milliseconds via multithreaded ingestion. We beat PrivateGPT in performance. We also chose Qdrant, which should be far more performant when it comes to MMR (maximal marginal relevance).

This repo does not have "GPT" in its name, so people with less background knowledge tend to skip over it.

Either use Docker or refer to the installation from source. That is also done within a minute. I can't offer a guide for conda. I would recommend installing from source.
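To illustrate the multithreaded ingestion mentioned above, here is a minimal sketch of loading several documents concurrently with a pool and `imap_unordered` (the same pool method that appears in the traceback further down). It is not CASALIOY's actual code: `load_text` and `ingest_in_parallel` are hypothetical names, and a thread pool is assumed here for simplicity.

```python
from multiprocessing.pool import ThreadPool
from pathlib import Path
from typing import Callable, Iterable, List


def load_text(path: Path) -> str:
    """Hypothetical loader: read one document as UTF-8 text."""
    return path.read_text(encoding="utf-8", errors="replace")


def ingest_in_parallel(
    paths: Iterable[Path],
    loader: Callable[[Path], str] = load_text,
    workers: int = 4,
) -> List[str]:
    """Load every document with a pool of worker threads.

    Results arrive in completion order, not input order.
    """
    items = list(paths)
    if not items:
        return []
    with ThreadPool(processes=workers) as pool:
        # Consume the iterator inside the `with` block, before the pool is torn down.
        return list(pool.imap_unordered(loader, items))
```

Because results come back in completion order, any step that needs the original file order would have to carry the path along with the loaded text.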

bp020108 commented 1 year ago

Thanks for the info. This is for a local GPT fed with our own documents, without an outside API or internet access, to keep the data safe.

I am seeing the error below. Can you please help?

(casalioy-py3.11) root@47c87b8ed509:/srv/CASALIOY# python3.11 casalioy/ingest.py
found local model dir at models/sentence-transformers/all-MiniLM-L6-v2
found local model file at models/eachadea/ggml-vicuna-7b-1.1/ggml-vic7b-q5_1.bin
Delete current database?(Y/N): Y
Deleting db...
Scanning files
found local model dir at models/sentence-transformers/all-MiniLM-L6-v2 ] 0/ 8 eta [?:??:??]
found local model file at models/eachadea/ggml-vicuna-7b-1.1/ggml-vic7b-q5_1.bin
found local model dir at models/sentence-transformers/all-MiniLM-L6-v2
found local model file at models/eachadea/ggml-vicuna-7b-1.1/ggml-vic7b-q5_1.bin
found local model dir at models/sentence-transformers/all-MiniLM-L6-v2
found local model file at models/eachadea/ggml-vicuna-7b-1.1/ggml-vic7b-q5_1.bin
[nltk_data] Downloading package punkt to /root/nltk_data... ] 2/ 8 eta [00 14
[nltk_data] Downloading package punkt to /root/nltk_data...=====================================> 4 05
[nltk_data] Downloading package punkt to /root/nltk_data... 6
[nltk_data] Unzipping tokenizers/punkt.zip. 7
[nltk_data] Error with downloaded zip file
50.0% [=======================================================================================> ] 4/ 8 eta [00:00]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/srv/CASALIOY/casalioy/ingest.py", line 125, in process_one_doc
    document = self.load_one_doc(filepath)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/CASALIOY/casalioy/ingest.py", line 74, in load_one_doc
    return self.file_loadersfilepath.suffix[1:].load()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 70, in load
    elements = self._get_elements()
               ^^^^^^^^^^^^^^^^^^^^
  File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/langchain/document_loaders/markdown.py", line 25, in _get_elements
    return partition_md(filename=self.file_path, **self.unstructured_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/partition/md.py", line 52, in partition_md
    return partition_html(
           ^^^^^^^^^^^^^^^
  File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/partition/html.py", line 91, in partition_html
    layout_elements = document_to_element_list(document, include_page_breaks=include_page_breaks)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/partition/common.py", line 73, in document_to_element_list
    num_pages = len(document.pages)
                    ^^^^^^^^^^^^^^
  File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/documents/xml.py", line 52, in pages
    self._pages = self._read()
                  ^^^^^^^^^^^^
  File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/documents/html.py", line 116, in _read
    element = _parse_tag(tag_elem)
              ^^^^^^^^^^^^^^^^^^^^
  File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/documents/html.py", line 222, in _parse_tag
    return _text_to_element(text, tag_elem.tag, ancestortags)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/documents/html.py", line 237, in _text_to_element
    elif is_narrative_tag(text, tag):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/documents/html.py", line 265, in is_narrative_tag
    return tag not in HEADING_TAGS and is_possible_narrative_text(text)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 76, in is_possible_narrative_text
    if exceeds_cap_ratio(text, threshold=cap_threshold):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 273, in exceeds_cap_ratio
    if sentence_count(text, 3) > 1:
       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 222, in sentence_count
    sentences = sent_tokenize(text)
                ^^^^^^^^^^^^^^^^^^^
  File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 38, in sent_tokenize
    return _sent_tokenize(text)
           ^^^^^^^^^^^^^^^^^^^^
  File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
    tokenizer = load(f"tokenizers/punkt/{language}.pickle")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/nltk/data.py", line 750, in load
    opened_resource = _open(resource_url)
                      ^^^^^^^^^^^^^^^^^^^
  File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/nltk/data.py", line 876, in _open
    return find(path_, path + [""]).open()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError:


Resource punkt not found. Please use the NLTK Downloader to obtain the resource:

import nltk
nltk.download('punkt')

For more information see: https://www.nltk.org/data.html

Attempted to load tokenizers/punkt/PY3/english.pickle

Searched in:

"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/CASALIOY/casalioy/ingest.py", line 170, in <module>
    main(sources_directory, cleandb)
  File "/srv/CASALIOY/casalioy/ingest.py", line 164, in main
    ingester.ingest_from_directory(sources_directory, chunk_size, chunk_overlap)
  File "/srv/CASALIOY/casalioy/ingest.py", line 144, in ingest_from_directory
    for embeddings in pb(pool.imap_unordered(self.process_one_doc, all_items), total=len(all_items)):
  File "/srv/CASALIOY/.venv/lib/python3.11/site-packages/prompt_toolkit/shortcuts/progress_bar/base.py", line 353, in __iter__
    for item in self.data:
  File "/usr/local/lib/python3.11/multiprocessing/pool.py", line 873, in next
    raise value
LookupError:


Resource punkt not found. Please use the NLTK Downloader to obtain the resource:

import nltk
nltk.download('punkt')

For more information see: https://www.nltk.org/data.html

Attempted to load tokenizers/punkt/PY3/english.pickle

Searched in:

(casalioy-py3.11) root@47c87b8ed509:/srv/CASALIOY#
/usr/local/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

(casalioy-py3.11) root@47c87b8ed509:/srv/CASALIOY#

su77ungr commented 1 year ago

'punkt' is not available in your environment, or was never downloaded. Did you try starting python3 and then running import nltk followed by nltk.download('punkt')?
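The log above shows several workers each starting a download of punkt, followed by "[nltk_data] Error with downloaded zip file". One plausible reading (an assumption, not confirmed in this thread) is that the concurrent downloads interfered with each other, so fetching the resource once before running ingest.py may help. A minimal sketch of a "check first, download only if missing" helper follows; `ensure_resource` is a hypothetical name, and the lookup and download functions are passed in so the helper itself has no NLTK dependency.

```python
from typing import Callable


def ensure_resource(
    find: Callable[[str], object],
    download: Callable[[str], bool],
    resource_path: str = "tokenizers/punkt",
    package: str = "punkt",
) -> bool:
    """Return True if the resource is present, downloading it only when missing.

    `find` is expected to raise LookupError when the resource is absent,
    and `download` to return a truthy value on success.
    """
    try:
        find(resource_path)
        return True
    except LookupError:
        # Resource not found locally: fetch it once, in this single process.
        return bool(download(package))
```

With NLTK installed, the call would be `ensure_resource(nltk.data.find, nltk.download)`, run once in a single process while a connection is still available. `nltk.data.find` raises `LookupError` for a missing resource, which is the same error shown in the traceback.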

As I said, you should use the installation from source. Docker may cause issues on platforms other than Windows 11H2; on Ubuntu it is unstable.

To install, please refer to the README:

git clone https://github.com/su77ungr/CASALIOY && cd CASALIOY/
python -m pip install poetry
python -m poetry config virtualenvs.in-project true
python -m poetry install
. .venv/bin/activate
python -m pip install --force streamlit sentence_transformers  # Temporary bandaid fix, waiting for streamlit >=1.23
pre-commit install

bp020108 commented 1 year ago

Urgent:

Is Docker bypassing the Ubuntu host's UFW firewall? Docker is still able to access the internet. Can you please help me understand why? I want to block internet access after downloading the Docker images.

If I paste documents into the source directory in Docker, can they be accessed from the internet? If Docker has internet access, can the source documents be accessed from outside?

su77ungr commented 1 year ago

Just install it on a vm and disable the internet connection. DON'T USE DOCKER if you don't know how to firewall it or how to use it.
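One way to confirm that the connection really is disabled is to attempt an outbound TCP connection from inside the VM or container and check that it fails. The sketch below uses only the Python standard library; `can_connect` is a hypothetical helper, and the host and port to probe are up to the reader.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable networks and DNS failures.
        return False
```

Run inside the isolated environment against a public address of your choice, this should return False once outbound access is blocked. For Docker specifically, `docker run --network none` starts a container with no external networking, which is one way to isolate a container without relying on host firewall rules.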

bp020108 commented 1 year ago

As I mentioned, I have already disabled internet access on the Ubuntu host with the UFW firewall. But I think I need to disable it specifically for Docker.

If I do not disable it for Docker, what is the risk? Is the source directory or my Docker container accessible from outside?

I am new to Docker and want to use it, but I need your help to understand the risk with your project.
