peterw / Chat-with-Github-Repo

This repository contains two Python scripts that demonstrate how to create a chatbot using Streamlit, OpenAI GPT-3.5-turbo, and Activeloop's Deep Lake.
MIT License

Deeplake Transform failed error #6

Closed sai-krishna-msk closed 1 year ago

sai-krishna-msk commented 1 year ago

Following is the entire error thread

fatal: destination path './gumroad' already exists and is not an empty directory.
Created a chunk of size 1020, which is longer than the specified 1000
Created a chunk of size 1540, which is longer than the specified 1000
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/sai13579/code_repo_qa2

hub://sai13579/code_repo_qa2 loaded successfully.

Deep Lake Dataset in hub://sai13579/code_repo_qa2 already exists, loading from the storage
Dataset(path='hub://sai13579/code_repo_qa2', tensors=[])

 tensor    htype    shape    dtype  compression
 -------  -------  -------  -------  -------
Evaluating ingest: 0%|                                                                                    | 0/1 [00:02<? 
Error in sys.excepthook:
Traceback (most recent call last):
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\humbug\report.py", line 540, in _hook 
    self.error_report(error=exception_instance, tags=tags, publish=publish)
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\humbug\report.py", line 274, in error_report
    traceback.format_exception(
TypeError: format_exception() got an unexpected keyword argument 'etype'

Original exception was:
Traceback (most recent call last):
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\deeplake\core\transform\transform_tensor.py", line 117, in append
    raise TensorDoesNotExistError(self.name)
deeplake.util.exceptions.TensorDoesNotExistError: "Tensor 'text' does not exist."

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\deeplake\util\transform.py", line 207, in _transform_and_append_data_slice
    out = transform_sample(sample, pipeline, tensors)
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\deeplake\util\transform.py", line 75, in transform_sample
    fn(out, result, *args, **kwargs)
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\langchain\vectorstores\deeplake.py", line 219, in ingest
    sample_out.append(
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\deeplake\core\transform\transform_dataset.py", line 67, in append
    self[k].append(v)
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\deeplake\core\transform\transform_tensor.py", line 127, in append
    raise SampleAppendError(self.name, item) from e
deeplake.util.exceptions.SampleAppendError: Failed to append the sample [core]
        repositoryformatversion = 0
        filemode = false
        bare = false
        logallrefupdates = true
        symlinks = false
        ignorecase = true
[remote "origin"]
        url = https://github.com/sai-krishna-msk/VtopScrapper
        fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
        remote = origin
        merge = refs/heads/master to the tensor 'text'. See more details in the traceback.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\320117176\OneDrive - Philips\Documents\projects\ai_agent\Chat-with-Github-Repo\github.py", line 53, in <module>
    main(repo_url, root_dir, deeplake_repo_name, deeplake_username)
  File "c:\Users\320117176\OneDrive - Philips\Documents\projects\ai_agent\Chat-with-Github-Repo\github.py", line 44, in main
    db.add_documents(texts)
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\langchain\vectorstores\base.py", line 61, in add_documents
    return self.add_texts(texts, metadatas, **kwargs)
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\langchain\vectorstores\deeplake.py", line 236, in add_texts
    ingest().eval(
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\deeplake\core\transform\transform.py", line 99, in eval
    pipeline.eval(
  File "C:\Users\320117176\AppData\Local\anaconda3\envs\ai_agent\lib\site-packages\deeplake\core\transform\transform.py", line 298, in eval
    raise TransformError(
deeplake.util.exceptions.TransformError: Transform failed at index 0 of the input data on the item: [('[core]\n\trepositoryformatversion = 0\n\tfilemo...n\\HEAD'}, 'a217eccf-e42f-11ed-94dd-f47b099e160e')]. See traceback for more details. 

Can anyone please help me understand and resolve this issue?

Thank you in advance 🙌

FayazRahman commented 1 year ago

Hey @sai-krishna-msk, it looks like your dataset has no tensors. You can create tensors using ds.create_tensor. Do tell me if you need more help!

sai-krishna-msk commented 1 year ago

> Hey @sai-krishna-msk, it looks like your dataset has no tensors. You can create tensors using ds.create_tensor. Do tell me if you need more help!

@FayazRahman, thank you for the swift response.

I'm sorry, but I have never worked with the deeplake package before, so I'm still not sure what the issue is. Can you kindly tell me what I am missing? (When you say my dataset does not have tensors, do you mean the GitHub repo I am working with has no code?) If and when you have time, can you please elaborate and point me to where I have to modify the code?

Your help is much appreciated

On a side note, I was able to make the code work.

First I tried with my private repo's code (let's call it repo-1), and it threw the error above. Then I tried another public repo (let's call it repo-2), but it still failed. After some debugging I found that, despite me changing the URL to repo-2, the code was still working with repo-1. Once I deleted the gumroad directory (which the code creates to store repo files), the code started working with repo-2.

Keeping that bug aside, I am still trying to figure out why the code did not work with repo-1.

I will post an update if I find out.

But if anyone else figures it out, please let me know. Thank you in advance.

sanchitram1 commented 1 year ago

I had a new script where I ran this, and it worked:

```python
import os
import deeplake

api_key = os.getenv("<deeplake_api>")

# create an empty "data store" on deeplake. overwrite=True so I could keep reusing it
ds = deeplake.empty('hub://<your organization from deeplake>/<whatever you want to call it>', token=api_key, overwrite=True)

# create tensors mimicking the output sample from github.py
ds.create_tensor("ids")
ds.create_tensor("metadata")
ds.create_tensor("embedding")
ds.create_tensor("text", htype="text")
```

IMO it's worth adding to the instructions, but I think what's going on here is that the github.py script outputs samples with the tensor layout ['ids', 'metadata', 'embedding', 'text'], so you need to mimic that structure in your deeplake datastore.

sai-krishna-msk commented 1 year ago

Thank you @sanchitram1, I think that should fix it.

I could not figure out the issue, but based on the error messages it was clearly a deeplake issue, so I swapped out Deep Lake as the vector database for Pinecone.

It is currently working with Pinecone, which I found much simpler to work with than Deep Lake (although I am sure there are reasonable trade-offs between the two).

Here is the working code for the same project, but with Pinecone: Pinecone version of Chat-with-Github

Note: Hi @peterw, I have credited you in my repo. Please let me know if that is not sufficient, and I'll do the necessary.