run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: I keep getting the same error, "AttributeError: 'Document' object has no attribute 'get_doc_id'", no matter which approach I try #15785

Open alishashajahan opened 2 months ago

alishashajahan commented 2 months ago

Bug Description

The code loads a JSON file, converts each entry into a Document object, and then creates an index from these documents using GPTVectorStoreIndex.

The purpose is to prepare the data for efficient querying and analysis using the GPT model.

import json
from llama_index.legacy.indices.vector_store.base import GPTVectorStoreIndex

# Define the Document class (if not already defined)
class Document:
    def __init__(self, post_id, subreddit, created_utc, post_title, num_comments,
                 upvote_ratio, created_date, created_year, comments, sentiment, emotion):
        self.post_id = post_id
        self.subreddit = subreddit
        self.created_utc = created_utc
        self.post_title = post_title
        self.num_comments = num_comments
        self.upvote_ratio = upvote_ratio
        self.created_date = created_date
        self.created_year = created_year
        self.comments = comments
        self.sentiment = sentiment
        self.emotion = emotion

# Load the JSON file
with open(r'C:\Users\D E L L\Desktop\merged_finalfile\completd_final_fornatted_jasonfile.json') as file:
    data = json.load(file)

# Convert the JSON data to a list of Document objects
documents = [Document(
    post_id=doc['post_id'],
    subreddit=doc['subreddit'],
    created_utc=doc['created_utc'],
    post_title=doc['post_title'],
    num_comments=doc['num_comments'],
    upvote_ratio=doc['upvote_ration'],  # Corrected key
    created_date=doc['created_date'],
    created_year=doc['created_year'],
    comments=doc['comments'],
    sentiment=doc['sentiment_y'],
    emotion=doc['emotion']
) for doc in data]

# Create the index from the documents
index = GPTVectorStoreIndex.from_documents(documents)

Version

0.11.1

Relevant Logs/Tracebacks

Output I got as an error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[48], line 41
     26 documents = [Document(
     27     post_id=doc['post_id'],
     28     subreddit=doc['subreddit'],
   (...)
     37     emotion=doc['emotion']
     38 ) for doc in data]
     40 # Create the index from the documents
---> 41 index = GPTVectorStoreIndex.from_documents(documents)

File ~\anaconda3\Lib\site-packages\llama_index\legacy\indices\base.py:106, in BaseIndex.from_documents(cls, documents, storage_context, service_context, show_progress, **kwargs)
    104 with service_context.callback_manager.as_trace("index_construction"):
    105     for doc in documents:
--> 106         docstore.set_document_hash(doc.get_doc_id(), doc.hash)
    108     nodes = run_transformations(
    109         documents,  # type: ignore
    110         service_context.transformations,
    111         show_progress=show_progress,
    112         **kwargs,
    113     )
    115     return cls(
    116         nodes=nodes,
    117         storage_context=storage_context,
   (...)
    120         **kwargs,
    121     )

AttributeError: 'Document' object has no attribute 'get_doc_id'

dosubot[bot] commented 2 months ago

I found a similar closed issue: AttributeError: 'TextNode' object has no attribute 'get_doc_id'. The suggested solution is to ensure that the Document class you are using has the doc_id attribute properly set and to replace any calls to get_doc_id with the doc_id property.

Here is a corrected version of your code:

import json
from llama_index.legacy.indices.vector_store.base import GPTVectorStoreIndex
from llama_index.core import Document  # Import the library's Document class (llama-index 0.10+ import path)

# Load the JSON file
with open(r'C:\Users\D E L L\Desktop\merged_finalfile\completd_final_fornatted_jasonfile.json') as file:
    data = json.load(file)

# Convert the JSON data to a list of Document objects
documents = [Document(
    text=doc['post_title'],  # Assuming 'post_title' is the main text content
    doc_id=doc['post_id'],  # Set the doc_id
    metadata={
        'subreddit': doc['subreddit'],
        'created_utc': doc['created_utc'],
        'num_comments': doc['num_comments'],
        'upvote_ratio': doc['upvote_ration'],  # Corrected key
        'created_date': doc['created_date'],
        'created_year': doc['created_year'],
        'comments': doc['comments'],
        'sentiment': doc['sentiment_y'],
        'emotion': doc['emotion']
    }
) for doc in data]

# Create the index from the documents
index = GPTVectorStoreIndex.from_documents(documents)

Make sure you are using the Document class from the llama_index module, which includes the doc_id property. This should resolve the AttributeError you are encountering [1].

To continue talking to Dosu, mention @dosu.

alishashajahan commented 2 months ago

@dosu I am facing the same error again.

dosubot[bot] commented 2 months ago

The error persists because the custom Document class you defined does not have the get_doc_id method or the doc_id attribute that the GPTVectorStoreIndex expects. You should use the Document class provided by the llama_index module, which includes the necessary attributes and methods.

Here is the corrected version of your code:

import json
from llama_index.legacy.indices.vector_store.base import GPTVectorStoreIndex
from llama_index.core import Document  # Import the library's Document class (llama-index 0.10+ import path)

# Load the JSON file
with open(r'C:\Users\D E L L\Desktop\merged_finalfile\completd_final_fornatted_jasonfile.json') as file:
    data = json.load(file)

# Convert the JSON data to a list of Document objects
documents = [Document(
    text=doc['post_title'],  # Assuming 'post_title' is the main text content
    doc_id=doc['post_id'],  # Set the doc_id
    metadata={
        'subreddit': doc['subreddit'],
        'created_utc': doc['created_utc'],
        'num_comments': doc['num_comments'],
        'upvote_ratio': doc['upvote_ration'],  # Corrected key
        'created_date': doc['created_date'],
        'created_year': doc['created_year'],
        'comments': doc['comments'],
        'sentiment': doc['sentiment_y'],
        'emotion': doc['emotion']
    }
) for doc in data]

# Create the index from the documents
index = GPTVectorStoreIndex.from_documents(documents)

Ensure you are using the Document class from the llama_index module, which includes the doc_id property and other necessary methods [1].

logan-markewich commented 2 months ago

@alishashajahan Yes, Document has no attribute get_doc_id because you defined your own Document class, which will not work with anything in llama-index.

Best to use the Document class that's already in llama-index, no? You can add all this info as metadata, as dosu was trying to suggest.
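As a rough sketch of the "add all this info as metadata" suggestion, the snippet below splits each JSON record into a text field, an ID, and a metadata dict using only the standard library. The helper name `record_to_document_kwargs`, the `METADATA_KEYS` mapping, and the sample record are made up for illustration. The field names come from the code in the issue, and choosing `post_title` as the text is the same assumption dosu made. The final commented lines assume llama-index 0.10 or later, where Document is importable from `llama_index.core`.

```python
import json

# Output metadata name -> key in the JSON record. 'upvote_ration' and
# 'sentiment_y' are spelled the way they appear in the issue's code.
METADATA_KEYS = {
    "subreddit": "subreddit",
    "created_utc": "created_utc",
    "num_comments": "num_comments",
    "upvote_ratio": "upvote_ration",
    "created_date": "created_date",
    "created_year": "created_year",
    "comments": "comments",
    "sentiment": "sentiment_y",
    "emotion": "emotion",
}


def record_to_document_kwargs(record):
    """Split one JSON record into text, doc_id, and metadata (hypothetical helper)."""
    return {
        "text": record["post_title"],  # assumed to be the main text content
        "doc_id": str(record["post_id"]),
        "metadata": {name: record[key] for name, key in METADATA_KEYS.items()},
    }


# Made-up sample record standing in for the contents of the real JSON file.
sample = json.loads("""
[{"post_id": "abc123", "subreddit": "example", "created_utc": 1700000000,
  "post_title": "An example post", "num_comments": 2, "upvote_ration": 0.97,
  "created_date": "2023-11-14", "created_year": 2023,
  "comments": ["first", "second"], "sentiment_y": "positive", "emotion": "joy"}]
""")

kwargs_list = [record_to_document_kwargs(record) for record in sample]
print(kwargs_list[0]["doc_id"], sorted(kwargs_list[0]["metadata"]))

# With llama-index installed, each dict could then be passed on, for example:
#   from llama_index.core import Document
#   documents = [Document(**kwargs) for kwargs in kwargs_list]
```

Keeping the record-to-kwargs step separate from the library call makes it easy to check the field mapping on a few records before building an index.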