techleadhd / chatgpt-retrieval


JSONDecodeError: Expecting value: line 1 column 2 (char 1) #41

Open presidentofyes12 opened 1 year ago

presidentofyes12 commented 1 year ago

I had the idea to build a dataset from the code samples in IBM's Project CodeNet (a little under 14 million files in total). I converted them all into text files (ending in .txt rather than .py, .c, etc.). After a couple of earlier encoding issues, which I solved by removing about 1 million files with incorrect encodings (reducing the set to about 12 million files), I tried to run it again. It then gave a different error:

Traceback (most recent call last):
  File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/unstructured/partition/json.py", line 45, in partition_json
    dict = json.loads(file_text)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/media/impromise/ExternalDrive/chatgpt-retrieval-main.txt.txt/chatgpt.py", line 36, in <module>
    index = VectorstoreIndexCreator().from_loaders([loader])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/langchain/indexes/vectorstore.py", line 81, in from_loaders
    docs.extend(loader.load())
                ^^^^^^^^^^^^^
  File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/langchain/document_loaders/directory.py", line 156, in load
    self.load_file(i, p, docs, pbar)
  File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/langchain/document_loaders/directory.py", line 105, in load_file
    raise e
  File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/langchain/document_loaders/directory.py", line 99, in load_file
    sub_docs = self.loader_cls(str(item), **self.loader_kwargs).load()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 86, in load
    elements = self._get_elements()
               ^^^^^^^^^^^^^^^^^^^^
  File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 172, in _get_elements
    return partition(filename=self.file_path, **self.unstructured_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/unstructured/partition/auto.py", line 230, in partition
    elements = partition_json(filename=filename, file=file, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/unstructured/documents/elements.py", line 138, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 519, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/unstructured/partition/json.py", line 48, in partition_json
    raise ValueError("Not a valid json")
ValueError: Not a valid json

This has had me stuck for a while. The same error had come up before, and I decided to move the folders that triggered it aside to be fixed later, but there are 4053 folders overall and I couldn't move all of the problematic ones. I also looked through my dataset to make sure there were no shell, JSON, or CSV files that could cause such an issue, and there were none: it's just .txt files, the cat.pdf file, and a Word doc. Without the dataset, I've been able to run the program with little difficulty, and certain (if not most) folders worked fine as datasets on their own.

Why is this error happening and how can I fix it? I am using Ubuntu 22.04 and Python 3.11.4. I've placed the program files on an external hard drive, where the program has been able to run. If the error can't be fixed, is there any way to work around it?

Junaid-Nazir-828 commented 1 year ago

json.loads() takes a JSON string and converts it into a Python object (a dictionary, when the JSON text is an object). By JSON string we mean JSON text held in a Python string, such as '{"name":"John", "age":30, "car":null}'. You can't access any value in it with dictionary methods, because it is still a string. That's why you convert it into a dictionary with json.loads(STRING_NAME).
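To make this concrete, here is a minimal sketch using only the standard library. The `not_json` sample is a made-up stand-in for a source file, not something taken from the dataset; it is just one kind of input that produces the same message as the first traceback.

```python
import json

# JSON text held in a Python string: still just a str at this point.
json_text = '{"name": "John", "age": 30, "car": null}'

# json.loads() parses the text into a dict; JSON null becomes None.
person = json.loads(json_text)
print(person["name"], person["car"])


def try_parse(text):
    """Return the parsed value, or the decoder's error message on failure."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        return str(exc)


# Hypothetical sample: source code that happens to start with '['.
# The decoder accepts '[' as the start of an array, then fails on 'x'.
not_json = "[x for x in range(3)]"
error_message = try_parse(not_json)
print(error_message)
```

The second call fails at character 1 because the opening bracket is accepted as the start of an array and the next character is not a valid JSON value.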

The text you are loading does not contain data in json form.
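Since the dataset reportedly contains no .json files, the traceback (auto.py handing a file to partition_json) suggests that some file was classified as JSON from its contents. The exact detection rule isn't shown in this thread, so the following is only a rough sketch under the assumption that files opening with `[` or `{` are the ones at risk. The helper names `looks_like_broken_json` and `find_suspect_files` are made up for illustration.

```python
import json
from pathlib import Path


def looks_like_broken_json(text: str) -> bool:
    """Return True if text opens like JSON ('[' or '{') but fails to parse."""
    stripped = text.lstrip()
    if not stripped or stripped[0] not in "[{":
        return False
    try:
        json.loads(stripped)
    except json.JSONDecodeError:
        return True
    return False


def find_suspect_files(root: str, pattern: str = "*.txt"):
    """Yield files under root whose contents look like broken JSON."""
    for path in Path(root).rglob(pattern):
        try:
            # errors="replace" keeps odd encodings from aborting the scan.
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skip it rather than stop
        if looks_like_broken_json(text):
            yield path
```

Running `list(find_suspect_files(root))` with `root` set to the dataset folder should list candidate files to inspect or move aside. The traceback also shows the directory loader instantiating `self.loader_cls`, so passing a plain-text loader class there instead of the default may avoid the JSON path altogether, though I haven't verified that against this langchain version.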