raise ValueError("All arrays must be of the same length")

fishfree commented 4 months ago

(prujuai) wzzhang@ubuntugpu:~/prujuai$ python3 read_to_vectorstore.py
Converting to text: prujuai_moodle/praat_manual->praat_manual.pdf
An error occurred when processing the file prujuai_moodle/praat_manual->praat_manual.pdf: The command `pdf2txt.py prujuai_moodle/praat_manual->praat_manual.pdf` failed with exit code 1
------------- stdout -------------
b''------------- stderr -------------
b'Traceback (most recent call last):\n  File "/mnt/data/meme/.local/bin/pdf2txt.py", line 115, in <module>\n    if __name__ == \'__main__\': sys.exit(main(sys.argv))\n  File "/mnt/data/meme/.local/bin/pdf2txt.py", line 106, in main\n    for page in PDFPage.get_pages(fp, pagenos,\n  File "/mnt/data/meme/.local/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 125, in get_pages\n    raise PDFTextExtractionNotAllowed(\'Text extraction is not allowed: %r\' % fp)\npdfminer.pdfdocument.PDFTextExtractionNotAllowed: Text extraction is not allowed: <_io.BufferedReader name=\'prujuai_moodle/praat\xe4\xbd\xbf\xe7\x94\xa8\xe6\x89\x8b\xe5\x86\x8c->praat\xe4\xbd\xbf\xe7\x94\xa8\xe6\x89\x8b\xe5\x86\x8c.pdf\'>\n'. Unsupported file type?
Traceback (most recent call last):
  File "/mnt/data/meme/prujuai/read_to_vectorstore.py", line 214, in <module>
    df = main()
  File "/mnt/data/meme/prujuai/read_to_vectorstore.py", line 141, in main
    df = create_chunck_dataframe(material_headings, texts)
  File "/mnt/data/meme/prujuai/read_to_vectorstore.py", line 56, in create_chunck_dataframe
    df = pd.DataFrame({'Heading': material_headings, 'Text': texts})
  File "/mnt/data/meme/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 733, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/mnt/data/meme/.local/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 503, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/mnt/data/meme/.local/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 114, in arrays_to_mgr
    index = _extract_index(arrays)
  File "/mnt/data/meme/.local/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 677, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length

fishfree commented 4 months ago

I just noticed that some of the files had Chinese names. I renamed all of them to English filenames. However, the error still occurred, as shown below:

...
Converting to text: prujuai_moodle/praat_manual.pdf
An error occurred when processing the file prujuai_moodle/praat_manual.pdf: The command `pdf2txt.py prujuai_moodle/praat_manual.pdf` failed with exit code 1
------------- stdout -------------
b''------------- stderr -------------
b'Traceback (most recent call last):\n  File "/mnt/data/meme/.local/bin/pdf2txt.py", line 115, in <module>\n    if __name__ == \'__main__\': sys.exit(main(sys.argv))\n  File "/mnt/data/meme/.local/bin/pdf2txt.py", line 106, in main\n    for page in PDFPage.get_pages(fp, pagenos,\n  File "/mnt/data/meme/.local/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 125, in get_pages\n    raise PDFTextExtractionNotAllowed(\'Text extraction is not allowed: %r\' % fp)\npdfminer.pdfdocument.PDFTextExtractionNotAllowed: Text extraction is not allowed: <_io.BufferedReader name=\'prujuai_moodle/praat_manual.pdf\'>\n'. Unsupported file type?
Converting to text: prujuai_moodle/lcmc.pdf
Converting to text: prujuai_moodle/antconc.docx
An error occurred when processing the file prujuai_moodle/antconc.docx: "There is no item named 'word/document.xml' in the archive". Unsupported file type?
Converting to text: prujuai_moodle/praat.pptx
Converting to text: prujuai_moodle/antconc3.pdf
Converting to text: prujuai_moodle/CQP-syntax-tutorial.pdf
Converting to text: prujuai_moodle/CQPwebAdminManual.pdf
Converting to text: prujuai_moodle/cqpweb-consise.pdf
Traceback (most recent call last):
  File "/mnt/data/meme/prujuai/read_to_vectorstore.py", line 214, in <module>
    df = main()
  File "/mnt/data/meme/prujuai/read_to_vectorstore.py", line 141, in main
    df = create_chunck_dataframe(material_headings, texts)
  File "/mnt/data/meme/prujuai/read_to_vectorstore.py", line 56, in create_chunck_dataframe
    df = pd.DataFrame({'Heading': material_headings, 'Text': texts})
  File "/mnt/data/meme/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 733, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/mnt/data/meme/.local/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 503, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/mnt/data/meme/.local/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 114, in arrays_to_mgr
    index = _extract_index(arrays)
  File "/mnt/data/meme/.local/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 677, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length

jaluoma commented 4 months ago

This could be related to the app being unable to read the PDF file. More generally, I have also encountered random errors every now and then. Exception and error handling should be much better when ingesting files.
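
For illustration, here is a minimal sketch of what more defensive ingestion might look like. It is not the actual code in `read_to_vectorstore.py`. It assumes the length mismatch arises because a file that fails conversion still contributes a heading but no text, which the logs suggest but do not confirm. The names `collect_texts` and `fake_convert` are hypothetical and stand in for whatever the script really uses.

```python
from typing import Callable, Iterable, List, Tuple


def collect_texts(
    paths: Iterable[str],
    convert: Callable[[str], str],
) -> Tuple[List[str], List[str], List[Tuple[str, str]]]:
    """Convert each file and keep headings and texts the same length.

    A heading is recorded only after its file converts successfully,
    so a failed file can never leave the two lists out of step.
    """
    headings: List[str] = []
    texts: List[str] = []
    failures: List[Tuple[str, str]] = []
    for path in paths:
        try:
            text = convert(path)
        except Exception as exc:  # broad on purpose: skip any unreadable file
            failures.append((path, str(exc)))
            continue
        headings.append(path)
        texts.append(text)
    return headings, texts, failures


def fake_convert(path: str) -> str:
    """Hypothetical stand-in for the real converter, for demonstration only."""
    if path.endswith("praat_manual.pdf"):
        raise RuntimeError("Text extraction is not allowed")
    return f"text of {path}"


headings, texts, failures = collect_texts(
    ["prujuai_moodle/praat_manual.pdf", "prujuai_moodle/lcmc.pdf"],
    fake_convert,
)
for path, reason in failures:
    print(f"Skipped {path}: {reason}")
```

With both lists built this way, a dict such as `{'Heading': headings, 'Text': texts}` always has columns of equal length, and the skipped files are reported instead of crashing the run later in `pd.DataFrame`.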

jaluoma / pruju-ai

raise ValueError("All arrays must be of the same length") #7