Open fishfree opened 4 months ago
I just noticed that some of the files had Chinese names. I renamed all of them to English filenames. However, the error still occurred, as shown below:
...
Converting to text: prujuai_moodle/praat_manual.pdf
An error occurred when processing the file prujuai_moodle/praat_manual.pdf: The command `pdf2txt.py prujuai_moodle/praat_manual.pdf` failed with exit code 1
------------- stdout -------------
b''------------- stderr -------------
b'Traceback (most recent call last):\n File "/mnt/data/meme/.local/bin/pdf2txt.py", line 115, in <module>\n if __name__ == \'__main__\': sys.exit(main(sys.argv))\n File "/mnt/data/meme/.local/bin/pdf2txt.py", line 106, in main\n for page in PDFPage.get_pages(fp, pagenos,\n File "/mnt/data/meme/.local/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 125, in get_pages\n raise PDFTextExtractionNotAllowed(\'Text extraction is not allowed: %r\' % fp)\npdfminer.pdfdocument.PDFTextExtractionNotAllowed: Text extraction is not allowed: <_io.BufferedReader name=\'prujuai_moodle/praat_manual.pdf\'>\n'. Unsupported file type?
Converting to text: prujuai_moodle/lcmc.pdf
Converting to text: prujuai_moodle/antconc.docx
An error occurred when processing the file prujuai_moodle/antconc.docx: "There is no item named 'word/document.xml' in the archive". Unsupported file type?
Converting to text: prujuai_moodle/praat.pptx
Converting to text: prujuai_moodle/antconc3.pdf
Converting to text: prujuai_moodle/CQP-syntax-tutorial.pdf
Converting to text: prujuai_moodle/CQPwebAdminManual.pdf
Converting to text: prujuai_moodle/cqpweb-consise.pdf
Traceback (most recent call last):
File "/mnt/data/meme/prujuai/read_to_vectorstore.py", line 214, in <module>
df = main()
File "/mnt/data/meme/prujuai/read_to_vectorstore.py", line 141, in main
df = create_chunck_dataframe(material_headings, texts)
File "/mnt/data/meme/prujuai/read_to_vectorstore.py", line 56, in create_chunck_dataframe
df = pd.DataFrame({'Heading': material_headings, 'Text': texts})
File "/mnt/data/meme/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 733, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "/mnt/data/meme/.local/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 503, in dict_to_mgr
return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
File "/mnt/data/meme/.local/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 114, in arrays_to_mgr
index = _extract_index(arrays)
File "/mnt/data/meme/.local/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 677, in _extract_index
raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length
This could have something to do with the fact that the app is unable to read the PDF file. More generally, I have also encountered random errors every now and then. Exception and error handling should be much better when ingesting files.
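As a rough illustration of what more robust ingestion could look like, here is a minimal sketch. It assumes the `ValueError` above comes from `material_headings` and `texts` ending up with different lengths once some conversions fail, which I have not verified in `read_to_vectorstore.py`. The names `convert_all` and `convert` are hypothetical and not part of the project.

```python
from typing import Callable, Iterable, List, Tuple


def convert_all(
    paths: Iterable[str],
    convert: Callable[[str], str],
) -> Tuple[List[str], List[str], List[Tuple[str, str]]]:
    """Convert each file, keeping headings and texts the same length.

    `convert` is a hypothetical stand-in for whatever function turns one
    file into text. A file that fails is recorded in `failures` and
    skipped, so it adds nothing to either output list.
    """
    headings: List[str] = []
    texts: List[str] = []
    failures: List[Tuple[str, str]] = []

    for path in paths:
        try:
            text = convert(path)
        except Exception as exc:  # broad on purpose: any converter error
            failures.append((path, str(exc)))
            continue
        # Append to both lists together, only after a successful conversion.
        headings.append(path)
        texts.append(text)

    return headings, texts, failures
```

With something like this, the two lists passed to `pd.DataFrame` would always have matching lengths, and the skipped files could be reported in one summary at the end instead of crashing the run.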