Open Vonewman opened 4 months ago
You could use the following libraries for parsing PDF and DOCX documents:
1) PyPDF2: A pure-python library built as a PDF toolkit. It can be used to extract text, metadata, and other information from PDF files.
2) pdfplumber: A Python library that makes it easy to extract text and other information from PDF files.
3) docx2txt: A Python library that extracts the text from a .docx file.
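As a rough illustration of how these libraries might fit together, here is a minimal sketch that reads a file with pdfplumber or docx2txt depending on its extension. The function names (`extract_pdf_text`, `extract_docx_text`, `extract_text`) are hypothetical helpers made up for this example, and it assumes both packages are installed and that the PDFs have a text layer (scanned PDFs would need OCR instead).

```python
from pathlib import Path


def extract_pdf_text(path):
    """Return the text of a PDF, one page after another."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        # extract_text() may return None for a page without a text layer,
        # so fall back to an empty string.
        pages = [page.extract_text() or "" for page in pdf.pages]
    return "\n".join(pages)


def extract_docx_text(path):
    """Return the text of a .docx file."""
    import docx2txt  # third-party: pip install docx2txt

    return docx2txt.process(str(path))


def extract_text(path):
    """Pick an extractor based on the file extension (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return extract_pdf_text(path)
    if suffix == ".docx":
        return extract_docx_text(path)
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

The third-party imports sit inside the functions so that a missing package only fails when that file type is actually encountered. PyPDF2 could be swapped in for the PDF branch via `PdfReader(path).pages` and `page.extract_text()`.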
Hello, I'm in the process of fine-tuning a Large Language Model (LLM) for an NGO, and I need to construct an instruction dataset from .pdf and .docx documents that contain textual information.
The objective is to extract instructions from these documents and organize them into a structured dataset suitable for fine-tuning the LLM. This involves parsing .pdf and .docx files, extracting relevant text segments, and annotating them.
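One possible way to organize the extracted text is sketched below, using only the standard library. It splits text into paragraph-based chunks and writes one JSON record per line (JSONL). The `instruction` / `input` / `output` field names follow a common instruction-tuning layout and are an assumption here, not a requirement of any particular trainer; `split_into_chunks`, `make_record`, and `write_jsonl` are hypothetical helpers, and the annotation fields are left empty to be filled in manually or by a later step.

```python
import json
import re


def split_into_chunks(text, max_chars=1000):
    """Group paragraphs into chunks of roughly max_chars characters.

    A single paragraph longer than max_chars is kept whole.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk if adding this paragraph would exceed the limit.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def make_record(chunk, source, instruction="", output=""):
    """Build one dataset row; instruction and output are filled in during annotation."""
    return {
        "instruction": instruction,
        "input": chunk,
        "output": output,
        "source": source,  # keeps track of which document the chunk came from
    }


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

For example, `write_jsonl([make_record(c, "report.pdf") for c in split_into_chunks(text)], "dataset.jsonl")` would produce a file that can then be annotated; `report.pdf` and `dataset.jsonl` are placeholder names.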
I'm seeking guidance and recommendations from the community on how to efficiently create this dataset. Specifically, I'm interested in:
Any advice, best practices, or resources you can provide to assist in this endeavor would be greatly appreciated. Thank you for your support!