mlabonne / llm-datasets

High-quality datasets, tools, and concepts for LLM fine-tuning.

How to create an instruction dataset from .pdf and .docx documents #3

Open Vonewman opened 4 months ago

Vonewman commented 4 months ago

Hello, I'm in the process of fine-tuning a Large Language Model (LLM) for an NGO, and I need to construct an instruction dataset from .pdf and .docx documents containing textual information.

The objective is to extract instructions from these documents and organize them into a structured dataset suitable for fine-tuning the LLM. This involves parsing .pdf and .docx files, extracting relevant text segments, and annotating them.

I'm seeking guidance and recommendations from the community on how to efficiently create this dataset. Specifically, I'm interested in:

  1. Techniques and libraries for parsing .pdf and .docx documents in Python.
  2. Strategies for extracting instructional content from the parsed documents while maintaining context and fidelity.
  3. Approaches for annotating the extracted text segments as instructional content, including identifying key actions, steps, and contextual information.

Any advice, best practices, or resources you can share would be greatly appreciated. Thank you for your support!

ParagEkbote commented 3 weeks ago

You could use the following libraries for parsing PDF and DOCX documents:

1. PyPDF2: a pure-Python library built as a PDF toolkit. It can extract text, metadata, and other information from PDF files (note that PyPDF2 has since been merged back into its successor, pypdf, which is the maintained project).
2. pdfplumber: a Python library that makes it easy to extract text and other information from PDF files.
3. docx2txt: a Python library that extracts the text from a .docx file.
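To make the pipeline concrete, here is a minimal stdlib-only sketch of the .docx half of the task. A .docx file is just a ZIP archive whose main text lives in `word/document.xml` (this is essentially what docx2txt reads; pdfplumber plays the analogous role for PDFs). The sample document, the regex-based run extraction, and the Alpaca-style instruction/input/output record template are all illustrative assumptions, not a prescribed format:

```python
# Sketch: extract paragraph text from a .docx and wrap it into
# instruction-dataset records. Stdlib only, so it is self-contained;
# in a real project you would use docx2txt / python-docx / pdfplumber
# and a proper XML parser instead of a regex.
import io
import json
import re
import zipfile

# Hypothetical WordprocessingML content standing in for a real NGO document.
DOCX_XML = (
    '<?xml version="1.0"?>'
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body>'
    '<w:p><w:r><w:t>Step 1: Register the beneficiary.</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Step 2: Verify the submitted documents.</w:t></w:r></w:p>'
    '</w:body></w:document>'
)

def build_sample_docx() -> bytes:
    """Assemble an in-memory .docx (ZIP archive) so the sketch runs as-is."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", DOCX_XML)
    return buf.getvalue()

def extract_docx_text(data: bytes) -> list[str]:
    """Pull the text of each <w:t> run out of word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        xml = zf.read("word/document.xml").decode("utf-8")
    return re.findall(r"<w:t[^>]*>(.*?)</w:t>", xml)

def to_instruction_records(paragraphs: list[str]) -> list[dict]:
    """Wrap each extracted paragraph in an Alpaca-style record; the
    instruction text and the empty output (to be annotated) are assumptions."""
    return [
        {"instruction": "Summarize the following procedure step.",
         "input": p,
         "output": ""}
        for p in paragraphs
    ]

paragraphs = extract_docx_text(build_sample_docx())
records = to_instruction_records(paragraphs)
# One JSON object per line (JSONL) is a common on-disk format for such datasets.
for record in records:
    print(json.dumps(record))
```

The annotation step (filling in `output` and refining `instruction`) is where the real work lies; a common approach is to draft those fields with a strong existing LLM and then review them manually.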