mlabonne / llm-datasets

High-quality datasets, tools, and concepts for LLM fine-tuning.

How to create an instruction dataset from .pdf and .docx documents #3

Open Vonewman opened 4 months ago

Vonewman commented 4 months ago

Hello, I'm in the process of fine-tuning a Large Language Model (LLM) for an NGO, and I need to construct an instruction dataset from .pdf and .docx documents containing textual information.

The objective is to extract instructions from these documents and organize them into a structured dataset suitable for fine-tuning the LLM. This involves parsing .pdf and .docx files, extracting relevant text segments, and annotating them.

I'm seeking guidance and recommendations from the community on how to efficiently create this dataset. Specifically, I'm interested in:

  1. Techniques and libraries for parsing .pdf and .docx documents in Python.
  2. Strategies for extracting instructional content from the parsed documents while maintaining context and fidelity.
  3. Approaches for annotating the extracted text segments as instructional content, including identifying key actions, steps, and contextual information.

Any advice, best practices, or resources you can share would be greatly appreciated. Thank you for your support!

ParagEkbote commented 3 weeks ago

You could use the following libraries for parsing PDF and DOCX documents:

1. PyPDF2: a pure-Python library built as a PDF toolkit. It can extract text, metadata, and other information from PDF files (note that PyPDF2 has since been merged back into its successor, pypdf, which is the maintained project).
2. pdfplumber: a Python library that makes it easy to extract text and other information from PDF files.
3. docx2txt: a Python library that extracts the text from a .docx file.
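To make the pipeline concrete, here is a minimal stdlib-only sketch of the .docx half of the task. A .docx file is just a ZIP archive whose main text lives in `word/document.xml` (this is essentially what docx2txt reads; pdfplumber plays the analogous role for PDFs). The sample document, the regex-based run extraction, and the Alpaca-style instruction/input/output record template are all illustrative assumptions, not a prescribed format:

```python
# Sketch: extract paragraph text from a .docx and wrap it into
# instruction-dataset records. Stdlib only, so it is self-contained;
# in a real project you would use docx2txt / python-docx / pdfplumber
# and a proper XML parser instead of a regex.
import io
import json
import re
import zipfile

# Hypothetical WordprocessingML content standing in for a real NGO document.
DOCX_XML = (
    '<?xml version="1.0"?>'
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body>'
    '<w:p><w:r><w:t>Step 1: Register the beneficiary.</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Step 2: Verify the submitted documents.</w:t></w:r></w:p>'
    '</w:body></w:document>'
)

def build_sample_docx() -> bytes:
    """Assemble an in-memory .docx (ZIP archive) so the sketch runs as-is."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", DOCX_XML)
    return buf.getvalue()

def extract_docx_text(data: bytes) -> list[str]:
    """Pull the text of each <w:t> run out of word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        xml = zf.read("word/document.xml").decode("utf-8")
    return re.findall(r"<w:t[^>]*>(.*?)</w:t>", xml)

def to_instruction_records(paragraphs: list[str]) -> list[dict]:
    """Wrap each extracted paragraph in an Alpaca-style record; the
    instruction text and the empty output (to be annotated) are assumptions."""
    return [
        {"instruction": "Summarize the following procedure step.",
         "input": p,
         "output": ""}
        for p in paragraphs
    ]

paragraphs = extract_docx_text(build_sample_docx())
records = to_instruction_records(paragraphs)
# One JSON object per line (JSONL) is a common on-disk format for such datasets.
for record in records:
    print(json.dumps(record))
```

The annotation step (filling in `output` and refining `instruction`) is where the real work lies; a common approach is to draft those fields with a strong existing LLM and then review them manually.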