marianna13 / doc2dataset

A tool to extract text (and images) from documents (like PDFs)
MIT License

doc2dataset

Easily extract text (and images) from a collection of PDF files while preserving the original text formatting.

Install

pip install git+https://github.com/marianna13/doc2dataset.git

Python examples

Check out these examples of using doc2dataset:

API

This module exposes a single function, pdf_extractor, which takes the same arguments as the command-line tool.
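The argument list is not reproduced here, so one way to check it against your installed version is to inspect the function at runtime. The following is a minimal sketch, not part of doc2dataset itself: the helper name describe_pdf_extractor is hypothetical, and it assumes the package is importable under the top-level name doc2dataset and exposes pdf_extractor there. It returns None instead of failing when either assumption does not hold.

```python
import importlib
import inspect


def describe_pdf_extractor(module_name="doc2dataset", func_name="pdf_extractor"):
    """Return '<func_name>(<signature>)' as a string, or None if unavailable.

    Both default names are assumptions taken from the README text;
    nothing here relies on any particular argument of pdf_extractor.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Package (or one of its dependencies) is not installed.
        return None

    func = getattr(module, func_name, None)
    if not callable(func):
        # The module does not expose a callable under that name.
        return None

    try:
        return f"{func_name}{inspect.signature(func)}"
    except (TypeError, ValueError):
        # Some callables do not provide an introspectable signature.
        return None


if __name__ == "__main__":
    signature = describe_pdf_extractor()
    print(signature or "doc2dataset.pdf_extractor could not be inspected here")
```

Running this after installation prints the parameters your version actually accepts, which you can then compare with the command-line tool's options.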

Output examples

sample_output.md

For development

Set up a virtualenv:

python3 -m venv .env
source .env/bin/activate
pip install -e .

To run the tests, first install the test dependencies:

pip install -r requirements-test.txt

Then run the linter and the tests:

make lint
make test

You can use make black to reformat the code.