jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0
5.37k stars, 355 forks

Split single PDF into multiple ones using a delimiter page #317

Closed denvercoder21 closed 3 years ago

denvercoder21 commented 3 years ago

I was in the middle of implementing a tool for archiving and auto-uploading my documents when I stumbled across paperless, which is much more powerful and mature than my little pet project.

One thing my tool does that paperless doesn't: I can feed a stack of paper consisting of multiple documents (up to my device's 50-page limit) to my scanner, which uploads the whole stack to the inbox folder of my server as a single file. My server application then searches the pages of that file for pages containing a certain QR code. The positions of those pages are used to cut the large file back into separate documents, and the QR pages themselves are dropped.

Example: (numbers are page numbers)

1 doc 1
2 doc 1
3 doc 1
4 QR page
5 doc 2
6 QR page
7 doc 3
8 doc 3

Will be cut into:

1 doc 1
2 doc 1
3 doc 1
1 doc 2
1 doc 3
2 doc 3
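
The cutting rule from the example can be sketched in a few lines of Python. This is only an illustration of the rule, not the code from split.py: `split_at_delimiters` and `is_delimiter` are made-up names, and pages are represented by plain page numbers instead of real PDF pages.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def split_at_delimiters(
    pages: Sequence[T], is_delimiter: Callable[[T], bool]
) -> List[List[T]]:
    """Group pages into documents, dropping every delimiter page."""
    documents: List[List[T]] = []
    current: List[T] = []
    for page in pages:
        if is_delimiter(page):
            # A delimiter closes the current document and is discarded.
            if current:
                documents.append(current)
            current = []
        else:
            current.append(page)
    # Pages after the last delimiter form the final document.
    if current:
        documents.append(current)
    return documents


# The example above: pages 4 and 6 are QR pages.
qr_pages = {4, 6}
print(split_at_delimiters([1, 2, 3, 4, 5, 6, 7, 8], lambda p: p in qr_pages))
# [[1, 2, 3], [5], [7, 8]]
```

In a real tool, `is_delimiter` would decode the page and compare the QR content. Two delimiter pages in a row simply produce no empty document in this sketch.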

Support for this by paperless would be cool! My code is here: https://github.com/denvercoder21/split-pdf/blob/main/split.py

I haven't taken the time yet to look through paperless' code, so I can't tell yet whether I'm confident creating a PR myself.

Let me know what you think!

jonaswinkler commented 3 years ago

You scan the same QR code page in between documents, right? Sounds useful to me.

However, it's pretty hard to get that functionality into paperless. Let me elaborate on how the consumption pipeline works real quick:

The issue is as follows: The only place where we're sure we're dealing with PDF documents (and not text files / office documents) is inside the PDF parser. However, at that place, we're limited to producing exactly one document. Changing that requires many changes to how the consumption pipeline works, invalidates many test cases, etc. The key file is documents/consumer.py, and the method is try_consume_file.

Adding that to the consumption folder watcher (management/document_consumer.py, I need to rename that) is possible, but sounds like a rather special feature.


I've got a better idea:

  1. Make this a stand-alone script that continuously watches a specified folder (just as paperless does). It checks each file for its type and, if it's a PDF, performs the slicing operations and moves the resulting files into the consumption folder of paperless. If it's not a PDF, it simply moves the file into the consumption directory unchanged.

  2. Make a docker image, so that people can add that easily to their compose files like this:

services:

  webserver:
    image: jonaswinkler/paperless-ng:latest
    ...
    volumes:
      - data:/usr/src/paperless/data
      - media:/usr/src/paperless/media
      - consume-internal:/usr/src/paperless/consume

  qrcode-splitter:
    image: someone/paperless-qrcode-splitter
    restart: unless-stopped
    volumes:
      - consume-internal:/output
      - ./consume:/input
    environment:
      QR_MAGIC_CONTENT: The decoded content of the qr code used to split documents

volumes:
  consume-internal:
  data:
  media:

The key here is that both containers would "communicate" through that internal consumption folder.

I won't be building that myself, but I can give some directions and hints if someone wants to take a stab at it.
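
As a starting point, the stand-alone script from step 1 could look roughly like the sketch below. It is a minimal polling loop under a few assumptions: the names `is_pdf`, `process_once` and `watch` are made up for illustration, PDFs are recognised by their `%PDF-` header bytes, and the actual slicing is delegated to a hypothetical `split_pdf(pdf_path, output_dir)` callable that writes the pieces into the output folder and returns their paths.

```python
import shutil
import time
from pathlib import Path
from typing import Callable, Iterable, List

# Hypothetical splitter signature: (pdf_path, output_dir) -> paths written.
Splitter = Callable[[Path, Path], Iterable[Path]]


def is_pdf(path: Path) -> bool:
    """Detect PDFs by their magic bytes instead of trusting the extension."""
    with path.open("rb") as fh:
        return fh.read(5) == b"%PDF-"


def process_once(input_dir: Path, output_dir: Path, split_pdf: Splitter) -> List[Path]:
    """Handle every file currently in input_dir; return the files produced."""
    produced: List[Path] = []
    for path in sorted(input_dir.iterdir()):
        if not path.is_file():
            continue
        if is_pdf(path):
            # Slice the PDF into the output folder, then drop the original.
            produced.extend(split_pdf(path, output_dir))
            path.unlink()
        else:
            # Anything else is passed through to paperless unchanged.
            target = output_dir / path.name
            shutil.move(str(path), str(target))
            produced.append(target)
    return produced


def watch(input_dir: Path, output_dir: Path, split_pdf: Splitter,
          interval: float = 5.0) -> None:
    """Poll input_dir forever, processing new files every `interval` seconds."""
    while True:
        process_once(input_dir, output_dir, split_pdf)
        time.sleep(interval)
```

With the compose file above, the container's entry point would call something like `watch(Path("/input"), Path("/output"), split_pdf)`. The sketch does not check whether a file is still being written by the scanner, so a real version would need to wait until a file has stopped changing before touching it.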

mandomal commented 3 years ago

What a coincidence. I wrote some code to do exactly this today. However, my delimiter is just a blank page. So far it works fine, but my code needs some improvement. I don't want to make this an ongoing project for myself, so once I've made it usable for my own needs, I'll upload and link the project for anyone to fork and improve upon.
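
For anyone curious how a blank-page delimiter could be detected, here is one possible sketch. It is not the code mentioned above, which has not been published yet. It assumes the page has already been rendered to 8-bit grayscale pixel values (0 = black, 255 = white) by some other tool, and the name `is_blank_page` and both thresholds are arbitrary choices for illustration.

```python
from typing import Sequence


def is_blank_page(
    gray_pixels: Sequence[int],
    dark_threshold: int = 200,
    max_dark_fraction: float = 0.005,
) -> bool:
    """Treat a page as blank if almost none of its pixels are dark.

    gray_pixels: flat sequence of 8-bit grayscale values for one page.
    dark_threshold: values below this count as "ink" (assumed value).
    max_dark_fraction: tolerated share of ink pixels, which allows for
        scanner noise and dust (assumed value).
    """
    if not gray_pixels:
        return True
    dark = sum(1 for value in gray_pixels if value < dark_threshold)
    return dark / len(gray_pixels) <= max_dark_fraction
```

The tolerance matters in practice because scanned blank pages are rarely perfectly white; both thresholds would need tuning against real scans.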