jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0
5.38k stars 352 forks

[Feature Request] Split and merge documents #335

Open ghost opened 3 years ago

ghost commented 3 years ago

As a user, I would like to merge different scans into one document.

Example: I scan the front and back of an ID card, and they are uploaded to paperless as separate documents. I would like to merge the two documents into one.

jonaswinkler commented 3 years ago

Hi, welcome to GitHub!

As of right now, I don't have any plans to support editing PDF documents. If you really need that, it might be worth giving Papermerge a shot, they do have some editing tools over there. Although I don't know if they support document merging specifically.

This would also be a very big change, since

Not going to happen (anytime soon).

shamoon commented 3 years ago

Agree, feels kinda out of the scope of this app IMHO, and so many tools can do this, even native PDF / image viewers...

Philmo67 commented 3 years ago

Just curious: what tools are you using for merging/splitting/rotating PDF documents? I tried PDFsam Basic, pdfarranger, and even NAPS2 directly after scanning, but I don't find these tools user-friendly enough for these tasks.

jonaswinkler commented 3 years ago

PDFArranger gets the job done, and has everything I need. Apart from that, I usually use gscan2pdf for scanning, and discard unwanted pages with that, or scan multiple pages into a single document. No further editing needed, usually.

Matthias84 commented 3 years ago

Same request here. I feed scanned TIFFs from an ADF document scanner. Of course I could do manual preprocessing outside of paperless, but for non-technical users it might be interesting to have a preprocessing inbox where you could merge, remove, reorder, and rotate pages before they are fully processed by paperless-ng. But yes, I see the point that it's a lot of work to support this for all the different (multi-page) formats ... :thinking:

Zocker1999NET commented 3 years ago

In #426 I had a similar idea about how to implement this (I did not find this existing issue at the time):

How could this be implemented on the UI:

  1. Select the documents you want: Screenshot_20210124_134659
  2. Click on a "Combine" button

What happens in the background:

  1. Combine the original documents (not the archived versions!), for example using ImageMagick: convert "$@" pdf:-
  2. Delete all old entries of the selected documents
  3. Reprocess the new document as if it had simply been placed into the consume directory

Known issues with this implementation:

  • Paperless may currently have no way to keep the original source files, so they may be lost. Possible workaround: before combining the originals into a PDF document, pack them into a zip/tar archive, store that as the "original document", and enable paperless to work with zip/tar archives if possible.
  • This will most likely not support formats that ImageMagick does not support, such as Office documents, but it should be able to combine JPEGs/PNGs/PDFs/TIFFs. Possible workaround: before combining them with ImageMagick, convert each file that ImageMagick does not support to a PDF, reusing the existing conversion strategies.

The TAR/ZIP approach would prevent paperless-ng from losing the old original documents while, as I see it, making this "in theory relatively simple" (I haven't seen the code of this project yet) to support for all existing documents.
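
A minimal Python sketch of the combine step, only to illustrate the `convert "$@" pdf:-` idea. It is not code from paperless-ng: the function names are made up, it assumes ImageMagick's `convert` binary is on the PATH (ImageMagick 7 installs it as `magick`), and it only works for formats ImageMagick can read.

```python
import subprocess
from typing import List, Sequence


def build_convert_command(paths: Sequence[str], binary: str = "convert") -> List[str]:
    """Build the ImageMagick call that writes all inputs as one PDF to stdout."""
    if not paths:
        raise ValueError("need at least one input file")
    # "pdf:-" means: output format PDF, written to standard output.
    return [binary, *paths, "pdf:-"]


def combine_to_pdf(paths: Sequence[str], binary: str = "convert") -> bytes:
    """Run ImageMagick and return the combined PDF as bytes."""
    result = subprocess.run(
        build_convert_command(paths, binary),
        check=True,           # raise CalledProcessError if ImageMagick fails
        capture_output=True,  # collect the PDF from stdout
    )
    return result.stdout
```

The returned bytes could then be written into the consume directory, which matches step 3 above.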

jhass commented 3 years ago

I wonder where the need for merging generally comes from.

For me it's because my printer's scan to mail function can't put more than 4-5 pages into a single document. I wonder if it's comparable for most of you. If so, we might not need (much) UI for this or even change much about the one document == one file paradigm. Instead we could allow defining rules for merging at consumption time.

For mail this is "easy", we could have a mail rule to merge all attachments in a mail into one document for example (convert to PDF if needed, sort by name, then merge PDFs).

For manual upload it could be a checkbox or a separate button "upload selected files as one".

API wise that could be a new endpoint, "upload multiple as one", to be integrated into any frontends that want to support this.

The most tricky bit is the folder drop, since there's no rule system for that yet, as far as I noticed. One could imagine something like the mail filtering system, though, based on filenames: match all files with a pattern, then perform an action on that set of files. The only action at first would be merge, of course.

So, to summarize I wonder if "merge at consumption" solves the needs of most people here already.
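
To make the folder-drop idea concrete, here is a rough sketch of "match all files with pattern, do action on that set of files". The naming scheme (`<batch>_<number>.pdf`) and the function name are assumptions invented for this example; nothing like this exists in paperless-ng.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical naming scheme: "<batch>_<part number>.pdf", e.g. "invoice_2.pdf".
PART_PATTERN = re.compile(r"^(?P<batch>.+)_(?P<part>\d+)\.pdf$", re.IGNORECASE)


def group_for_merge(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Group matching files by batch name, ordered by part number."""
    groups: Dict[str, list] = defaultdict(list)
    for name in filenames:
        match = PART_PATTERN.match(name)
        if match is None:
            continue  # not part of a batch; would be consumed as a normal document
        groups[match["batch"]].append((int(match["part"]), name))
    # Sort numerically so that part 10 comes after part 2.
    return {batch: [name for _, name in sorted(parts)] for batch, parts in groups.items()}
```

This only covers the grouping. It says nothing about how to tell when the last file of a batch has arrived.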

Zocker1999NET commented 3 years ago

@jhass Your ideas on how to implement this feature seem great.

To answer your question: my printer may have such a feature, but I'm currently scanning over 600 documents to get even my older documents onto the computer, so I decided not to use such features. I wanted to scan this many pages with a feed scanner without having to think about them first. I wanted to do the sorting and merging of documents purely digitally, and to create an index of my offline documents by sorting them by an incrementing scan ID they receive after scanning. I think this is much easier in my case.

jhass commented 3 years ago

Not too different a story here, just maybe a little less to go through, and I'm also throwing out stuff where I feel fine retaining only the digital copy, so I'm going by the recommended ASN system :) I'm just running paperless-ng on a server, so using the mail function of my feed scanner rather than some scan tool on a PC is easiest. Now my pain is that I have multiple documents for the same ASN :sweat_smile:. Otherwise I'm not too worried about it; it's indeed easy to find the "other parts" by date or title.

jonaswinkler commented 3 years ago

Chiming in here and sharing some further ideas and comments. I'm pretty busy right now and don't have all that much time except for critical stuff.

For mail this is "easy", we could have a mail rule to merge all attachments in a mail into one document for example (convert to PDF if needed, sort by name, then merge PDFs).

This would certainly be possible. However, does your scanner actually support sending multiple scanned files in one mail? Also, I'd like to have the merging logic available to all users, not just users who use the mail functionality.

The most tricky bit is the folder drop, since there's no rule system for that yet, as far as I noticed. One could imagine something like the mail filtering system, though, based on filenames: match all files with a pattern, then perform an action on that set of files. The only action at first would be merge, of course.

This is in fact the most tricky part. Once paperless detects new files in the consumption folder, it sends them to the task queue for processing immediately. How does paperless detect when the last document of a batch has arrived? I don't think there's a good solution here.

If we do this, I'd like to have the merging functionality available to everyone, and the consumption folder is still the most commonly used way to upload documents.


Therefore, some ideas on what could actually work for everyone, given the current architecture.

After that works, we could also think about adding support for selecting individual pages from the selected documents. (This is something I'd find useful as well, since I've got lots of documents with empty pages that my scanner detected as not empty)

The most critical part is making the backend work, so that should be the focus.

How does that sound? This all is also very isolated functionality and can be added without affecting anything else. If someone wants to take a stab at that, I can give some more detailed instructions on how to do it.

jhass commented 3 years ago

However, does your scanner actually support sending multiple scanned files in one mail? Also, I'd like to have the merging logic available to all users, not just users who use the mail functionality.

No, it does not. It does the opposite: if the document gets too big, it splits it into multiple PDFs inside one mail (I think it's one mail, I actually never checked :sweat_smile:).

Yes, a merging tool that just creates a new document sounds like a great idea :+1: :)

As a third alternative, I think something like a meta-document, which groups several documents in a defined order but keeps the members untouched with their own entries, could also already help with a lot of use cases and might be a little bit less effort.

jonaswinkler commented 3 years ago

and might be a little bit less effort.

We'd still need some UI to define these meta documents, which is about the same as the one for a merge tool. We also need support in the back end for that, documents now have ordered child documents? Also, many components of paperless have to take this new data structure into account (search index should not return documents that are part of a meta document, import+export, metadata matching should use the concatenated content of all documents in a meta document, ...)

Compared to building a feature that uses already existing data structures and abides by already defined contracts (and in doing so is compatible with all existing features), this is actually a lot more work.

And then there's also the test suite. Changing features requires changing associated test cases. Adding a new isolated feature just requires new test cases for that feature.

jhass commented 3 years ago

I didn't mean it to be that invasive, child documents could appear normally still and meta documents could not appear in full text search etc, child documents would "just" provide a quick link to go to the meta document they're part of.

I felt building the merge background task could potentially prove to be quite the rabbit hole 😅

jonaswinkler commented 3 years ago

I didn't mean it to be that invasive, child documents could appear normally still and meta documents could not appear in full text search etc, child documents would "just" provide a quick link to go to the meta document they're part of.

I want to do things properly :)

I felt building the merge background task could potentially prove to be quite the rabbit hole 😅

Actually not, use pikepdf to produce a new pdf document, submit that to the consumer just as we do with other new documents, and optionally delete some documents when done. The consumer will take care of the rest.

The actual merging and editing is very straightforward (https://pikepdf.readthedocs.io/en/latest/topics/pages.html).
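
For illustration, merging with pikepdf could look roughly like this. It is a sketch, not the code in paperless-ng: `merge_pdfs` and `page_indices` are hypothetical names, the input format (a path plus optional 1-based page numbers) is an assumption, and handing the result to the consumer is left out.

```python
from typing import List, Optional, Sequence, Tuple


def page_indices(page_count: int, pages: Optional[Sequence[int]] = None) -> List[int]:
    """Translate 1-based page numbers into 0-based indices (all pages if None)."""
    if pages is None:
        return list(range(page_count))
    for number in pages:
        if not 1 <= number <= page_count:
            raise ValueError(f"page {number} is outside 1..{page_count}")
    return [number - 1 for number in pages]


def merge_pdfs(
    sources: Sequence[Tuple[str, Optional[Sequence[int]]]], output_path: str
) -> None:
    """Write the selected pages of each source, in order, to output_path."""
    import pikepdf  # third-party: pip install pikepdf

    merged = pikepdf.Pdf.new()
    opened = []
    try:
        for path, pages in sources:
            source = pikepdf.Pdf.open(path)
            opened.append(source)  # sources must stay open until the save
            for index in page_indices(len(source.pages), pages):
                merged.pages.append(source.pages[index])
        merged.save(output_path)
    finally:
        for source in opened:
            source.close()
```

For example, `merge_pdfs([("front.pdf", None), ("back.pdf", [1])], "merged.pdf")` would append the first page of `back.pdf` to all pages of `front.pdf`.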

jonaswinkler commented 3 years ago

I just put a new API endpoint together, and the actual merging process on the server side is straightforward. Reordering documents, keeping only selected pages, that's all simple. The merged document will appear as a new document to paperless, with notifications and all that.

Now I need to get this implemented properly and figure out how the UI is supposed to work.

shamoon commented 3 years ago

Very cool! Didn't even realize you were actively working on this. Let me know if / when / where I can help, I have some UI ideas.

jonaswinkler commented 3 years ago

Well, I don't exactly communicate what I'm actively working on, that's true.

If you want to work out a UI for this, go for it.

I've also got some UI ideas, not sure if they align with yours, here goes.

These are just ideas. If you got better ideas, go for it, while keeping the following in mind:

The API will essentially accept an ordered list of document ids, and for each document id an optional page range. I don't have the details down yet. As long as the UI is able to provide that, we're good. It will be possible to specify the same document twice, in case you want to add pages from a document somewhere in the middle of another document.

The API will also have an option to download the resulting document as a preview without actually adding it to paperless. And some flags to optionally delete source documents on success.
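
As a sketch of what such a request could carry: the class and field names below (`MergeItem`, `MergeRequest`, `document`, `pages`, `delete_sources`, `preview`) are assumptions for illustration, since the details were not fixed at this point.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class MergeItem:
    """One entry in the ordered list: a document id and an optional page range."""

    document: int
    pages: Optional[str] = None  # e.g. "1-3"; None means the whole document


@dataclass
class MergeRequest:
    items: List[MergeItem] = field(default_factory=list)
    delete_sources: bool = False  # hypothetical flag: remove sources on success
    preview: bool = False         # hypothetical flag: return the PDF without saving

    def source_ids(self) -> List[int]:
        """Distinct source documents in first-use order (a document may repeat)."""
        return list(dict.fromkeys(item.document for item in self.items))


# Pages 1-3 of document 1, then all of document 2, then pages 4-5 of document 1.
request = MergeRequest([MergeItem(1, "1-3"), MergeItem(2), MergeItem(1, "4-5")])
```

Listing document 1 twice is what places document 2 in the middle of it.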

shamoon commented 3 years ago

Yea, that's pretty similar to what I imagined. And I agree: as this will be a not-every-day and probably even not-every-user kind of tool, the button shouldn't be too prominent or take up too much space; maybe it goes inside a menu or something. And yep, that's exactly what I was thinking: you get there from document detail or bulk edit, and it opens a modal with the UI.

As for the actual UI, definitely agree on visual drag + drop, and preview does sound cool too. And then when the user is done, do they hit "Save", and does it create a new document? And what about metadata? I'm sure we'll have to figure out lots of stuff once we dig in. Mobile might be a challenge, etc.

jonaswinkler commented 3 years ago

and it opens a modal with the UI.

Not necessarily a modal, I think this should be a full page view. You may want to go from the merge tool to the list again, and add more documents.

And then when the user is done do they hit "Save" does it create a new document?

It will create a new document. Maybe a checkbox that will cause the source files to be removed on success.

And what about metadata?

Options for either keeping info from the first document (which should be most representative for the resulting file), or run it through the matching algorithms again.

Mobile might be a challenge, etc.

It's okay to have certain functionality not available on mobile.

shamoon commented 3 years ago

Hmm, just now this makes me think about whether it will be frustrating if the actual merge UI has no way to add documents, like a "picker" of some kind. Like if you added 2 docs but realized you need a third, you'd have to go back to the list, find the other, highlight it and add it. A little odd. Then again, maybe people will mostly be merging two docs, so it's no big deal?

Just kinda asking / thinking out loud. I personally haven't needed this, so I'm trying to put myself in the mindset of a user of this.

jonaswinkler commented 3 years ago

Hmm, just now this makes me think about whether it will be frustrating if the actual merge UI has no way to add documents, like a "picker" of some kind.

The picker would essentially be a list view (with filtering) as well, or something similar, and we already have that.

shamoon commented 3 years ago

The picker would essentially be a list view (with filtering) as well, or something similar, and we already have that.

Yea.

Should be a fun challenge, LMK when I can start playing with it.

ffleischer commented 3 years ago

While you're currently working on this topic, would it make sense to directly consider some kind of Staple functionality?

UseCase:
I've got a small mobile document scanner (Doxie GO) which can scan multiple pages but only one-sided. So I need to scan the other sides separately. Their Windows companion app has this Staple functionality.

It generates from 2 documents (D1 & D2) with multiple pages (D1P1, D1P2, .... & D2P1, D2P2, ....) a new document with (D1P1, D2P1, D1P2, D2P2,.....). This mitigates the single side scan a bit.
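
The page order described here is a plain interleave. A small sketch, assuming pages are represented as (document label, page number) pairs; the `staple` name and that representation are made up for this example:

```python
from itertools import zip_longest
from typing import List, Sequence, Tuple

Page = Tuple[str, int]  # (document label, 1-based page number)


def staple(fronts: Sequence[Page], backs: Sequence[Page]) -> List[Page]:
    """Interleave two page sequences: front 1, back 1, front 2, back 2, ..."""
    missing = object()  # marks the gap when one document is shorter
    stapled: List[Page] = []
    for front, back in zip_longest(fronts, backs, fillvalue=missing):
        if front is not missing:
            stapled.append(front)
        if back is not missing:
            stapled.append(back)
    return stapled
```

If one document has fewer pages, the remaining pages of the longer one are simply appended in order.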

jonaswinkler commented 3 years ago

Hey, thanks for the input. So far I've only done the back end part (well, "done" as in it works on a selected few documents), and the implementation would certainly support this. It's really up to the front end to make this work the way you want.

The back end accepts something I called a split+merge plan, and it's essentially a data structure that describes how to create one or more target documents from one or more source documents, or specific pages of certain source documents. It's totally possible to ask it to interlace two documents.

jonaswinkler commented 3 years ago

@shamoon

Alright. If you want to try to make a good UI for this, I've got the server side ready. The front end already has a very crude layout of what I've been thinking about.

Branch is feature-merge-tool.

I'd really like the tool to be able to

Other than that, no particular constraints. Some extras that sound useful:

The server has an endpoint /api/split_merge/ that:

https://github.com/jonaswinkler/paperless-ng/blob/98b26b8e16a290200a483ff414fdc97050343b9c/src/documents/serialisers.py#L501

The API isn't set in stone yet - if you think something should change, mention me. For example - instead of accepting a string for page ranges ("1-3,7,20-25" etc), I could change that to accept an array of integers as well. Or if you need the total number of pages for any document, I could add that to the metadata as well.
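
The two representations are easy to convert between. A minimal sketch of expanding a page range string into an array of integers; `parse_page_ranges` is a hypothetical helper, not the parser used by the endpoint:

```python
from typing import List


def parse_page_ranges(spec: str) -> List[int]:
    """Expand a string such as "1-3,7,20-25" into a list of 1-based page numbers."""
    pages: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in page range {spec!r}")
        first, dash, last = part.partition("-")
        start = int(first)  # raises ValueError for non-numeric input
        end = int(last) if dash else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range {part!r}")
        pages.extend(range(start, end + 1))
    return pages
```

With this, `parse_page_ranges("1-3,7,20-25")` yields pages 1, 2, 3, 7, and 20 through 25.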

shamoon commented 3 years ago

Ok cool! Obviously a big job, I already started playing with it but going to have to find some serious time to dedicate. A couple things:

I'm sure I will have questions / need help, etc. soon but will keep you posted! =)

documents.zip

Edit: here's log output: https://gist.github.com/shamoon/6d2b2806dd98aa9ed315940b465406f7

jonaswinkler commented 3 years ago
  • At the moment it only accepts 2 documents, correct? (throws errors with > 2)

No, it's supposed to work with any number of documents.

  • If I try to merge the two documents below I get a 500 error from the API if the file-sample_150kB file is first (already have a crude drag-n-drop working 😃), if I put the other one first it works fine. Any ideas?

I'm copying PDF metadata into the merged file, and that's not working for your files. Thanks for the files.

shamoon commented 3 years ago

Oh I think it was the same issue as the other thing where there are some PDFs that are problematic, I can merge > 2 documents depending on which I choose. Hopefully the example docs help.

BTW, this supports only PDF files, right? Will need some logic to alert user of that, etc.

jonaswinkler commented 3 years ago

Comment this line to make it work for now:

https://github.com/jonaswinkler/paperless-ng/blob/98b26b8e16a290200a483ff414fdc97050343b9c/src/documents/merge.py#L161

BTW, this supports only PDF files, right? Will need some logic to alert user of that, etc.

shamoon commented 3 years ago

Ok thanks.

For the documents array in SplitMergeService, do you have a thought about the best data structure to use to incorporate pages? I think we will need to support pages (because if we don't, people will immediately ask for it 😀)

jonaswinkler commented 3 years ago

The best approach would probably be to use some inheritance, at least I'd play with that a bit:

Have that list contain items which are either:

And optionally (we can extend this later on)

The contents of the documents list is essentially what the user sees on the page. Depending on what a certain item is, display a corresponding visual element. When the user hits go, you'd construct the split merge plan based on what is in that list.

Just some ideas.

shamoon commented 3 years ago

@jonaswinkler OK, I think I'm at the point where I need another set of eyes. When you have a second, look at https://github.com/shamoon/paperless-ng/tree/feature/merge-tool-ui . Obviously, this is not done, not even close, but I want to see if you're feeling like it's on the right track. Notes:

So really just looking for some feedback / thoughts on the above and in general before I keep going down the rabbit hole...

jonaswinkler commented 3 years ago

Please make a PR, it's easier for me to work with that.

shamoon commented 3 years ago

Gotcha. Is it still helpful if I leave it as a draft?


jonaswinkler commented 3 years ago

Yes.

umoenks commented 3 years ago

Excellent that this is being worked on! 😃 If you need someone to test things from a user POV, please let me know. I am currently processing a lot of documents (I'm quite a new fan of paperless and preparing my taxes), and thus have a lot to play around with. Cheers!

jonaswinkler commented 3 years ago

Some feedback. Didn't look at the code yet.

Regarding split and merge as separate operations:

image

Regarding the layout of the document cards:

shamoon commented 3 years ago

Great, this is helpful, thanks! Sounds like it's on the right track so I'll press on!

  • Drag & Drop is very nice. I'd probably reduce the draggable area to just the thumbnail on the left.

yea I have some ideas for this

  • I'm not exactly sure what the use case for duplicating pages is. I've seen the error message in the server, but I'm not sure how to fix that.

My thought was if someone wanted to build document A (pgs 1-3) + document B + document A (pgs 4-5), they would start by adding those 2 docs to the editor, then click "duplicate" on document A and drag one copy after B (and change the pages). This button could also be renamed "Split", but I thought that would be confusing. If this button isn't there at all, the user would have to add document A again via the document chooser. I'm working off made-up use cases, so maybe the situation I'm describing is rare enough that it's not worth a "duplicate" button at all? On the other hand, especially once the cards are more visually solved, I think there will be plenty of room for this button, so it's not like there's such a cost to keeping it. We would have to figure out errors of course. What do you think?

  • I really like the page selector! Not sure how that will perform with large documents though.

Cool. It uses the same pdf viewer so should scale ok I think.

Regarding split and merge as separate operations:

  • When moving from merge to split, selection beyond the first document is lost. This is somewhat concerning.

This is the most unresolved thing at the moment. I think there are two ideas for "Split". One is like above, where you're going to use two different parts (pages) of a document in one output file. The other is if you want to split a single document into two different files. Right? I'm understanding that correctly? If so, in my mind they are kinda separate operations, which is why I went with different "modes". And presumably we'd only allow one document to be split at a time, so my plan was to add some kind of warning for when someone has more than one doc and they switch. And yea, visually I've barely touched "Split" mode, so there's lots to do there, including clear indications of where the split is, showing two previews, etc. But before that, am I misunderstanding anything here?

Regarding the layout of the document cards:

Agree, planning to spend more time on this, mostly just a working concept now!

jonaswinkler commented 3 years ago

My thought was if someone wanted to build document A (pgs 1-3) + document B + document A (pgs 4-5), they would start by adding those 2 docs to the editor, then click "duplicate" on document A and drag one copy after B (and change the pages). This button could also be renamed "Split", but I thought that would be confusing. If this button isn't there at all, the user would have to add document A again via the document chooser. I'm working off made-up use cases, so maybe the situation I'm describing is rare enough that it's not worth a "duplicate" button at all? On the other hand, especially once the cards are more visually solved, I think there will be plenty of room for this button, so it's not like there's such a cost to keeping it. We would have to figure out errors of course. What do you think?

Well, I can't think of a case where I'd want documents containing duplicate pages in paperless.

This is the most unresolved thing at the moment. I think there are two ideas for "Split". One is like above, where you're going to use two different parts (pages) of a document in one output file. The other is if you want to split a single document into two different files. Right? I'm understanding that correctly?

When I think about "split", I want to separate one document into multiple. Say I've scanned in 3 documents at once, I'd then want to say "split this document on page 3, and on page 6".
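
A small sketch of that kind of split, assuming "split on page 3" means page 3 is the last page of a part and assuming a 9-page scan; `split_after` is a hypothetical helper, not part of the API:

```python
from typing import List, Sequence, Tuple


def split_after(page_count: int, split_pages: Sequence[int]) -> List[Tuple[int, int]]:
    """Return inclusive 1-based (first, last) page ranges, one per resulting document."""
    ranges = []
    first = 1
    for last in sorted(set(split_pages)):
        if not 1 <= last < page_count:
            raise ValueError(f"cannot split after page {last} of {page_count}")
        ranges.append((first, last))
        first = last + 1
    ranges.append((first, page_count))  # whatever remains after the last split
    return ranges
```

For a 9-page scan, `split_after(9, [3, 6])` gives three documents covering pages 1-3, 4-6, and 7-9.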

If so, in my mind they are kinda separate operations

They don't have to be. With the separators as pictured above, you can do

without the need for some modes.
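
One way to read a list with separators is to cut it into target documents wherever a separator sits. The sketch below assumes the list holds plain labels with `None` as the separator; both the representation and the `build_targets` name are made up for this example.

```python
from typing import List, Optional, Sequence

# None stands for a separator card; any other value is a document or page entry.
SEPARATOR = None


def build_targets(items: Sequence[Optional[str]]) -> List[List[str]]:
    """Cut the item list at separators and drop empty target documents."""
    targets: List[List[str]] = [[]]
    for item in items:
        if item is SEPARATOR:
            targets.append([])  # start the next target document
        else:
            targets[-1].append(item)
    # Leading, trailing, or doubled separators would otherwise yield empty lists.
    return [target for target in targets if target]
```

With no separators the result is a single merged document, and with separators inside one document it is a split, so the same list can express both without separate modes. Dropping empty targets is a choice made in this sketch.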

shamoon commented 3 years ago

Ok, this gives me ideas to play with. Lots to work on, will let you know when there's more to look at!

shamoon commented 3 years ago

OK! Lots of progress, @jonaswinkler please check out the PR again when you have a moment, looking forward to feedback about current state of things.

And a video cause why not:

https://user-images.githubusercontent.com/4887959/112542343-b1476880-8d71-11eb-8768-a444bcda8633.mov

jonaswinkler commented 3 years ago

That's really good and exactly what I had in mind. How about splitting a document into three different parts?

shamoon commented 3 years ago

Yessir! (video below) Bonus: added page numbers

Glad you like it 😎 but lemme know what issues you discover...

https://user-images.githubusercontent.com/4887959/112548874-26b73700-8d7a-11eb-91b2-fc6d2dbb6b7f.mov

umoenks commented 3 years ago

Looking forward to seeing this integrated and getting my hands on it! From what I have seen from the videos: great job, @shamoon!

jonaswinkler commented 3 years ago

Alright, first of all some UI things :)

image

Looks weird.

image

Nitpick: below "document end".


Don't send empty lists when the separator is the first / last part:

image

image


Some functional things:


jonaswinkler commented 3 years ago

Also:

jonaswinkler commented 3 years ago

image

Please get the css straight. These tabs are not like the others :) Or is that intentional?

shamoon commented 3 years ago
Screen Shot 2021-04-02 at 7 24 39 AM Screen Shot 2021-04-02 at 7 25 22 AM
jonaswinkler commented 3 years ago
  • What browser are you using btw?

Chromium.

  • When the separator is the first or last item it should remove it. I can't reproduce the bug you're seeing where it's sending [], are there specific steps to recreate this? That's what lines 31-35 in split-merge.service.ts should prevent

I manually dragged a separator to the end after splitting.

  • I'll see about document chooser using the view, it's pretty basic right now

I think it's okay that way.

shamoon commented 3 years ago

Ok these small things should be addressed, I'll tackle the rest when I have some more time later and reply directly to your code review. Thanks