jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0
5.37k stars, 355 forks

Find and delete duplicates #45

Closed. bavarialogy closed this issue 3 years ago

bavarialogy commented 3 years ago

Hi (again),

One cool feature would be to compare document contents and identify duplicates, maybe upon document consumption and/or in a separate workflow.

Once again, great kudos for the work, keep that up!

jonaswinkler commented 3 years ago

Duplicates are identified at consumption time. Only exact duplicates are rejected. This prevents you from accidentally consuming a document twice.

Identifying duplicates with text similarity algorithms is certainly possible. However, I've got quite a few documents in my store that are ALMOST the same, except for maybe one or two numbers, yet they are clearly two different documents. Finding a good threshold for identifying duplicates that works for everyone is sadly near impossible :(
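To illustrate the threshold problem described above, here is a minimal sketch using Python's standard `difflib`. It is not anything paperless-ng does; the sample texts, the `similarity` helper, and the `THRESHOLD` value are all hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two texts."""
    return SequenceMatcher(None, a, b).ratio()


# Two distinct invoices whose text differs in a single digit (made-up data).
invoice_a = "Invoice 1041, total due 120.00 EUR"
invoice_b = "Invoice 1042, total due 120.00 EUR"

# Hypothetical cut-off: anything at or above it is treated as a duplicate.
THRESHOLD = 0.9

score = similarity(invoice_a, invoice_b)
is_flagged = score >= THRESHOLD

# The two invoices are different documents, yet they score above the
# threshold and would be wrongly flagged as duplicates.
print(f"similarity={score:.3f}, flagged as duplicate: {is_flagged}")
```

Raising the threshold avoids this false positive but starts to miss rescans of the same page with slightly different OCR output, which is why a single value is hard to pick for every document store.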

bavarialogy commented 3 years ago

All right, thank you, Jonas. I'm fine with having exact duplicates rejected at consumption time for now, as it shouldn't happen anymore with new documents in the future. However, it seems that paperless-ng consumed one document several times (it was present in about 6-7 different files). I'll just stick to deleting those duplicates manually; it's simply easier for a human to spot them. I'm just thinking ahead to adding aaaaall my PDFs, and I'm sure there will be a lot of duplicates in there.

luiscachog commented 3 years ago

Hello!

The duplicate detection is not working in my setup. I uploaded the same file under a different name, so it is not an exact duplicate, but the MD5 checksum is the same. Should the identification perhaps include an MD5 checksum verification?

Thanks!

jonaswinkler commented 3 years ago

@luiscachog

Paperless does not add files again if another file already exists in paperless with the same MD5 checksum. Filenames do not matter.
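As a rough sketch of the idea (not the actual paperless-ng code), a checksum-based check depends only on the file's bytes, never on its name. The helper names `md5_checksum` and `is_duplicate` and the in-memory set of known checksums are assumptions made for this example.

```python
import hashlib


def md5_checksum(data: bytes) -> str:
    """Return the hex MD5 digest of a file's raw bytes."""
    return hashlib.md5(data).hexdigest()


def is_duplicate(data: bytes, known_checksums: set) -> bool:
    """True if a file with identical content has already been stored."""
    return md5_checksum(data) in known_checksums


# Made-up file contents standing in for real documents.
original = b"%PDF-1.4 example document body"
renamed_copy = b"%PDF-1.4 example document body"   # same bytes, other filename
rescanned = b"%PDF-1.4 example document body "     # one extra byte

known = {md5_checksum(original)}

print(is_duplicate(renamed_copy, known))  # identical content is rejected
print(is_duplicate(rescanned, known))     # any byte difference passes the check
```

This also shows why two files that look the same to a human can both be consumed: a single differing byte, such as changed metadata or a new scan, produces a different checksum.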

The errors aren't communicated very well yet; that will get fixed in the next version.

luiscachog commented 3 years ago

Got it, thanks for the clarification. I will check for minimal differences between my two documents. I thought they were the same.