retorquere / zotero-storage-scanner

A Zotero plugin to remove broken and duplicate attachment links from the bibliography

What are the criteria for "#duplicate_attachments"? #12

Open z5tron opened 6 years ago

z5tron commented 6 years ago

I have lots of items that are labelled "#duplicate_attachments", but they are under different zotero_storage folders, and have different sizes, different names (the title shown in the middle panel), different physical file names and different modification times.

Is this a feature or a bug?

Thanks.

retorquere commented 6 years ago

I'm not sure what is meant by "different zotero_storage" (different profiles? different libraries? different folders?), but the logic right now is that a file is counted as a duplicate if there are two or more attachments of the same type (doc, docx, pdf, whatnot) under the same reference item.
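
The rule described above can be sketched roughly as follows. This is only an illustration, not the plugin's actual code: the function name `has_same_type_attachments` is hypothetical, and it assumes that "type" means the file extension, compared case-insensitively. The plugin itself may determine the type differently, for example from the attachment's content type.

```python
from collections import Counter
from pathlib import PurePath


def has_same_type_attachments(filenames):
    """Return True if two or more attachment names share a file extension.

    Assumption: "type" is taken to be the lower-cased file extension.
    Names without an extension (e.g. the title of a URL link) are skipped.
    """
    extensions = Counter(
        PurePath(name).suffix.lower()
        for name in filenames
        if PurePath(name).suffix
    )
    return any(count >= 2 for count in extensions.values())


# One reference item with two PDFs: flagged under this rule.
print(has_same_type_attachments(["article.pdf", "supplement.PDF"]))

# One reference item with an epub and a mobi: not flagged under this rule.
print(has_same_type_attachments(["book.epub", "book.mobi"]))
```

Under this reading, file size, file name and modification time play no part in the decision; only the number of attachments per type under one reference item matters.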

z5tron commented 6 years ago

I meant the physical folder named "zotero/storage/". That answers my question, but there is still a problem: I have a book item with three attachments in total, a "Google Books Link" (URL link), an epub and a mobi, each of a different file type. It is marked as "#duplicate_attachments".

retorquere commented 5 years ago

I'd have to look at a copy of your database to tell why that happens, I don't have an immediate explanation.

JsHuang commented 3 years ago

So, if one item has 2 or more attachments with the same file type, they will be treated as duplicates?

retorquere commented 3 years ago

Yes.

bnlawrence commented 2 years ago

Just a comment to think about: When a reference has supplementary material, I often end up with multiple PDF attachments for one reference ... would it be possible to handle this case with file size rather than file type? (This is not frequent enough to be a big deal, for me at least ... but I'm just throwing it out there in case it matters for others).

retorquere commented 2 years ago

That wouldn't really help for the cases I made this for. I often had merged duplicates where I acquired substantially similar, but not bit-for-bit equal, versions of the same article.

phirsch commented 2 years ago

To me, "#duplicate_attachments" suggests that the flagged items would contain the same attachment multiple times (in particular, my expectation given this wording was that the files would be identical, or at least have identical hashes under something like md5 or stronger). Would it be feasible to rename this tag to something more explanatory / less prone to misunderstanding, like "#multiple_attachments_of_same_type"?
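
For comparison, the stricter reading (attachments with identical hashes) could look something like the sketch below. It is a hypothetical illustration and not something the plugin does: the function name `byte_identical_groups` is made up, SHA-256 stands in for "md5 or stronger", and the attachment contents are passed in as bytes, whereas a real check would read the files from the storage folder.

```python
import hashlib
from collections import defaultdict


def byte_identical_groups(attachments):
    """Group attachment names whose contents have the same SHA-256 digest.

    `attachments` maps an attachment name to its raw bytes (an assumption
    made to keep the sketch self-contained). Only groups with two or more
    members are returned.
    """
    groups = defaultdict(list)
    for name, data in attachments.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return [sorted(names) for names in groups.values() if len(names) >= 2]


example = {
    "article.pdf": b"same bytes",
    "article (merged copy).pdf": b"same bytes",
    "supplement.pdf": b"different bytes",
}
print(byte_identical_groups(example))
```

A check like this would not flag a reference with an article PDF and a supplementary PDF, but it would also miss two versions of the same article that are substantially similar without being bit-for-bit equal.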

endolith commented 2 years ago

Yes, this tag says duplicate_attachments, but that is not accurate: they are just attachments of the same type. duplicate_attachments would mean they are byte-for-byte identical (which many are, from merging items).

retorquere commented 2 years ago

Feel free to submit a PR. Personally I'd consider it a duplicate if the article text is substantially the same.