retorquere / zotero-better-bibtex

Make Zotero effective for us LaTeX holdouts
https://retorque.re/zotero-better-bibtex/
MIT License

Identify similar / same authors before exporting #1067

Closed adam-ah closed 5 years ago

adam-ah commented 5 years ago

Bug classification

This is an enhancement request.

Non-export problems with BBT

An issue I keep running into when dealing with a large number of references in the same area is that the same author may be imported under slightly different names on different references. Let's say these names, across 3 papers, all point to the same author:

Dewe, Philip
Dewe, Philip J.
Dewe, P. J.

The problem is that when I generate my references using biblatex, this causes weird artefacts to show up both in the text and in the reference list, because the generator tries to signal that these are different people.

Sometimes it's easy to spot because a square bracket [] pair gets added in the references list, sometimes it is hard because the in-text citation will have a prefix.

The solution is to manually fix the entries, which is fine.

Would it be easy to identify these entries on the Better BibTeX side? Something like finding duplicates in Zotero, so the user can go through the list and fix the references before exporting / using them?

Thanks :)

retorquere commented 5 years ago

This would be pretty hard for BBT to do. This should really be addressed in Zotero itself - this affects all referencing done with Zotero, not just bibtex.

adam-ah commented 5 years ago

I'm unsure why this was closed so readily. BBT does have access to all the data required to check this, and the algorithm needed to simply flag that these authors might be the same across these papers is really not that complicated.

As it is more of an export issue than a reference-management issue, I'm not convinced that it must be developed on the Zotero side at all. In fact, Zotero and its core plugins (e.g. https://github.com/zotero/translators/issues/1667) don't seem to address these kinds of export issues as actively as BBT does.

retorquere commented 5 years ago

It was closed readily because I think it's outside BBT's scope. I did not mean to come across as flippant about this, but it's just something that I think Zotero should fix (and in the issue you created on the Zotero repo, Dan agrees with this). On top of that, it is actually a more complicated problem for BBT than you might think.

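var a = 'caf\u00E9';        // precomposed: é is a single code point (U+00E9)
var b = 'cafe\u0301';       // decomposed: e followed by a combining acute accent (U+0301)
alert(a + ' ' + a.length);  // café 4
alert(b + ' ' + b.length);  // café 5
alert(a == b);              // false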
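
For this particular case, the two spellings can be made to compare equal by normalizing both strings to the same Unicode form first. The sketch below uses the standard String.prototype.normalize method. The helper name sameAfterNormalization is made up for illustration, and this only covers canonically equivalent strings, not the other ways names can differ.

```javascript
// Same two strings as above: precomposed vs. decomposed "café".
const precomposed = 'caf\u00E9'
const decomposed = 'cafe\u0301'

// Hypothetical helper: compare two strings after bringing both to NFC.
function sameAfterNormalization(x, y) {
  return x.normalize('NFC') === y.normalize('NFC')
}

console.log(precomposed === decomposed)                      // false
console.log(sameAfterNormalization(precomposed, decomposed)) // true
```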

adam-ah commented 5 years ago

Thanks for the detailed feedback, I generally agree with what you are saying, with a few exceptions perhaps:

Thanks!

retorquere commented 5 years ago

Even with a prefix-only check (which again only makes sense for "western" names), I'm still looking at a half-cartesian product. Anyhow, the major problem is that I'd have to disable the cache, regardless of how fast or slow the comparison would be, and disabling the cache would be pretty disastrous.
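
To make the "half-cartesian product" concrete, here is a rough sketch of what a prefix-only check could look like. It compares every unordered pair of authors once, so n authors cost n(n-1)/2 comparisons. Everything in it is an assumption for illustration: the { family, given } shape, the helper names, and the matching rule (same family name, and each given-name token is a prefix of its counterpart). It is not how BBT or Zotero store or compare names, and "Smith, Jane" is a made-up entry.

```javascript
// Hypothetical helper: split a given name into lowercase tokens, dropping periods.
function tokens(given) {
  return given.toLowerCase().replace(/\./g, ' ').split(/\s+/).filter(Boolean)
}

// Two given names are "compatible" if, position by position, one token is a
// prefix of the other ("P" vs "Philip", "J" vs "J").
function givenNamesCompatible(a, b) {
  const ta = tokens(a)
  const tb = tokens(b)
  const n = Math.min(ta.length, tb.length)
  for (let i = 0; i < n; i++) {
    if (!ta[i].startsWith(tb[i]) && !tb[i].startsWith(ta[i])) return false
  }
  return true
}

// Compare each unordered pair exactly once (j starts at i + 1).
function findSuspectPairs(authors) {
  const pairs = []
  let comparisons = 0
  for (let i = 0; i < authors.length; i++) {
    for (let j = i + 1; j < authors.length; j++) {
      comparisons++
      const x = authors[i]
      const y = authors[j]
      if (x.family.toLowerCase() !== y.family.toLowerCase()) continue
      if (x.given === y.given) continue // identical spelling, nothing to flag
      if (givenNamesCompatible(x.given, y.given)) pairs.push([x, y])
    }
  }
  return { pairs, comparisons }
}

const authors = [
  { family: 'Dewe', given: 'Philip' },
  { family: 'Dewe', given: 'Philip J.' },
  { family: 'Dewe', given: 'P. J.' },
  { family: 'Smith', given: 'Jane' }, // made-up entry that should not match
]

const result = findSuspectPairs(authors)
console.log(result.comparisons)  // 6 comparisons for 4 authors
console.log(result.pairs.length) // 3 suspect pairs, all among the Dewe variants
```

The pairwise loop is quadratic in the number of distinct names, which is the cost being described above. The sketch also says nothing about the cache problem, which is independent of how fast the comparison runs.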

It looks to me like Zotero is in a bit of a feature freeze as they focus on some major projects (like GDocs) and prep for the move to Electron, but not everything that is slow to get fixed in Zotero can fall on BBT just because BBT moves (sometimes recklessly) fast. I do think it's better handled with a separate plugin. I've stripped down a minimal plugin I made and put it here, which would be a decent start I think. The best way to go about it (read: doesn't involve UI work, I hate UI work) would be to patch the Zotero report facility to include a warning about the names.

I regularly bump into the problem and it's an annoyance. The results are not actually wrong per se, but it's sloppy, and sometimes that's all it takes to get points docked or rejected. So yeah, I'd like to see this fixed, I just think BBT is not the right place to do it.

retorquere commented 5 years ago

Perhaps even a bare translator like https://github.com/retorquere/zotero-file-hierarchy would be enough. No monkey-patching the report facility, lots easier to do, could export html or csv or somesuch to outline the problematic names.

retorquere commented 5 years ago

converting to ascii and

Converting to ASCII is not a simple problem BTW. You run into the same normalization problems, and then there's Japanese, which has an ASCIIfication scheme but it's insanely complicated. I use libraries to do this (kuroshiro for Japanese and transliteration for anything else). kuroshiro is async though and cannot be used in export translators.
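
As an illustration of why the naive route falls short, here is a sketch of the common "decompose, then drop combining marks" trick. The function name stripDiacritics is made up, and this is not what the libraries mentioned above do. It only handles letters that decompose into a base letter plus combining marks, and leaves everything else untouched.

```javascript
// Hypothetical helper: decompose to NFD, then remove combining diacritical marks
// in the U+0300..U+036F block.
function stripDiacritics(s) {
  return s.normalize('NFD').replace(/[\u0300-\u036f]/g, '')
}

console.log(stripDiacritics('caf\u00E9'))    // "cafe", from the precomposed form
console.log(stripDiacritics('cafe\u0301'))   // "cafe", from the decomposed form
console.log(stripDiacritics('S\u00F8ren'))   // unchanged: ø has no canonical decomposition
console.log(stripDiacritics('\u6751\u4E0A')) // unchanged: Japanese needs real transliteration
```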

If you want to continue the discussion, we could either do it on gitter or on the Zotero forums. This issue is not the right place.

adam-ah commented 5 years ago

FYI https://github.com/adam-ah/zotero-detect-duplicate-authors

retorquere commented 5 years ago

Cool!

github-actions[bot] commented 3 years ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.