adam-ah commented 5 years ago

Bug classification

This is an enhancement request.

Non-export problems with BBT

An issue I keep running into when dealing with a large number of references in the same area is that the same author may be imported under slightly different names on different references. Say these names, across three papers, all point to the same author:

Dewe, Philip
Dewe, Philip J.
Dewe, P. J.

The problem is that when I generate my references using biblatex, this causes odd artefacts in both the in-text citations and the reference list, because the generator tries to signal that these are different people.

Sometimes it's easy to spot because a square bracket [] pair gets added in the references list, sometimes it is hard because the in-text citation will have a prefix.

The solution is to manually fix the entries, which is fine.

Would it be easy to identify these entries on the Better BibTeX side? Something like finding duplicates in Zotero, so the user can go through the list and fix the references before exporting or using them?

Thanks :)

retorquere commented 5 years ago

This would be pretty hard for BBT to do. This should really be addressed in Zotero itself - this affects all referencing done with Zotero, not just bibtex.

adam-ah commented 5 years ago

I'm unsure why this was closed so readily. BBT does have access to all the data required to check this, and the algorithm itself is really not that complicated to simply flag that these authors might be the same across these papers.

As it is more an export issue than a reference management issue, I'm not convinced that it must be developed on the Zotero side at all. In fact, Zotero and its core plugins (e.g. https://github.com/zotero/translators/issues/1667) don't seem to address these kinds of export issues as actively as BBT does.

retorquere commented 5 years ago

It was closed readily because I think it's outside BBT's scope. I did not mean to come across as flippant about this, but it's just something that I think Zotero should fix (and in the issue you created on the Zotero repo, Dan agrees with this). On top of that, it is actually a more complicated problem for BBT than you might think.

BBT's exporters do not have access to all data. The exporters get only the subset that is being exported, so you'd get differing reports depending on what slice of your library is being exported.
I would have to disable the cache for this to work (and trust me, you want the cache) because the caching mechanism relies on any given entry being exportable (including its quality report) independently of the state of any other reference.
You may be forgetting that names can be non-ASCII, and simple string similarity algorithms like Soundex or Levenshtein don't do well under various string normalizations (never mind Japanese names, where these algorithms don't make much sense anyway):

var a = 'caf\u00E9';         // precomposed: é is a single code point
var b = 'cafe\u0301';        // decomposed: e plus a combining acute accent
alert(a + ' ' + a.length);   // café 4
alert(b + ' ' + b.length);   // café 5
alert(a == b);               // false

Even if the check were simple, I'd have to do a full Cartesian-product comparison. That is going to be performance-hungry, on top of the issue that I'd have to disable the cache. BBT is already slow as it is. That wouldn't be a disqualifying problem, but...
This is not an export issue. If your library says you have Dewe, Philip and Dewe, Philip J., BBT faithfully exports these. If you put the same references through the Zotero bibliography engine, you will see the same problems. It is absolutely a reference management issue, and one that is in no way specific to Bib(La)TeX. Every Zotero user is potentially affected by problems like the one you outline regardless of whether they use BibTeX. You mention yourself that it would be much like reference-duplicate checking, which is reference management, not export.
Zotero does not move as fast as BBT for very good reasons. My userbase is reasonably small, and the domain I move in fairly limited, so if I screw something up, the fallout is usually contained and easier to fix. Zotero has a much larger userbase so it can't afford the "move fast and break things" mentality that I can. BBT's purpose is also not to fix deficiencies in Zotero -- the purpose is to faithfully (where possible) reproduce the Bib(La)TeX equivalents of what you have in your library.
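As a footnote to the café example above, here is a minimal sketch of how the two encodings can be made to compare equal with the standard `String.prototype.normalize` method. The helper name `sameAfterNormalization` is made up for illustration. This only covers canonical Unicode equivalence; it does nothing for transliteration, abbreviated first names, or non-Western name conventions.

```javascript
// Two encodings of the same visible string, as in the example above.
const precomposed = 'caf\u00E9';  // é as one code point
const decomposed = 'cafe\u0301';  // e followed by a combining acute accent

// Hypothetical helper: bring both strings to NFC before comparing.
function sameAfterNormalization(x, y) {
  return x.normalize('NFC') === y.normalize('NFC');
}

console.log(precomposed === decomposed);                      // false
console.log(sameAfterNormalization(precomposed, decomposed)); // true
console.log(decomposed.normalize('NFC').length);              // 4
```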

adam-ah commented 5 years ago

Thanks for the detailed feedback, I generally agree with what you are saying, with a few exceptions perhaps:

It's not an edit distance issue, but a prefix problem: the problems are not typos but slightly different ways the author's name (especially the first name) is stored in the articles. I suspect converting to ASCII and checking whether one name is a prefix of the other would be sufficient (I would need to double-check against my actual issue list, but so far all the issues I've seen were first-name shorthand differences).
Sounds like we agree (or at least don't disagree) that this is a problem. I see your argument that it is not directly a BBT problem as BBT just exports the data as-is. On the other hand, Zotero is not very likely to fix this soon (if ever?). What do you think the right answer is? Should I try to create a separate add-in to check author names perhaps? Checking the references for these potential problems in the final PDF is not only tedious, but very error prone as well (I'm wondering why other users didn't bump into this problem...)
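The prefix idea above could be sketched roughly as follows. This is only an illustration under assumptions: the function names are made up, the `lastName`/`firstName` fields are an assumed creator shape, and the ASCII conversion step is replaced by plain NFC normalization plus lowercasing. The comparison is token by token, because a plain string prefix test would not match `P. J.` against `Philip J.`.

```javascript
// Split a given name into lowercase tokens without periods:
// 'Philip J.' -> ['philip', 'j']
function givenNameTokens(givenName) {
  return givenName
    .normalize('NFC')
    .toLowerCase()
    .split(/[\s.]+/)
    .filter((token) => token.length > 0);
}

// Two given names are "compatible" when, position by position, one token
// is a prefix of the other; extra trailing tokens on one side are allowed.
function givenNamesCompatible(a, b) {
  const ta = givenNameTokens(a);
  const tb = givenNameTokens(b);
  const n = Math.min(ta.length, tb.length);
  if (n === 0) return false; // nothing to compare
  for (let i = 0; i < n; i++) {
    if (!ta[i].startsWith(tb[i]) && !tb[i].startsWith(ta[i])) return false;
  }
  return true;
}

// Flag creators that share a family name but are spelled differently.
// The { lastName, firstName } shape is an assumption for this sketch.
function mightBeSameAuthor(x, y) {
  return (
    x.lastName.toLowerCase() === y.lastName.toLowerCase() &&
    x.firstName !== y.firstName &&
    givenNamesCompatible(x.firstName, y.firstName)
  );
}

const dewe = (firstName) => ({ lastName: 'Dewe', firstName });
console.log(mightBeSameAuthor(dewe('Philip'), dewe('Philip J.')));   // true
console.log(mightBeSameAuthor(dewe('Philip J.'), dewe('P. J.')));    // true
console.log(mightBeSameAuthor(dewe('Philip J.'), dewe('Peter')));    // false
```

A check like this can only flag candidates for manual review: `P.` is compatible with both `Philip` and `Peter`, so false positives are expected.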

Thanks!

retorquere commented 5 years ago

Even with a prefix-only check (which again only makes sense for "western" names), I'm still looking at a half-Cartesian product. Anyhow, the major problem is that I'd have to disable the cache, regardless of how fast or slow the comparison would be, and disabling the cache would be pretty disastrous.
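To put a number on the half-Cartesian product: comparing every name with every other name once means n(n-1)/2 comparisons. A small sketch follows, with made-up function names and a placeholder fourth name (`Doe, Jane`) that is not from the thread.

```javascript
// Number of unordered pairs among n distinct names: n * (n - 1) / 2.
function pairCount(n) {
  return (n * (n - 1)) / 2;
}

// Enumerate each unordered pair exactly once (i < j).
function* unorderedPairs(names) {
  for (let i = 0; i < names.length; i++) {
    for (let j = i + 1; j < names.length; j++) {
      yield [names[i], names[j]];
    }
  }
}

// 'Doe, Jane' is a placeholder name for illustration only.
const names = ['Dewe, Philip', 'Dewe, Philip J.', 'Dewe, P. J.', 'Doe, Jane'];
console.log([...unorderedPairs(names)].length); // 6
console.log(pairCount(names.length));           // 6
console.log(pairCount(10000));                  // 49995000
```

Grouping names by family name first would shrink the number of pairs considerably, but it does not change the caching concern described above.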

It looks to me like Zotero is in a bit of a feature freeze as they focus on some major projects (like GDocs) and prep for the move to Electron, but not everything that is slow to get fixed in Zotero can fall on BBT just because BBT moves (sometimes recklessly) fast. I do think it's better handled with a separate plugin -- I've stripped a minimal plugin I made and put it here, which would be a decent start I think. The best way to go about it (read: doesn't involve UI work, I hate UI work) would be to patch the Zotero report facility to include a warning about the names.

I regularly bump into the problem and it's an annoyance. The results are not actually wrong per se, but they look sloppy, and sometimes that's all it takes to get points docked or a paper rejected. So yeah, I'd like to see this fixed, I just think BBT is not the right place to do it.

retorquere commented 5 years ago

Perhaps even a bare translator like https://github.com/retorquere/zotero-file-hierarchy would be enough. No monkey-patching the report facility, lots easier to do, could export html or csv or somesuch to outline the problematic names.

retorquere commented 5 years ago

converting to ascii and

Converting to ASCII is not a simple problem BTW. You run into the same normalization problems, and then there's Japanese, which has an ASCIIfication scheme but it's insanely complicated. I use libraries to do this (kuroshiro for Japanese and transliteration for anything else). kuroshiro is async though and cannot be used in export translators.
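To illustrate why this is not simple, here is a sketch of a naive folding approach (decompose, then strip combining marks). It is not what BBT or the libraries mentioned above do, and the function names are invented for the example. It handles accented Latin letters that have a canonical decomposition, and leaves everything else untouched.

```javascript
// Naive folding (an assumption for illustration only): decompose to NFD,
// then drop combining marks in the range U+0300..U+036F.
function naiveAsciiFold(s) {
  return s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// True only if every character is in the 7-bit ASCII range.
function isAscii(s) {
  return /^[\x00-\x7f]*$/.test(s);
}

console.log(naiveAsciiFold('Caf\u00E9'));              // Cafe
// U+00F8 (o with stroke) has no canonical decomposition, so it survives.
console.log(isAscii(naiveAsciiFold('S\u00F8ren')));    // false
// Kanji (here U+5C71 U+7530) are not touched at all.
console.log(isAscii(naiveAsciiFold('\u5C71\u7530')));  // false
```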

If you want to continue the discussion, we could either do it on gitter or on the Zotero forums. This issue is not the right place.

adam-ah commented 5 years ago

FYI https://github.com/adam-ah/zotero-detect-duplicate-authors

retorquere commented 5 years ago

Cool!

github-actions[bot] commented 3 years ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

retorquere / zotero-better-bibtex

Identify similar / same authors before exporting #1067
