globbestael / DedupEndNote

Deduplication of EndNote RIS files
http://dedupendnote.nl
Apache License 2.0

DedupEndNote

Deduplication of EndNote RIS files:

DedupEndNote is available at http://dedupendnote.nl:9777

Building your own version

DedupEndNote is a Java web application (Java 17, Spring Boot 2.7, fat jar). It can be started locally with:

    java -jar DedupEndNote-[VERSION].jar

and the application will be available at

    http://localhost:9777

Why DedupEndNote?

Deduplication in EndNote misses many duplicate records. Building and maintaining a Journals List within EndNote can partly solve this problem, but many cases remain where EndNote is too unforgiving when comparing records. Some bibliographic databases offer deduplication for their own databases (OVID: Medline and EMBASE), but this does not help PubMed, Cochrane or Web of Science users.

DedupEndNote deduplicates an EndNote RIS file and writes a new RIS file with the unique records, which can be imported into a new EndNote database. It is more forgiving than EndNote itself when comparing records, and tests have shown that it identifies many more duplicates (see below under "Performance").

The program has been tested on EndNote databases with records from:

The program has been tested with files with up to 50,000 records.

What does DedupEndNote do?

1. Deduplicate

Each pair of records is compared in 5 different ways. The general rule is:

| Comparison | Result | Action |
|---|---|---|
| 1 ... 5 | YES (or insufficient data for comparison) | go to next comparison if present, else mark the records as duplicates |
| 1 ... 5 | NO | stop comparisons for this pair of records |
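
The general rule can be sketched as a chain of checks that stops at the first NO. This is a minimal illustration, not DedupEndNote's actual code: the `Rec` type and the two simplified comparisons are hypothetical stand-ins for the five real ones (in particular, the fallback to DOIs in comparison 2 is left out).

```java
import java.util.List;
import java.util.function.BiPredicate;

class ComparisonChain {
    /** Hypothetical minimal record; a real RIS record has many more fields. */
    record Rec(Integer year, String startPage) {}

    // Each comparison answers YES (true) or NO (false).
    // Insufficient data (a missing field) counts as YES.
    static final List<BiPredicate<Rec, Rec>> COMPARISONS = List.of(
            (a, b) -> a.year() == null || b.year() == null
                    || Math.abs(a.year() - b.year()) <= 1,
            (a, b) -> a.startPage() == null || b.startPage() == null
                    || a.startPage().equals(b.startPage()));

    /** A pair is a duplicate only if every comparison answers YES. */
    static boolean isDuplicate(Rec a, Rec b) {
        // allMatch short-circuits: later comparisons are skipped after the first NO
        return COMPARISONS.stream().allMatch(c -> c.test(a, b));
    }
}
```

Ordering the list from cheap to expensive checks means most non-duplicate pairs are rejected before the costlier string comparisons run.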

The following comparisons are used (in this order, chosen for performance reasons):

  1. Publication year: Are they at most 1 year apart?
    • Preprocessing: publication years before 1900 are removed (see insufficient data)
    • Insufficient data: Records without a publication year are compared to all records unless they have been identified as a duplicate.
  2. Starting page or DOI: Are they the same?
    If the starting pages are different or one or both are absent, the DOIs are compared.
    • Preprocessing: Article number is treated as a starting page if starting page itself is empty or contains "-".
    • Preprocessing: Starting pages are compared only for number: "S123" and "123" are considered the same.
    • Preprocessing: In DOIs 'http://dx.doi.org/', 'http://doi.org/', ... are left out. URL- and HTML-encoded DOIs are decoded ('10.1002/(SICI)1098-1063(1998)8:6&lt;627::AID-HIPO5&gt;3.0.CO;2-X' becomes '10.1002/(SICI)1098-1063(1998)8:6<627::AID-HIPO5>3.0.CO;2-X'). DOIs are lowercased.
    • Insufficient data: If one or both DOIs are missing and one or both of the starting pages are missing, the answer is YES. This is important because of PubMed ahead of print publications.
  3. Authors: Is the Jaro-Winkler similarity of the authors > 0.67?
    • Preprocessing: The author "Anonymous," is treated as no author.
    • Preprocessing: Group author names are removed. "Author" names which contain "consortium", "grp", "group", "nct" or "study" are considered group author names.
    • Preprocessing: First names are reduced to initials ("Moorthy, Ranjith K." to "Moorthy, R. K.").
    • Preprocessing: All authors from each record are joined by "; ".
    • Insufficient data: If one or both records have no authors, the answer is YES (except if one of the records is a reply (see below) and one of the records has no starting page or DOI).
  4. Title: Is the Jaro-Winkler similarity of (one of) the normalized titles > 0.9?
    The fields Original publication (OP), Short Title (ST), Title (TI) and sometimes Book section (T3, see below) are treated as titles. Because the Jaro-Winkler similarity algorithm puts a heavy penalty on differences at the beginning of a string, the normalized titles are also reversed.
    • Preprocessing: The titles are normalized (converted to lower case, text between "<...>" removed, all characters which are not letters or numbers are replaced by a space character, ...).
    • Insufficient data: If one of the records is a reply (see below), the titles are not compared / the answer is YES (but the Jaro-Winkler similarity of the authors should be > 0.75 and the comparison between the journals is more strict).

      Reply: a publication is considered a reply if the title (field TI) contains "reply", or contains "author(...)respon(...)", or is nothing but "response" (all case insensitive).
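
Some of the preprocessing steps for comparisons 2 and 3 can be sketched as follows. The class and method names are hypothetical, and the details (which DOI prefixes are stripped, which HTML entities are decoded, how hyphenated first names are handled) are assumptions rather than DedupEndNote's actual rules.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class Preprocessing {
    // Assumed set of resolver prefixes; the text only lists two followed by "..."
    private static final Pattern DOI_PREFIX =
            Pattern.compile("^https?://(dx\\.)?doi\\.org/", Pattern.CASE_INSENSITIVE);
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** Decodes URL- and HTML-encoded DOIs, strips the resolver prefix, lowercases. */
    static String normalizeDoi(String raw) {
        // Caveat: URLDecoder also turns '+' into a space
        String doi = URLDecoder.decode(raw.trim(), StandardCharsets.UTF_8);
        doi = doi.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&");
        doi = DOI_PREFIX.matcher(doi).replaceFirst("");
        return doi.toLowerCase(Locale.ROOT);
    }

    /** Keeps only the first run of digits, so "S123" and "123" compare equal. */
    static String pageNumber(String startPage) {
        Matcher m = DIGITS.matcher(startPage);
        return m.find() ? m.group() : "";
    }

    /** Reduces first names to initials: "Moorthy, Ranjith K." becomes "Moorthy, R. K." */
    static String toInitials(String author) {
        int comma = author.indexOf(',');
        if (comma < 0) return author.trim();
        String lastName = author.substring(0, comma).trim();
        String firstNames = author.substring(comma + 1).trim();
        if (firstNames.isEmpty()) return lastName;
        String initials = Arrays.stream(firstNames.split("\\s+"))
                .map(n -> n.charAt(0) + ".")
                .collect(Collectors.joining(" "));
        return lastName + ", " + initials;
    }
}
```

With these helpers, two records whose DOIs differ only in resolver prefix, encoding or case end up with identical normalized DOIs.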

T3 field: EMBASE (OVID) in particular uses this field for (1) the conference title (majority of cases), (2) an alternative journal title, and (3) the original (non-English) title. Case 1 (identified as containing a number or "Annual", "Conference", "Congress", "Meeting" or "Society") is skipped. All other T3 fields are treated as journals and as titles.
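
The T3 rule could be expressed as a single pattern test. This is a sketch only: the class name is hypothetical, and the text does not say whether the real check is case-sensitive or matches whole words, so plain case-sensitive substring matching is assumed here.

```java
import java.util.regex.Pattern;

class T3Field {
    // Any digit, or one of the listed words (assumed: case-sensitive substring match)
    private static final Pattern CONFERENCE =
            Pattern.compile("\\d|Annual|Conference|Congress|Meeting|Society");

    /** True if the T3 value looks like a conference title and should be skipped. */
    static boolean looksLikeConference(String t3) {
        return CONFERENCE.matcher(t3).find();
    }
}
```

Under this sketch, a T3 value for which `looksLikeConference` returns false would be added to both the journals and the titles of the record.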

  5. ISSN or Journal: Are they the same (ISSN) or similar (Journal)?
    The fields Journal / Book Title (T2), Alternate Journal (J2) and sometimes Book section (T3, see below) are treated as journals, ISBNs as ISSNs. All ISSNs and journal titles (including abbreviations) in the records are used. Abbreviated and full journal titles are compared in a sensible way (see examples below). If the ISSNs are different or one or both records have no ISSN, the journals are compared.
    • Preprocessing: ISSNs are normalized (dashes are removed, lowercased). For ISBN-10 the first 9 digits are used, for ISBN-13 the 9 digits starting at position 4.
    • Preprocessing: Journal titles of the form "Zhonghua wai ke za zhi [Chinese journal of surgery]" or "Zhonghua wei chang wai ke za zhi = Chinese journal of gastrointestinal surgery" or "The Canadian Journal of Neurological Sciences / Le Journal Canadien Des Sciences Neurologiques" are split into 2 journal titles.
    • Preprocessing: the journal titles are normalized (hyphens, dots and apostrophes are replaced with space, end part between round or square brackets is removed, initial article is removed, ...).
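
The ISSN/ISBN normalization and the splitting of journal titles can be sketched as below. The names are hypothetical, "position 4" is read as 1-based, and the splitting patterns are assumptions derived from the three example titles only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JournalIds {
    private static final Pattern BRACKETED = Pattern.compile("^(.+?)\\s*\\[(.+)\\]\\s*$");

    /** Removes dashes and lowercases; ISBNs are cut down to their 9 core digits. */
    static String normalize(String raw) {
        String s = raw.replace("-", "").trim().toLowerCase(Locale.ROOT);
        if (s.length() == 10) return s.substring(0, 9);   // ISBN-10: first 9 digits
        if (s.length() == 13) return s.substring(3, 12);  // ISBN-13: 9 digits from position 4
        return s;                                         // ISSN
    }

    /** Splits "A [B]", "A = B" and "A / B" into two journal titles. */
    static List<String> splitTitles(String title) {
        Matcher m = BRACKETED.matcher(title);
        if (m.matches()) return List.of(m.group(1), m.group(2));
        return Arrays.asList(title.split("\\s+[=/]\\s+", 2));
    }
}
```

Cutting both ISBN forms down to the same 9 digits drops the check digit and the 3-digit ISBN-13 prefix, so the ISBN-10 and ISBN-13 of one book compare as equal.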

If two records get 5 YES answers, they are considered duplicates. Only the first record of a set of duplicate records is copied to the output file.
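
To illustrate the title comparison (comparison 4), the sketch below pairs a textbook Jaro-Winkler implementation with the title normalization and the reversed-title check. It is an assumption-laden illustration: DedupEndNote may well use a library implementation instead, the class and method names are hypothetical, and accepting a pair when either the normal or the reversed titles exceed 0.9 is one reading of the text.

```java
import java.util.Locale;

class TitleSimilarity {
    /** Lowercases, drops text between <...>, turns other non-alphanumerics into spaces. */
    static String normalize(String title) {
        return title.toLowerCase(Locale.ROOT)
                .replaceAll("<[^>]*>", " ")
                .replaceAll("[^\\p{L}\\p{N}]+", " ") // assumed: runs collapse to one space
                .trim();
    }

    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int len1 = s1.length(), len2 = s2.length();
        if (len1 == 0 || len2 == 0) return 0.0;
        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] m1 = new boolean[len1], m2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int lo = Math.max(0, i - window), hi = Math.min(len2 - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!m2[j] && s1.charAt(i) == s2.charAt(j)) {
                    m1[i] = m2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        int transpositions = 0, k = 0;
        for (int i = 0; i < len1; i++) {
            if (!m1[i]) continue;
            while (!m2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) transpositions++;
            k++;
        }
        double m = matches;
        return (m / len1 + m / len2 + (m - transpositions / 2.0) / m) / 3.0;
    }

    /** Jaro similarity boosted for a common prefix of up to 4 characters. */
    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int max = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < max && s1.charAt(prefix) == s2.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1 - j);
    }

    static boolean similarTitles(String t1, String t2) {
        String a = normalize(t1), b = normalize(t2);
        String ra = new StringBuilder(a).reverse().toString();
        String rb = new StringBuilder(b).reverse().toString();
        return jaroWinkler(a, b) > 0.9 || jaroWinkler(ra, rb) > 0.9;
    }
}
```

The reversed comparison helps with titles that differ only at the start, such as "The effect of aspirin on stroke" and "Effect of aspirin on stroke.": reversed, they share a long common prefix, which Jaro-Winkler rewards.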

2. Enrich the records

When writing the output file (except in Mark Mode), the following fields can be changed:

The output file is a new RIS file which can be imported into a new EndNote database.

DedupEndNote is slower than EndNote in deduplicating records because its comparisons are more time-consuming. EndNote can deduplicate an EndNote database of ca. 15,000 records in less than 5 seconds. DedupEndNote needs around 20 seconds to deduplicate the export file in RIS format (115MB).

Performance

Data are from:

| Name | Tool | True pos | False neg | Sensitivity | True neg | False pos | Specificity | Accuracy |
|---|---|---|---|---|---|---|---|---|
| SRA: Cytology screening (1856 rec) | EndNote X9 | 885 | 518 | 63.1% | 452 | 1 | 99.8% | 72.0% |
| | SRA-DM | 1265 | 139 | 90.1% | 452 | 0 | 100.0% | 92.5% |
| | DedupEndNote | 1359 | 61 | 95.7% | 436 | 0 | 100.0% | 96.8% |
| SRA: Haematology (1415 rec) | EndNote | 159 | 87 | 64.6% | 1165 | 4 | 99.7% | 93.6% |
| | SRA-DM | 208 | 38 | 84.6% | 1169 | 0 | 100.0% | 97.3% |
| | DedupEndNote | 222 | 14 | 94.1% | 1179 | 0 | 100.0% | 99.0% |
| SRA: Respiratory (1988 rec) | EndNote X9 | 410 | 391 | 51.2% | 1185 | 2 | 99.8% | 80.2% |
| | SRA-DM | 674 | 125 | 84.4% | 1189 | 0 | 100.0% | 93.7% |
| | DedupEndNote | 766 | 34 | 95.7% | 1188 | 0 | 100.0% | 97.8% |
| SRA: Stroke (1292 rec) | EndNote X9 | 372 | 134 | 73.5% | 784 | 2 | 99.7% | 89.5% |
| | SRA-DM | 426 | 81 | 84.0% | 785 | 0 | 100.0% | 93.7% |
| | DedupEndNote | 503 | 7 | 98.6% | 782 | 0 | 100.0% | 99.5% |
| McKeown (3130 rec) | OVID | 1982 | 90 | 95.7% | 1058 | 0 | 100.0% | 97.1% |
| | EndNote | 1541 | 531 | 74.4% | 850 | 208 | 80.3% | 76.4% |
| | Mendeley | 1877 | 195 | 90.6% | 1041 | 17 | 98.4% | 93.2% |
| | Zotero | 1473 | 599 | 71.1% | 1038 | 20 | 98.1% | 80.2% |
| | Covidence | 1952 | 120 | 94.2% | 1056 | 2 | 99.8% | 96.1% |
| | Rayyan | 2023 | 49 | 97.6% | 1006 | 52 | 95.1% | 96.8% |
| | DedupEndNote | 2010 | 62 | 97.0% | 1058 | 0 | 100.0% | 98.0% |
| BIG_SET (4923 rec) | DedupEndNote | 3685 | 271 | 93.1% | 966 | 1 | 99.9% | 94.5% |
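
The percentage columns appear to follow the standard definitions: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), and accuracy = (TP + TN) / all records. The sketch below, with a hypothetical class name, reproduces them from the four counts.

```java
class Metrics {
    /** Share of true duplicates that were found, in percent. */
    static double sensitivity(int tp, int fn) {
        return 100.0 * tp / (tp + fn);
    }

    /** Share of unique records that were correctly kept, in percent. */
    static double specificity(int tn, int fp) {
        return 100.0 * tn / (tn + fp);
    }

    /** Share of all records that were classified correctly, in percent. */
    static double accuracy(int tp, int fn, int tn, int fp) {
        return 100.0 * (tp + tn) / (tp + fn + tn + fp);
    }
}
```

For the DedupEndNote row of the McKeown set (2010, 62, 1058, 0), this gives 2010 / 2072 ≈ 97.0%, 1058 / 1058 = 100.0% and 3068 / 3130 ≈ 98.0%, matching the table.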

Limitations