mjwestgate / revtools

Tools to support research synthesis in R
https://revtools.net

Failed duplicate removal in non-English articles #15

Closed befriendabacterium closed 5 years ago

befriendabacterium commented 5 years ago

I noticed that your duplicate removal functions sometimes are not performing as intended, and one of the main reasons is non-English articles, which often have a square-bracketed translation at the end, like so: [nicht-englische Artikel, die oft eine eckige Klammer am Ende haben] (German for roughly the preceding phrase). There might be a way to fix this and improve the accuracy of the function fairly easily (e.g. flag a duplicate if ~50% of the words match?), though be careful, as some translated titles just start with square brackets, as you've probably seen. Also, I'm not sure whether your duplicate removal system uses DOIs, but I found that removing duplicate DOIs helped too (though not for some of these non-English articles, especially Chinese articles, which often don't have DOIs). Hope this helps!
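For example, the DOI case could be handled with a simple normalisation step before checking for duplicates. This is just a base-R sketch to illustrate the idea (normalize_doi is a hypothetical helper, not a revtools function): strip the resolver prefix and lower-case, so the two forms of the same DOI compare as equal.

```r
# Sketch only: normalise DOIs so resolver-prefixed and bare forms match.
normalize_doi <- function(doi) {
  doi <- tolower(doi)
  # drop a leading "http(s)://dx.doi.org/" or "http(s)://doi.org/"
  sub("^https?://(dx\\.)?doi\\.org/", "", doi)
}

dois <- c(
  "http://dx.doi.org/10.1016/j.biocon.2017.04.031",
  "10.1016/j.biocon.2017.04.031",
  "10.1016/j.biocon.2013.07.028"
)
duplicated(normalize_doi(dois))
# [1] FALSE  TRUE FALSE
```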

mjwestgate commented 5 years ago

Thanks for this, Matt - it really helped me work out what the key features of the de-duplication functions in revtools should be. Both DOIs and the multi-language case you mention have the property that a subset of the text should match perfectly. You can now check for this using fuzz_partial_ratio as follows:

# create some example data
data <- data.frame(
  dois = c(
    "http://dx.doi.org/10.1016/j.biocon.2017.04.031",
    "10.1016/j.biocon.2017.04.031", # same doi as [1]
    "10.1016/j.biocon.2013.07.028"
  )
)

# check a pair of values
fuzz_partial_ratio(data$dois[1], data$dois[2]) # = 0, i.e. a perfect partial match

# apply to a full dataset of dois
find_duplicates(
  data = data,
  match_variable = "dois",
  group_variable = NULL,
  match_function = "fuzzdist",
  method = "fuzz_partial_ratio",
  threshold = 0
)
# returns: 1 1 2 (i.e. first two dois are matched, third is different)

# check this works for your text example:
fuzz_partial_ratio(
  "non-English articles which often have a square brackets translation at the end as so [nicht-englische Artikel, die oft eine eckige Klammer am Ende haben]",
  "non-English articles which often have a square brackets translation at the end as so"
)
# again, distance = 0

Of course, there is a risk that some articles will share a small subset of text in their titles by chance, and therefore be matched by find_duplicates in error; so some care is required here.
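To see how that can happen, note that a partial match treats any title wholly contained in another as a perfect match. The substring check below is only a base-R illustration of that behaviour, not the revtools implementation; the titles are made up.

```r
# A short title that happens to be a substring of a longer, different
# paper's title would score as a perfect partial match (distance 0).
short_title <- "Conservation genetics"
long_title  <- "Conservation genetics of the giant panda"

grepl(short_title, long_title, fixed = TRUE)
# [1] TRUE - i.e. these two distinct titles would be flagged as duplicates
```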