massimoaria / bibliometrix

An R-tool for comprehensive science mapping analysis. A package for quantitative research in scientometrics and bibliometrics.
https://www.bibliometrix.org

Error in computing Local Citations with Scopus Input #467

Open jwhwaal opened 1 month ago

jwhwaal commented 1 month ago

I have noticed an error when computing Local Citations with a Scopus input file. The problem does not present itself when using a Web of Science input file. In the histNetwork function there is a filter to remove false positives, based on the PP field. However, not all papers have a PP value (page numbers). For example, Journal of Cleaner Production uses only an article identifier, as in: Van der Waal, Johannes WH, and Thomas Thijssens. "Corporate involvement in sustainable development goals: Exploring the territory." Journal of Cleaner Production 252 (2020): 119625. The filter is applied here:

```r
CR <- CR %>% dplyr::filter(!is.na(PY), (substr(CR$PP, 1, 1) %in% 0:9))
```

and here:

```r
CR <- CR %>%
  left_join(M_merge, join_by("PY", "AU"), relationship = "many-to-many") %>%
  dplyr::filter(!is.na(Included)) %>%
  group_by(PY, AU) %>%
  mutate(toRemove = ifelse(!is.na(PP.y) & PP.x != PP.y, TRUE, FALSE)) %>% # to remove FALSE POSITIVE
```
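To illustrate the failure mode (this snippet is mine, not bibliometrix code): when PP is missing, `substr(NA, 1, 1)` yields NA, and `NA %in% 0:9` evaluates to FALSE, so the reference is silently dropped by the first filter:

```r
# Minimal illustration of the filter's behaviour (not bibliometrix code):
# a reference without page numbers (PP = NA) is removed, because
# substr(NA, 1, 1) is NA and NA %in% 0:9 evaluates to FALSE.
CR <- data.frame(PY = c(2020, 2020), PP = c("45-67", NA))
CR[substr(CR$PP, 1, 1) %in% 0:9, ]
#>     PY    PP
#> 1 2020 45-67
```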

massimoaria commented 1 month ago

Unfortunately, Scopus has changed the way it stores references and, to date, there is no way to identify them uniquely (the reference string does not include the DOI!). The use of the PP field is just one 'good enough' strategy we have chosen to adopt, but unfortunately it does lead to errors in some cases, errors that would also occur if we adopted a different strategy.

We are, of course, open to proposals for alternative strategies for identifying local citations in Scopus.

jwhwaal commented 1 month ago

Indeed, the reference string does not include the DOI and so lacks a unique key. I have coded a solution that works quite well, but it is not fast: extract the title field from each reference (assuming the title is the longest substring, which is most often the case), then compute the (cosine or Levenshtein) similarity with the TI fields of the local corpus. I got very good results on my corpus. To make it faster and to avoid problems with truncated titles (which I encountered in one instance), the similarity matching could be run on truncated titles (e.g., the first 100 characters). A rough sketch of the idea follows.
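Something along these lines (an illustrative sketch, not my production code; `match_local_citations` is a hypothetical helper, and it relies on the stringdist package):

```r
# Rough sketch of the title-matching idea (illustrative helper, not production code).
# Assumes M is a bibliometrix-style data frame with TI (title) and CR (cited
# references, ";"-separated) fields, and that within each reference string the
# longest "."-separated chunk is the title.
library(stringdist)

match_local_citations <- function(M, max_chars = 100, max_dist = 0.15) {
  refs <- trimws(unlist(strsplit(M$CR, ";")))
  refs <- refs[nzchar(refs)]
  # heuristic: take the longest dot-separated chunk of each reference as its title
  ref_titles <- vapply(strsplit(refs, "\\."), function(parts) {
    parts <- trimws(parts)
    parts[which.max(nchar(parts))]
  }, character(1))
  # truncate to speed up matching and sidestep truncated-title mismatches
  ref_titles <- toupper(substr(ref_titles, 1, max_chars))
  loc_titles <- toupper(substr(M$TI, 1, max_chars))
  # cosine distance on character 3-grams; rows = references, cols = local titles
  D <- stringdistmatrix(ref_titles, loc_titles, method = "cosine", q = 3)
  hits <- which(D <= max_dist, arr.ind = TRUE)
  data.frame(reference = refs[hits[, "row"]], cited_title = M$TI[hits[, "col"]])
}
```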

massimoaria commented 2 weeks ago

Thanks, I appreciate your suggestion. We have already tried this solution, but it is too computationally expensive for daily use. We also tried identifying publication years, dividing the references into subgroups by year, and then computing similarity only among titles associated with the same publication year. That solution is faster than the previous one but still unsatisfactory.
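In outline, the year-blocking approach looks something like this (a simplified sketch with assumed column names, not our actual implementation):

```r
# Illustrative sketch of the year-blocking variant (assumed column names).
# ref_df holds parsed references with columns title and PY;
# loc_df holds the local corpus with columns TI and PY.
library(stringdist)

match_by_year <- function(ref_df, loc_df, max_dist = 0.15) {
  out <- lapply(intersect(ref_df$PY, loc_df$PY), function(y) {
    r <- ref_df[ref_df$PY == y, , drop = FALSE]
    l <- loc_df[loc_df$PY == y, , drop = FALSE]
    # similarity is computed only within one publication year, replacing a
    # single N x M distance matrix with much smaller per-year blocks
    D <- stringdistmatrix(toupper(r$title), toupper(l$TI), method = "cosine", q = 3)
    hits <- which(D <= max_dist, arr.ind = TRUE)
    if (nrow(hits) == 0) return(NULL)
    data.frame(reference = r$title[hits[, "row"]], cited_title = l$TI[hits[, "col"]])
  })
  do.call(rbind, out)
}
```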