rempsyc / busara_dashboard

The Missing Majority in Behavioural Science Dashboard
https://remi-theriault.com/dashboards/missing_majority

Exploring OpenAlex Interface and Database #46

Closed rempsyc closed 3 months ago

rempsyc commented 4 months ago

I will use this issue to explore and document potential problems with the OpenAlex interface and database.

rempsyc commented 4 months ago

A first observation is that OpenAlex, like other databases, also returns several candidate journals when searching for one specific journal. However, each candidate comes with a unique OpenAlex ID, so we can filter the list, identify the correct ID, and use that ID directly in future requests. This matching work therefore only has to be done once per journal. It also removes the inherent ambiguity in journal names, which was a big issue with easyPubMed. Another benefit is that openalexR seems to ignore capitalization, which is great because in easyPubMed having the wrong letters capitalized would lead to missing journals.
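As a rough sketch of the "match once, then reuse the ID" idea: the table below stores a hand-checked title-to-ID mapping and looks titles up case-insensitively. `journal_ids`, `lookup_source_id()`, and `fetch_journal_works()` are hypothetical names, not part of openalexR. The single entry is the ID I consider correct for "Journal of abnormal psychology", and the `journal` filter is the one I use in the reprex further down.

```r
# Hand-curated lookup: journal title -> OpenAlex source ID.
# Only IDs that were checked manually should go in here.
journal_ids <- c(
  "Journal of abnormal psychology" = "https://openalex.org/S121947241"
)

# Hypothetical helper: case-insensitive lookup in the curated table.
# Returns NA for titles that have not been curated yet.
lookup_source_id <- function(title, table = journal_ids) {
  hit <- match(tolower(trimws(title)), tolower(names(table)))
  unname(table[hit])
}

# Hypothetical wrapper: fetch works by ID instead of by (ambiguous) name.
# Only defined here, not called, because it needs openalexR and network access.
fetch_journal_works <- function(source_id, ...) {
  openalexR::oa_fetch(entity = "works", journal = source_id, ...)
}

lookup_source_id("JOURNAL OF ABNORMAL PSYCHOLOGY")
```

An `NA` from `lookup_source_id()` would then be the signal that a journal still needs its one-time manual matching.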

There is still some ambiguity in the titles of some journals. For example, searching for "Journal of abnormal psychology" provides the following results:

  id                               display_name                                                      works_count
  <chr>                            <chr>                                                                   <int>
1 https://openalex.org/S121947241  "Journal of abnormal psychology"                                         7645
2 https://openalex.org/S58220144   "Journal of abnormal child psychology"                                   3173
3 https://openalex.org/S45104564   "Journal of abnormal and social psychology"                              4269
4 https://openalex.org/S188273073  "\u0098The \u009cJournal of abnormal psychology and social psychology"    272
5 https://openalex.org/S4210220944 "\u0098The \u009cJournal of abnormal psychology"                          148

So, for example, the right journal is no. 1 and nos. 2-4 are incorrect, but what about no. 5? Even though it has a small works count, it seems to have the right title, except for the Unicode symbols around "The", which could be a simple typo. Excluding it could make us miss some precious journal data.
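One possible way to flag cases like no. 5 automatically is to normalize titles before comparing them. The sketch below is only an assumption about what "same title" should mean: it drops the C1 control characters (U+0080 to U+009F), lowercases, and strips a leading "the". `normalize_title()` is a hypothetical helper, and a match only marks candidates for manual review; it does not prove that they are the same journal.

```r
# Candidates copied from the search results above (IDs shortened to their key).
candidates <- data.frame(
  id = c("S121947241", "S58220144", "S45104564", "S188273073", "S4210220944"),
  display_name = c(
    "Journal of abnormal psychology",
    "Journal of abnormal child psychology",
    "Journal of abnormal and social psychology",
    "\u0098The \u009cJournal of abnormal psychology and social psychology",
    "\u0098The \u009cJournal of abnormal psychology"
  ),
  works_count = c(7645L, 3173L, 4269L, 272L, 148L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove C1 control characters, lowercase, trim,
# and strip a leading "the" (an assumption about what should match).
normalize_title <- function(x) {
  stripped <- vapply(x, function(s) {
    cp <- utf8ToInt(s)
    intToUtf8(cp[cp < 0x80 | cp > 0x9F])
  }, character(1), USE.NAMES = FALSE)
  sub("^the\\s+", "", tolower(trimws(stripped)))
}

candidates$key <- normalize_title(candidates$display_name)

# Rows whose normalized title occurs more than once need a manual check.
dup_keys <- candidates$key[duplicated(candidates$key)]
candidates[candidates$key %in% dup_keys, c("id", "display_name", "works_count")]
```

With these five rows, only nos. 1 and 5 share a normalized title, so they are the pair that would be flagged for review.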

rempsyc commented 4 months ago

No records found for the following journals:

It exists in their system here, but the API provides no results:

data <- openalexR::oa_fetch(
  entity = "works",
  journal = "https://openalex.org/W4251276908"
)
#> Warning in oa_request(oa_query(filter = filter_i, multiple_id = multiple_id, :
#> No records found!

Created on 2024-05-14 with reprex v2.1.0
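
One thing that may be worth checking in the reprex above: OpenAlex IDs start with a letter that encodes the entity type (`W` for works, `S` for sources, `A` for authors, `I` for institutions), and the ID passed to `journal` starts with `W`. A small guard like the one below could catch this before a request is sent. `openalex_entity_type()` is a hypothetical helper, not an openalexR function, and it only covers these four prefixes.

```r
# Hypothetical helper: infer the entity type from the first letter of an
# OpenAlex ID, given either as a full URL or as a bare key.
openalex_entity_type <- function(id) {
  key <- toupper(sub("^https?://openalex\\.org/", "", id))
  types <- c(W = "work", S = "source", A = "author", I = "institution")
  unname(types[substr(key, 1, 1)])  # NA for prefixes not listed above
}

openalex_entity_type("https://openalex.org/W4251276908")
openalex_entity_type("https://openalex.org/S121947241")
```

If the first call reports a work rather than a source, the empty result may come from the ID itself rather than from missing records.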


For "Journal of African development", we get two similarly named journals (apparently they're different):

  id                               display_name                       works_count
  <chr>                            <chr>                                    <int>
1 https://openalex.org/S4210182335 Journal of African development             211
2 https://openalex.org/S2764617097 The Journal of African Development         211

Similarly with PNAS and some others:

  id                               display_name                                                               works_count
  <chr>                            <chr>                                                                            <int>
1 https://openalex.org/S125754415  Proceedings of the National Academy of Sciences of the United States of America 158129
2 https://openalex.org/S4306524276 Proceedings of the National Academy of Sciences                                    108

Collabra:

  id                               display_name         works_count
  <chr>                            <chr>                      <int>
1 https://openalex.org/S4210175756 Collabra. Psychology         370
2 https://openalex.org/S2737007392 Collabra                      29

rempsyc commented 4 months ago

OpenAlex sometimes also provides journal abbreviations, but they are unreliable and often incorrect, so I am planning to write the journal abbreviations myself, just like the acronyms, so that they look better on the waffle plots, for example.

That brings problems of abbreviation ambiguity. For example, "Journal of abnormal psychology" and "Journal of applied psychology" both abbreviate to "JAP", "Journal of Educational Psychology" and "Journal of Economic Psychology" both abbreviate to "JEP", etc.
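
To see which journals would clash before writing the abbreviations by hand, something like the following could work. It is only a sketch: `make_acronym()` is a hypothetical helper that takes the first letter of each word, the stop-word list is an assumed and incomplete one, and "Collabra. Psychology" is included just as a non-colliding example.

```r
journals <- c(
  "Journal of abnormal psychology",
  "Journal of applied psychology",
  "Journal of Educational Psychology",
  "Journal of Economic Psychology",
  "Collabra. Psychology"
)

# Words skipped when building initials (an assumed, incomplete list).
stop_words <- c("of", "the", "and", "in", "for")

# Hypothetical helper: first letter of each non-stop word, in upper case.
make_acronym <- function(title) {
  words <- strsplit(tolower(title), "[^a-z]+")[[1]]
  words <- words[nzchar(words) & !(words %in% stop_words)]
  toupper(paste(substr(words, 1, 1), collapse = ""))
}

acronyms <- vapply(journals, make_acronym, character(1))

# Group titles by acronym and keep only acronyms shared by several journals.
collisions <- split(names(acronyms), acronyms)
collisions <- collisions[lengths(collisions) > 1]
collisions
```

Any journal that appears in `collisions` would then need a distinct hand-written abbreviation instead of its plain initials.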