invalid char in json text gives error in processing data when using oa_fetch

nikkilagerweij commented 1 year ago

Hi,

I'm running into an issue in R (I am still a beginner and it's the first time I'm using the OpenAlex API and R package). The error comes from an invalid character in the json text and I am not sure how to avoid this and continue the download of the records. My query is the following:

works2000 <- oa_fetch( entity = "works", from_publication_date = "2000-01-01", to_publication_date = "2000-12-31", options = list(select = c("id", "publication_year", "corresponding_author_ids", "corresponding_institution_ids", "language", "open_access", "type" )), verbose = TRUE)

I am trying to get a few columns for all publications world-wide for the year 2000. The error I get when I'm 69% through the download is:

lexical error: invalid char in json text. There was an error processing y (right here) ------^

Does anyone know how to solve this issue? Thanks!

trangdata commented 1 year ago

Hi @nikkilagerweij this is a lot of works to be downloaded. Could you narrow down to see what month you're having the issue so we can help debug, please?

nikkilagerweij commented 1 year ago

Hi Trang, thanks for looking at the issue.

I have started to narrow down and found that the error occured in April. But it's a bit strange because in general, for any month I try, I only get the fields of work ID, publication year, and type in a dataframe. It leaves out: corresponding_author_ids, corresponding_institution_ids, language, and open_access.

When I run oa_fetch for April with all fields as in the original query, I get the error. When I include only id, year, and type, it does not give back the error. However, for my thesis I will need these other fields as well.

To clarify with the code. This gives an error:

works04_2000 <- oa_fetch(
  entity = "works",
  from_publication_date = "2000-04-01",
  to_publication_date = "2000-04-30",
  options = list(select = c("id", "publication_year", "type", "open_access", 
                            "corresponding_author_ids", "corresponding_institution_ids")),
  verbose = TRUE)

lexical error: invalid char in json text. There was an error processing y (right here) ------^

But I don't get an error from this:

works04_2000 <- oa_fetch(
  entity = "works",
  from_publication_date = "2000-04-01",
  to_publication_date = "2000-04-30",
  options = list(select = c("id", "publication_year", "type")),
  verbose = TRUE)

trangdata commented 1 year ago

Oh another thing I forgot to mention: could you try works04_2000_ls <- oa_fetch(..., output = "list") and see if that works? Then, you can do works2df(works04_2000_ls) which will probably error out but allow us to do a better investigation on where the bug occurs.

Also, I know it's difficult to narrow down, but if you could find out what date it is, it would be much helpful. Currently, I see 158,107 works in April 2000 which would put a lot of stress on the API if I downloaded everything.

nikkilagerweij commented 1 year ago

So, I tried a couple of things today 😊

April suddenly works again when using all original fields in the query, but it just gives me back 3 of those fields (id, year, type). But no errors this time.

First, I tried “output = list” and then convert to dataframe (I had to use the oa2df function as works2df was somehow not recognized).

Then I left out “output = list”, to try April again, and the error was not there anymore either. So, the error that I uploaded disappeared.

But, I think the issue remains because of these fields that I try to select, because the full year of 2000 gives me back errors when selecting all fields, but does not show up if only select work id, publication year, and type as fields. Interestingly, the errors change in terms of when they show up (% of downloading).

I don’t understand why this error message comes when I select all fields? Even without the error, I never get the fields in my dataframe although they should be part of the select option right?

Hope we can figure this out ! :-)

trangdata commented 1 year ago

@nikkilagerweij You're right, the OA database updates every day, so it may be difficult to reproduce the result. I still think that the original error you got

lexical error: invalid char in json text.
There was an error processing y
(right here) ------^

comes from the package trying to convert output list to dataframe as the last step of oa_fetch. So adding "output = list" should resolve the issue, then we can convert smaller chunks in the list to dataframe to see where the problem occurs.

Another suggestion I have is to download the snapshot. This would greatly relieve the API from some stress.

nikkilagerweij commented 1 year ago

Ah alright! I can try that but I'm not sure if it gets me to the issue of not all selected fields showing up?

Actually I do have the snapshot on my computer! But when I tried to flatten the data it became a bit tricky.

Can I use this package also on the snapshot? You think it will solve the issue with the other fields not appearing?

trangdata commented 1 year ago

I have not worked with the snapshots before, but I imagine you can adapt the code for works2df to convert them to dataframe.

I'm still unsure what the issue you're experiencing with select. My guess is that one of the works does not have all the specified fields, but you can set these values to NA when you aggregate them all.

nikkilagerweij commented 1 year ago

The issue I am getting with the select function is that the following fields are suggested to use (this message comes when I choose a wrong field name):

Valid fields for select are: id, doi, title, display_name, relevance_score, publication_year, publication_date, ids, language, primary_location, type, open_access, authorships, corresponding_author_ids, corresponding_institution_ids, apc_list, apc_paid, is_authors_truncated, cited_by_count, biblio, is_retracted, is_paratext, concepts, mesh, locations_count, locations, best_oa_location, grants, referenced_works_count, referenced_works, related_works, ngrams_url, abstract_inverted_index, cited_by_api_url, counts_by_year, updated_date, created_date.>

If I then choose for example "id", "publication_year", "language", "type", "open_access", "corresponding_author_ids", "corresponding_institution_ids"

I only get back the fields of "id", "publication_year", "type"- or it gives me the invalid char error and breaks.

When I select authorships(I tested oa_fetch for the months February, March, April 2000 - no errors), it does give back the field of authorships, which includes some of the fields I am looking for. However, this is a nested dataframe. For my analyses I will need to flatten them but if I use the jsonlite function flatten e.g., it does not work.

That's why I was trying to use the "corresponding_author_ids", "corresponding_institution_ids" fields directly so I would not have to flatten them.

open_access is another field that does not show up in my dataframes. But I find on the OA doc (https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/select-fields) that this field should be possible to select?

Can you maybe confirm to me which (flat) fields are possible to select and is it possible with the package to flatten fields if needed? That would be a super helpful option to have! :)

Let me know if anything is unclear or if I should test something else first :)

trangdata commented 1 year ago

I see. The key point here is that when you use the default output (tibble) in oa_fetch, in the conversion to dataframe, these (newer) "fields" are not presented in the final dataframe as columns, which they should. I have made an issue for this #118.

Now, in the mean time, if you set output = "list", these fields would still show up in the list result. You're right, flattening this list is not trivial (hence we provide the oa2df function — but this needs updating when the API changes). Anyhow, I have made a reproducible example below (getting works with a lot of citations on April 1, 2000).

library(openalexR)
library(tibble)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)
papers_raw <- oa_fetch(
  entity = "works",
  publication_date = "2000-04-01",
  cited_by_count = ">1000",
  options = list(select = c(
    "id", "publication_year", "language", "type", "open_access",
    "corresponding_author_ids", "corresponding_institution_ids"
  )),
  verbose = TRUE,
  output = "list"
)
#> Requesting url: https://api.openalex.org/works?filter=publication_date%3A2000-04-01%2Ccited_by_count%3A%3E1000&select=id%2Cpublication_year%2Clanguage%2Ctype%2Copen_access%2Ccorresponding_author_ids%2Ccorresponding_institution_ids
#> Getting 1 page of results with a total of 128 records...

# this line is a bit cryptic but essentially changes all NULL elements of
# the nested loop to NA
papers <- openalexR:::simple_rapply(papers_raw, openalexR:::`%||%`, y = NA)

papers <- map(papers, function(x) {
  x$corresponding_institution_ids <- list(x$corresponding_institution_ids)
  x$open_access <- list(as_tibble(x$open_access))
  if (length(x$corresponding_author_ids) == 0){
    x$corresponding_author_ids <- NA  
  } else {
    x$corresponding_author_ids <- list(x$corresponding_author_ids)
  }
  x
})
paper_df <- bind_rows(papers)

# one work with more than one corresponding_author_ids will be present 
# in more than 1 row in the `unnested` dataframe below
unnested <- paper_df |> 
  tidyr::unnest(corresponding_author_ids, keep_empty = TRUE)

^{Created on 2023-06-18 with reprex v2.0.2}

nikkilagerweij commented 1 year ago

Ah that's great, thank you so much for providing me with the code! I'm trying it out and modifying it to what I need. Thanks a lot. I'll keep an eye out on the new issue you created :)

trangdata commented 1 year ago

@nikkilagerweij Good luck! And let me know if you run into any issues re unnesting. I'd look into some docs on list columns if you need examples of how to deal with nested columns.

ropensci / openalexR

invalid char in json text gives error in processing data when using oa_fetch #117