ropensci / aRxiv

Programmatic interface to the Arxiv API
https://docs.ropensci.org/aRxiv

Corrupt record handling #53

Open Ariel225 opened 3 years ago

Ariel225 commented 3 years ago

Certain records seem to cause a crash. We have narrowed it down to this query, which should retrieve all records submitted in the one-minute window from 22:16 to 22:17 on January 24, 2018:

dfy <- arxiv_search(query = "submittedDate:[201801242216 TO 201801242217]", limit = 15000, batchsize = 2000)

which returns this error:

> Error in attr(results, "search_info") <- search_attributes(query, id_list,  : 
>   attempt to set an attribute on NULL

We can isolate the record, which appears to be this one: https://arxiv.org/abs/1610.04266

Searching by title produces the same error:

dfy <- arxiv_search(query = "ti:Fourfolds", limit = 1200, batchsize = 300)

We therefore think the record may be corrupt (e.g., a hidden, unintentional column delimiter).

A similar error occurs for this single-day range, though we have not isolated the individual record causing it:

dfy <- arxiv_search(query = "submittedDate:[201612030000 TO 201612040000]", limit = 15000, batchsize = 2000)

Does the query need to be modified? Can the query auto-skip corrupt records? Should arXiv be notified?
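
On the auto-skip question, a caller-side workaround might look like the sketch below: run one search per narrow window and wrap each call in `tryCatch`, so a window that errors is recorded and skipped rather than aborting the whole run. The names `safe_search_windows` and `fake_search` are hypothetical and not part of aRxiv. The search function is passed in as a parameter so the example runs offline; with the package loaded, `arxiv_search` could presumably be passed instead. The stub only imitates the failure reported above.

```r
# Hypothetical helper (not part of aRxiv): run one search per query string,
# skipping any query whose search raises an error.
safe_search_windows <- function(queries, search_fn, ...) {
  results <- list()
  failed <- character(0)
  for (q in queries) {
    # Convert any error into NULL so the loop can continue.
    res <- tryCatch(search_fn(query = q, ...), error = function(e) NULL)
    if (is.null(res)) {
      failed <- c(failed, q)
    } else {
      results[[q]] <- res
    }
  }
  list(results = results, failed = failed)
}

# Stub standing in for the real search (assumption, for illustration only):
# it fails for the one-minute window from the report and succeeds otherwise.
fake_search <- function(query, ...) {
  if (grepl("[201801242216 TO", query, fixed = TRUE)) {
    stop("attempt to set an attribute on NULL")
  }
  data.frame(id = paste0("id-", seq_len(2)), query = query)
}

queries <- c(
  "submittedDate:[201801242215 TO 201801242216]",
  "submittedDate:[201801242216 TO 201801242217]"
)
out <- safe_search_windows(queries, fake_search)
out$failed  # the queries that were skipped
```

This skips whole windows rather than individual records, so narrower windows lose less data around a problem record. It also treats a `NULL` return the same as an error, which may or may not be what a caller wants.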

kbroman commented 3 years ago

Thanks for your very clear bug report! I'll look into the details. I see that arxiv_search(query="ti:Fourfolds", limit=100) works but arxiv_search(query="ti:Fourfolds", limit=101) gives the error.

I'll follow both of your suggestions: trap such errors better and also report the problem to arxiv, if there's a problem either with the record or with their API.

kbroman commented 3 years ago

Okay, I get it. For this search, you get proper results if limit <= 77, but if limit >= 78, it returns NULL. If batchsize < limit and you're in this latter case, you get the error about assigning attributes to NULL.

 > dim(result <- arxiv_search(query="ti:Fourfolds", limit=77))
 [1] 77 15
 > dim(result <- arxiv_search(query="ti:Fourfolds", limit=78))
 [1]  0 15
 > dim(result <- arxiv_search(query="ti:Fourfolds", limit=78, batchsize=50))
 retrieved batch 1
 Error in attr(results, "search_info") <- search_attributes(query, id_list,  :
   attempt to set an attribute on NULL
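
The error message itself can be reproduced in base R without the package: assigning an attribute to `NULL` fails, which is consistent with a batch coming back as `NULL` before `search_info` is attached. The sketch below shows that mechanism and one possible guard. `set_search_info` is a hypothetical name, the plain list stands in for whatever `search_attributes()` returns (its output is not shown in this thread), and none of this is presented as the package's actual fix.

```r
# Reproduce the failure mode in isolation: attributes cannot be set on NULL.
results <- NULL
msg <- tryCatch({
  attr(results, "search_info") <- list(query = "ti:Fourfolds")
  "no error"
}, error = function(e) conditionMessage(e))
msg  # the captured error message

# Hypothetical guard (assumption, not aRxiv code): fall back to an empty
# data frame, with a warning, before attaching the attribute.
set_search_info <- function(results, info) {
  if (is.null(results)) {
    warning("search returned NULL; returning an empty data frame")
    results <- data.frame()
  }
  attr(results, "search_info") <- info
  results
}

guarded <- suppressWarnings(
  set_search_info(NULL, list(query = "ti:Fourfolds"))
)
nrow(guarded)
attr(guarded, "search_info")$query
```

A guard like this only avoids the crash; it does not explain why the API response is empty once `limit` reaches 78 for this query.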