CTGOV2 API call does not filter by study sponsor

frederikziebell commented 8 months ago

Consider the following example. With the old API (CTGOV), the sponsor filter is respected, whereas with the new one (CTGOV2), it is ignored and all studies would be downloaded.

library("tibble")
library("ctrdata")

q1 <- tibble(`query-term` = "spons=Pfizer", `query-register` = "CTGOV")

ctrLoadQueryIntoDb(
  queryterm = q1,
  only.count = TRUE
)$n
# 5639

q2 <- tibble(`query-term` = "spons=Pfizer", `query-register` = "CTGOV2")

ctrLoadQueryIntoDb(
  queryterm = q2,
  only.count = TRUE
)$n
# 470145
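
A rough sanity check for this kind of discrepancy could look like the sketch below. It only compares the two counts reported above; `counts_disagree()` and its 5% tolerance are hypothetical and not part of ctrdata.

```r
# Counts reported above for "spons=Pfizer"
n_ctgov  <- 5639    # CTGOV (classic)
n_ctgov2 <- 470145  # CTGOV2

# Hypothetical helper (not part of ctrdata): TRUE if the two counts differ
# by more than `tolerance`, relative to the smaller count. A large relative
# difference suggests that one register did not apply the filter.
counts_disagree <- function(n_a, n_b, tolerance = 0.05) {
  if (n_a == n_b) return(FALSE)
  abs(n_a - n_b) / min(n_a, n_b) > tolerance
}

counts_disagree(n_ctgov, n_ctgov2)
```

The tolerance is an arbitrary choice; small differences between registers can be legitimate, so only a large gap is treated as suspicious here.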

rfhb commented 8 months ago

Thanks for reporting, nice catch! A fix is now available:

Corrected the translation of some fields from browser URL to API call for CTGOV2, including sponsor and location; added tests for the translation of all parameters
Please try 0447b0a4bf08feedafa849bdc4061dcfaec85087 with devtools::install_github("rfhb/ctrdata")
It will be included in the next release in the coming days

As an aside, in case you did not know and have no particular need for the tibble, this also works:

library("ctrdata")
ctrLoadQueryIntoDb(
  queryterm = "spons=NameOfSponsor",
  register = "CTGOV2",
  only.count = TRUE
)$n

frederikziebell commented 8 months ago

Thanks, also for pointing out the shorter syntax; it's working now. For some companies, however, I see differences in the number of returned results between the two registers:

ctrLoadQueryIntoDb(
  queryterm = "spons=Janssen",
  register = "CTGOV",
  only.count = TRUE
)$n
# 2352

ctrLoadQueryIntoDb(
  queryterm = "spons=Janssen",
  register = "CTGOV2",
  only.count = TRUE
)$n
# 2347

But I don't know if that's because the new API accesses the data differently from the CTGOV database, or if it's an issue with ctrdata.

rfhb commented 8 months ago

Thanks. You will find the same numbers when opening this search query in the browser, as shown below. I have no explanation for this and can only speculate that different matching processes take place in the backend. Try modifying the sponsor name in the browser to see the different expansions offered.

ctrOpenSearchPagesInBrowser(url = "spons=Janssen", register = "CTGOV")
ctrOpenSearchPagesInBrowser(url = "spons=Janssen", register = "CTGOV2")

Nevertheless, it is straightforward to generate a list of the set difference, as follows:

dbc <- nodbi::src_sqlite(collection = "temp")
ctgovTrials <- ctrLoadQueryIntoDb(queryterm = "spons=Janssen", register = "CTGOV", con = dbc)
ctgov2Trials <- ctrLoadQueryIntoDb(queryterm = "spons=Janssen", register = "CTGOV2", con = dbc)
trialsSet <- dbGetFieldsIntoDf(c("sponsors.lead_sponsor.agency", "brief_title"), con = dbc)
trialsSet[trialsSet[["_id"]] %in% setdiff(ctgovTrials[["success"]], ctgov2Trials[["success"]]), ]

which returns

# A tibble: 5 × 3
  `_id`       sponsors.lead_sponsor.agency brief_title                                                
  <chr>       <chr>                        <chr>                                                      
1 NCT02135354 Wim Janssens                 Azithromycin for Acute Exacerbations Requiring Hospitaliza…
2 NCT02205242 Wim Janssens                 BACE Trial Substudy 1 - PROactive Substudy                 
3 NCT02205255 Wim Janssens                 BACE Trial Substudy 2 - FarmEc Substudy                    
4 NCT02332122 Wim Janssens                 Detection of Aspergillus Fumigatus and Sensitization in CO…
5 NCT05008081 Wim Janssens                 The CATALINA Study

There you have it: possibly CTGOV uses a partial string match and CTGOV2 matches differently; see e.g. https://clinicaltrials.gov/data-about-studies/search-areas#SponsorSearch
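
To illustrate how a partial match and a whole-word match could lead to different results for the same search term, here is a minimal sketch in base R. It only demonstrates the two matching styles and does not claim to reproduce what either register does in its backend. The second sponsor name is added purely for contrast and is not taken from the results above.

```r
# One lead sponsor from the table above, plus one illustrative name
sponsors <- c("Wim Janssens", "Janssen Research & Development, LLC")

# Partial (substring) match: hits any name that contains "Janssen"
partial <- grepl("Janssen", sponsors, fixed = TRUE)

# Whole-word match: "Janssen" must be delimited by word boundaries,
# so "Janssens" does not qualify
whole_word <- grepl("\\bJanssen\\b", sponsors, perl = TRUE)

data.frame(sponsors, partial, whole_word)
```

Under this assumption, "Wim Janssens" is caught by the partial match only, which would be consistent with the five trials appearing in CTGOV but not in CTGOV2.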

frederikziebell commented 8 months ago

Thanks for the clarification. Btw, I get an error with the latest devel build and your example:

dbc <- nodbi::src_sqlite(collection = "temp")
ctgovTrials <- ctrLoadQueryIntoDb(queryterm = "spons=Janssen", register = "CTGOV", con = dbc)

gives

Not overruling register label CTGOV
* Found search query from CTGOV: spons=Janssen
Checking helper binaries: . . . done
Warning: Database not persisting* Checking trials in CTGOV classic...
Retrieved overview, records of 2352 trial(s) are to be downloaded (estimate: 19 MB)
(1/3) Downloading trial file...
Error in handle_setopt(h, ...) : Unknown option: multiplex

The call to ctrLoadQueryIntoDb() with only.count = TRUE works, so I guess the issue concerns multiplexed downloading.

Should I open a separate issue for that?

rfhb commented 8 months ago

Thanks. Could you please update the R package curl? Version 5.1.0 does not trigger this error; I will specify this requirement.
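
A quick way to check the installed version is sketched below. Treating 5.1.0 as the minimum is an assumption based on the observation above; `curl_needs_update()` is a hypothetical helper, not part of ctrdata or curl.

```r
# Hypothetical helper: TRUE if `installed` is older than `required`.
# The default of "5.1.0" is an assumption, being the version reported
# above as not triggering the error.
curl_needs_update <- function(installed, required = "5.1.0") {
  package_version(installed) < package_version(required)
}

# Only query the installed version if curl is available at all
if (requireNamespace("curl", quietly = TRUE)) {
  installed <- as.character(utils::packageVersion("curl"))
  if (curl_needs_update(installed)) {
    message("curl ", installed, " is older than 5.1.0; ",
            "consider running install.packages(\"curl\")")
  }
}
```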

machado-t commented 8 months ago

Somewhat unrelated, but I'll leave it here for future reference. I was getting this error with CTGOV2:

* Checking trials using CTGOV API 2.0.0.-test...Warning: Error in curl::curl_fetch_memory: Timeout was reached: [www.clinicaltrials.gov] Resolving timed out after 10011 milliseconds ...

which was apparently also solved by updating curl.

Edit: Actually unrelated to the curl update. Not sure why, but I'm getting this sometimes.

rfhb commented 7 months ago

Indeed completely unrelated to ctrdata, possibly a network or server issue.
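
If such timeouts are transient, simply retrying the call may help. The following is a minimal sketch of a generic retry wrapper in base R; `with_retry()` is hypothetical, not part of ctrdata, and the number of attempts and the waiting time are arbitrary.

```r
# Hypothetical helper: call `fun()` up to `attempts` times, waiting `wait`
# seconds after each failed attempt; re-raise the last error if all fail.
with_retry <- function(fun, attempts = 3, wait = 5) {
  stopifnot(attempts >= 1)
  for (i in seq_len(attempts)) {
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", i, " failed: ", conditionMessage(result))
    if (i < attempts) Sys.sleep(wait)
  }
  stop(result)
}

# Usage (requires ctrdata and network access), e.g.:
# with_retry(function() ctrLoadQueryIntoDb(
#   queryterm = "spons=Janssen", register = "CTGOV2", only.count = TRUE
# ))
```

This does not address the cause of the timeout; it only avoids having to rerun the query by hand.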

rfhb / ctrdata

CTGOV2 API call does not filter by study sponsor #32