rfhb / ctrdata

Aggregate and analyse information on clinical trials from public registers
https://rfhb.github.io/ctrdata/

CTGOV2 API call does not filter by study sponsor #32

Closed frederikziebell closed 7 months ago

frederikziebell commented 8 months ago

Consider the following example. With the old API, the filter is respected, whereas with the new one, all studies would be downloaded.

library("tibble")
library("ctrdata")

q1 <- tibble(`query-term` = "spons=Pfizer", `query-register` = "CTGOV")

ctrLoadQueryIntoDb(
  queryterm = q1,
  only.count = TRUE
)$n
# 5639

q2 <- tibble(`query-term` = "spons=Pfizer", `query-register` = "CTGOV2")

ctrLoadQueryIntoDb(
  queryterm = q2,
  only.count = TRUE
)$n
# 470145

rfhb commented 8 months ago

Thanks for reporting, nice catch! Fix now available:

As an aside, in case you did not know and have no particular need for the tibble, this shorter form also works:

library("ctrdata")
ctrLoadQueryIntoDb(
  queryterm = "spons=NameOfSponsor",
  register = "CTGOV2",
  only.count = TRUE
)$n

frederikziebell commented 8 months ago

Thanks, and also for pointing out the shorter syntax; it's working now. For some companies, however, I see differences in the number of returned results between the two registers:

ctrLoadQueryIntoDb(
  queryterm = "spons=Janssen",
  register = "CTGOV",
  only.count = TRUE
)$n
# 2352

ctrLoadQueryIntoDb(
  queryterm = "spons=Janssen",
  register = "CTGOV2",
  only.count = TRUE
)$n
# 2347

But I don't know if that's because the new API accesses the data differently from the CTGOV database, or if it's an issue with ctrdata.

rfhb commented 8 months ago

Thanks. You get the same numbers when opening these search queries in the browser, as below. I have no explanation for this and can only speculate that the two backends use different matching processes. Try modifying the sponsor name in the browser to see the different expansions offered.

ctrOpenSearchPagesInBrowser(url = "spons=Janssen", register = "CTGOV")
ctrOpenSearchPagesInBrowser(url = "spons=Janssen", register = "CTGOV2")

Nevertheless, it is straightforward to generate a list of the set difference, as follows:

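# Database connection using a collection named "temp"
dbc <- nodbi::src_sqlite(collection = "temp")
# Load the trials found by each register into the same collection
ctgovTrials <- ctrLoadQueryIntoDb(queryterm = "spons=Janssen", register = "CTGOV", con = dbc)
ctgov2Trials <- ctrLoadQueryIntoDb(queryterm = "spons=Janssen", register = "CTGOV2", con = dbc)
# Get lead sponsor and brief title for the trials in the collection
trialsSet <- dbGetFieldsIntoDf(c("sponsors.lead_sponsor.agency", "brief_title"), con = dbc)
# Keep only the trials loaded from CTGOV but not from CTGOV2
trialsSet[trialsSet[["_id"]] %in% setdiff(ctgovTrials[["success"]], ctgov2Trials[["success"]]), ]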

which returns

# A tibble: 5 × 3
  `_id`       sponsors.lead_sponsor.agency brief_title                                                
  <chr>       <chr>                        <chr>                                                      
1 NCT02135354 Wim Janssens                 Azithromycin for Acute Exacerbations Requiring Hospitaliza…
2 NCT02205242 Wim Janssens                 BACE Trial Substudy 1 - PROactive Substudy                 
3 NCT02205255 Wim Janssens                 BACE Trial Substudy 2 - FarmEc Substudy                    
4 NCT02332122 Wim Janssens                 Detection of Aspergillus Fumigatus and Sensitization in CO…
5 NCT05008081 Wim Janssens                 The CATALINA Study  

There you have it: possibly CTGOV uses a partial string match, so that "Janssen" also matches "Wim Janssens", whereas CTGOV2 matches differently; see e.g. https://clinicaltrials.gov/data-about-studies/search-areas#SponsorSearch
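To illustrate how two matching rules could produce such a difference, here is a minimal base R sketch. It does not reproduce what either register actually does in its backend, which is unknown. The sponsor names are hypothetical apart from "Wim Janssens", and the two rules (substring match versus whole-word match) are assumptions chosen only for illustration.

```r
# Hypothetical sponsor names; only "Wim Janssens" is taken from the thread
sponsors <- c(
  "Janssen Research & Development, LLC",
  "Janssen-Cilag Ltd.",
  "Wim Janssens",
  "Pfizer"
)

# Assumed rule 1: partial (substring) match, "Janssen" anywhere in the name
partial <- grepl("Janssen", sponsors, fixed = TRUE)

# Assumed rule 2: whole-word match, "Janssen" delimited by word boundaries
whole_word <- grepl("\\bJanssen\\b", sponsors, perl = TRUE)

# Names matched by the substring rule but not by the whole-word rule
only_partial <- sponsors[partial & !whole_word]
print(only_partial)
# [1] "Wim Janssens"
```

Under these assumed rules, only the name in which "Janssen" is part of a longer word is matched by one rule and missed by the other, which is the pattern seen in the five trials listed above.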

frederikziebell commented 8 months ago

Thanks for the clarification. Btw, I get an error with the latest devel build and your example:

dbc <- nodbi::src_sqlite(collection = "temp")
ctgovTrials <- ctrLoadQueryIntoDb(queryterm = "spons=Janssen", register = "CTGOV", con = dbc)

gives

Not overruling register label CTGOV
* Found search query from CTGOV: spons=Janssen
Checking helper binaries: . . . done
Warning: Database not persisting* Checking trials in CTGOV classic...
Retrieved overview, records of 2352 trial(s) are to be downloaded (estimate: 19 MB)
(1/3) Downloading trial file...
Error in handle_setopt(h, ...) : Unknown option: multiplex

The call to ctrLoadQueryIntoDb() with only.count = TRUE works, so I guess the issue concerns multiplexed downloading.

Should I open a separate issue for that?

rfhb commented 8 months ago

Thanks. Could you please update the R package curl? Version 5.1.0 does not trigger this error; I will specify this version requirement.
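For readers who want to check their installation first, a minimal sketch follows. It assumes that 5.1.0 can be treated as a minimum version; the thread only states that 5.1.0 does not trigger the error, not which earlier versions do. `needs_update` is a hypothetical helper and not part of ctrdata or curl.

```r
# Assumption: 5.1.0 is the only curl version named in the thread as not
# triggering the error; it is used here as an illustrative minimum
required_curl <- "5.1.0"

# Hypothetical helper: TRUE if the installed version is older than required
needs_update <- function(installed, required = required_curl) {
  package_version(installed) < package_version(required)
}

if (requireNamespace("curl", quietly = TRUE)) {
  installed_curl <- as.character(utils::packageVersion("curl"))
  if (needs_update(installed_curl)) {
    message("curl ", installed_curl, " is older than ", required_curl,
            "; consider install.packages(\"curl\")")
  } else {
    message("curl ", installed_curl, " meets the assumed minimum")
  }
} else {
  message("Package curl is not installed")
}
```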

machado-t commented 8 months ago

Somewhat unrelated, but I'll leave it here for future reference. I was getting this error with CTGOV2:

* Checking trials using CTGOV API 2.0.0.-test...Warning: Error in curl::curl_fetch_memory: Timeout was reached: [www.clinicaltrials.gov] Resolving timed out after 10011 milliseconds

... which was apparently also solved by updating curl.

Edit: Actually, this is unrelated to the curl update. Not sure why, but I'm still getting this sometimes.

rfhb commented 7 months ago

Indeed completely unrelated to ctrdata, possibly a network or server issue.