ropensci / crul

R6 based http client for R (for developers)
https://docs.ropensci.org/crul

Add support for pagination for Async calls #160

Closed moldach closed 3 years ago

moldach commented 3 years ago

Hoping you can help me try to troubleshoot an error I'm running into.

When I make a standard curl request to the API from the command line, I see that there are many pages for TNF-alpha:

curl -X GET "https://api.targetsafety.info/api/target/alerts/param?uniprotid=P01375&page=1&token=[MyPrivateKey]" > tnf_alpha.json

The resulting .json file shows that there are 17 pages (this request returns only page 1 of 17):

{
  "page": 1,
  "numberPages": 17,
  "targets": [
    {
      "target_id": 348,
      "target_name": "TNF-alpha",
      "actions": [
        {
          "action_id": 15,
          "action_name": "Inhibitors",
          "alerts": [
            {
              "affected_system_id": 10005329,
              "affected_system": "blood and lymphatic system disorders",
              "adverse_event_id": 10043554,
              "adverse_event": "thrombocytopenia",
              "ref_id": 95580,
...

I'm making hundreds of thousands of API calls with AsyncQueue(), and each of them will have a different number of pages.

How is it possible to crawl each of these pages using AsyncQueue()?

Currently only the first of x pages is returned (note: dropping &page=1 from the URL results in a broken API call; the exact page number must be specified).

However, the documentation currently has the following note about Paginator:

Paginator is a class for automating pagination. It requires an instance of HttpClient as it’s first parameter. It does not handle asynchronous requests at this time, but may in the future. Paginator may be the right class to use when you don’t know the total number of results. Beware however, that if there are A LOT of results (and a lot depends on your internet speed and the server response time) the requests may take a long time to finish - just plan wisely to fit your needs.

If this isn't supported yet it would be greatly welcomed in the near future 😁

sckott commented 3 years ago

Thanks for the bump, will try to get this done

sckott commented 3 years ago

I think this would only work for async when we can construct URLs ahead of time, because the whole point of async requests is to send off a bunch of requests at the same time. Thus, it can't work if you have to do request A to get the information to do request B. It should work if, e.g., you know you want 1000 results and you know the pagination query param names.
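As a language-agnostic illustration of "constructing URLs ahead of time", here is a minimal sketch in Python (not R, and not crul's API). It assumes the page count is already known. The function name `build_page_urls` and the `MY_TOKEN` placeholder are hypothetical; the endpoint and the `uniprotid`, `page`, and `token` parameter names come from the curl request earlier in this thread.

```python
from urllib.parse import urlencode

# Endpoint taken from the curl request in this thread.
BASE_URL = "https://api.targetsafety.info/api/target/alerts/param"


def build_page_urls(uniprotid, n_pages, token):
    """Return one URL per page (1..n_pages) without making any request."""
    if n_pages < 1:
        raise ValueError("n_pages must be at least 1")
    return [
        BASE_URL + "?" + urlencode({"uniprotid": uniprotid, "page": page, "token": token})
        for page in range(1, n_pages + 1)
    ]


# Hypothetical usage: 17 pages, placeholder token.
urls = build_page_urls("P01375", 17, "MY_TOKEN")
print(len(urls))
print(urls[0])
```

Because every URL exists before any request is sent, the whole list could be handed to an async client in one go.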

moldach commented 3 years ago

Yeah, it looks like it would need to be a two-step process then, since we don't know the range of values to give for page=<n>, where n is unknown:

Alerts Information Uniprot - Get target alerts by uniprotid

GET
https://api.targetsafety.info/api/target/alerts/param?

Parameter

Field | Type | Description
-- | -- | --
uniprotid | String | [required] Uniprot ID
page | Number | [required] Page number
token | String | [required] Token
adverse_event_id | Number | [optional] Adverse Event ID
ref_source_type | String | [optional] Reference source type
alert_type | String | [optional] Alert Type
alert_phase | String | [optional] Alert Phase
alert_level_evidence | String | [optional] Level of evidence
alert_onoff_target | String | [optional] On/Off target
alert_severity | String | [optional] Alert Severity
alert_species | String | [optional] Alert Species
alert_date_from | String | [optional] Date from
alert_date_to | String | [optional] Date to
order_by_date | String | [optional] Order by date. Only one ordering is possible
order_by_adv | String | [optional] Order by adverse event. Only one ordering is possible

Success 200

Field | Type | Description
-- | -- | --
page | Number | Page
numberPages | Number | Number pages
target_id | Number | Target ID
target_name | String | Target name
action_id | Number | Action ID
action_name | String | Action name
alerts | Object[] | Alerts
affected_system_id | Number | Affected system ID
affected_system | String | Affected system
adverse_event_id | Number | Adverse event ID
adverse_event | String | Adverse event
ref_id | Number | Reference ID
ref_source_type | String | Reference source type
ref_title | String | Reference title
ref_citation | String | Reference citation
ref_pubmed_id | Number | PubMed ID
ref_link | String | Reference link
ref_date | String | Reference date
alert_detail_id | Number | Alert detail ID
alert_title | String | Alert title
alert_date | String | Alert date
alert_genetic_study_variant | String | Alert genetic study variant
alert_type | String | Alert type
alert_phase | String | Alert phase
alert_onoff_target | String | On/Off target
alert_level_evidence | String | Level of evidence
alert_severity | String | Alert severity
alert_species | String | Alert species
drugs | Object | Drugs related
drug_id | Number | Drug ID
drug_name | String | Drug name

{
    "page": 1,
    "numberPages": 11,
    "targets": [
        {
            "target_id": 158,
            "target_name": "SGLT2",
            "actions": [
                {
                    "action_id": 2,
                    "action_name": "Activators",
                    "alerts": [
                        {
                            "affected_system_id": 10015919,
                            "affected_system": "eye disorders",
                            "adverse_event_id": 10015916,
                            "adverse_event": "eye disorder",
                            "ref_id": 62872,
                            "ref_source_type": "Journal",
                            "ref_title": "Leveraging Human Genetics to Identify Safety Signals Prior to Drug Marketing Approval and Clinical Use",
                            "ref_citation": "Drug Saf 2020 Feb 28",
                            "ref_pubmed_id": "32112228",
                            "ref_link": null,
                            "ref_date": "2020-02-28",
                            "alert_detail_id": 662867,
                            "alert_title": "Phenome-wide association study identifying human gene mutations that could be used for in silico prediction of potential adverse drug effects. Results revealed 8 positive associations correlating gene mutation phenotypes with known safety signals from drugs targeting the protein. These associations were PCSK9 (spina bifida), TNF-alpha (cellulitis and leg abscess), PPARgamma (obesity), estrogen receptor-alpha (hemorrhages), ACE (congenital urinary anomalies), phospholipase A2 (primary hypercoagulable state), GluN2B (symbolic dysfunction) and GluN2A (paroxysmal tachycardia, pulmonary heart disease and sleep disorders). Other safety issues are listed.",
                            "alert_date": "2020-03-11",
                            "alert_genetic_study_variant": "gain-of-function mutation",
                            "alert_type": "Class Alert",
                            "alert_phase": "Target Discovery",
                            "alert_onoff_target": "On-Target",
                            "alert_level_evidence": "Suspected",
                            "alert_severity": "no",
                            "alert_species": "human",
                            "drugs": []
                        }
                    ]
                }
            ]
        }
    ]
}

Only upon a successful API call (Success 200) would we get n from numberPages. So with a bit more effort we could grep numberPages from each successful API call and then construct these URLs ahead of time.
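The two-step idea can be sketched as follows, again in Python purely for illustration (this is not crul's API). Phase 1 fetches page 1 for every ID concurrently; phase 2 reads `numberPages` from each response and fetches pages 2..n. The helper names (`page_url`, `remaining_page_urls`, `crawl`, `fake_fetch`), the `Q00000` ID, and the `MY_TOKEN` placeholder are all hypothetical. `fake_fetch` is an offline stand-in for a real HTTP GET so that the sketch runs without network access or a token.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode

# Endpoint taken from the API description in this thread.
BASE_URL = "https://api.targetsafety.info/api/target/alerts/param"


def page_url(uniprotid, page, token):
    """Build the URL for a single page of results."""
    return BASE_URL + "?" + urlencode({"uniprotid": uniprotid, "page": page, "token": token})


def remaining_page_urls(uniprotid, first_page_body, token):
    """Read numberPages from a page-1 JSON body and build URLs for pages 2..n."""
    n_pages = int(json.loads(first_page_body)["numberPages"])
    return [page_url(uniprotid, p, token) for p in range(2, n_pages + 1)]


def crawl(uniprotids, token, fetch, max_workers=8):
    """Two-phase crawl; fetch(url) must return the response body as text."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Phase 1: request page 1 for every ID concurrently.
        first_urls = [page_url(u, 1, token) for u in uniprotids]
        first_bodies = list(pool.map(fetch, first_urls))
        # Phase 2: numberPages is now known, so build and fetch the rest.
        rest_urls = [
            url
            for u, body in zip(uniprotids, first_bodies)
            for url in remaining_page_urls(u, body, token)
        ]
        rest_bodies = list(pool.map(fetch, rest_urls))
    return first_bodies + rest_bodies


def fake_fetch(url):
    """Offline stand-in for an HTTP GET: pretends P01375 has 3 pages, others 1."""
    n_pages = 3 if "uniprotid=P01375" in url else 1
    return json.dumps({"numberPages": n_pages, "targets": []})


# Q00000 is a made-up ID used only to show an ID with a single page.
bodies = crawl(["P01375", "Q00000"], "MY_TOKEN", fake_fetch)
print(len(bodies))  # 2 first pages + 2 remaining pages for P01375
```

Within each phase all URLs are known up front, so each phase can be sent as one async batch; only the hand-off between the two phases is sequential.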

Closing this issue since I asked them to provide us with a bulk data download instead... 💁🏼