lawremi / rtracklayer

R interface to genome annotation files and the UCSC genome browser
Other
26 stars 16 forks source link

Use curl::curl_unescape() instead of utils::URLdecode() #116

Closed hpages closed 3 months ago

hpages commented 3 months ago

This fixes a serious performance regression introduced by commit 81ca4c738c75fbefa3011007199da6f10928305b (part of my previous PR, PR #112) where I replaced the call to RCurl::curlUnescape() with a call to utils::URLdecode() in this line: https://github.com/lawremi/rtracklayer/blob/6bbc5509d511cbb7333c712c67a4f782584e265c/R/web.R#L55

Unfortunately the latter is about 100x slower than the former! :disappointed:

This drastically impacts the speed of readGFF() when used on a big GFF file like the one used in this example. With rtracklayer 1.63.1:

gff <- "GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz"
readGFF(gff)

takes more than 10 min!

This PR replaces all calls to utils::URLdecode() with calls to curl::curl_unescape().

Note that performance of curl::curl_unescape() is similar to that of RCurl::curlUnescape(). With rtracklayer 1.63.2:

readGFF(gff)

takes < 1 min, like it did with versions of rtracklayer prior to 1.63.1.

Thanks, H.

P.S.: Note that rtracklayer also uses utils::URLencode() in various places. However these calls were around before PR #112 and the PR didn't touch that so I'm leaving this alone. However, it might be worth considering replacing these calls with calls to curl::curl_escape() at some point.

hpages commented 3 months ago

@lawremi @sanchit-saini Can someone take a look at this? Let me know if there's anything I need to change or if you have any question. Thanks!

lawremi commented 3 months ago

Thanks, Hervé!