tgve / tgvejs

Turing Geo-Visualisation Engine

backend performance #9

Closed: neon-ninja closed this issue 3 years ago

neon-ninja commented 4 years ago

Hi Layik,

Thanks for making this open source! I had a few questions about your implementation, apologies if these questions would be better suited to the geoplumber repository:

It seems to me that the R backend / geoplumber is used to read geospatial data, convert it to geojson if necessary, and then serve the geojson via an API. However, why not cut out the middleman and have the browser load the geojson directly, from GitHub Pages for example? People who want to reuse this wouldn't need to run a server / write R code for each new dataset, and the geojson could be cached by the browser, compressed, and served over a CDN.

R / plumber / geoplumber operate single-threaded, so if one user is requesting some data that takes a long time to download, it can slow down the experience for other users. This can be mitigated by running multiple Docker containers and load balancing across them, but that seems like over-engineering when there are freely available CDNs that would be more performant.
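To make the single-threaded point concrete, a minimal plumber endpoint of the sort geoplumber wraps might look like the sketch below (my sketch, not your actual code; the file name and route are assumptions):

library(plumber)

#* Serve a GeoJSON file from disk. plumber processes one request at a
#* time, so a single slow client download blocks all other requests
#* until it completes.
#* @serializer contentType list(type = "application/geo+json")
#* @get /api/covid19
function() {
  path <- "covid19.geojson"  # hypothetical data file
  readBin(path, "raw", n = file.info(path)$size)
}

(Run with plumb("api.R")$run(port = 8000).)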

For example, with my wired gigabit connection here in Auckland, New Zealand:

time curl http://eatlas.geoplumber.com/api/covid19 > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 43.2M  100 43.2M    0     0   324k      0  0:02:16  0:02:16 --:--:--  501k

real    2m16.520s
user    0m0.186s
sys 0m0.431s

It takes 2 minutes and 16 seconds to load 43MB from your geoplumber.com server, probably due to the 300ms ping.

Cheers, Nick Young Centre for eResearch, University of Auckland

layik commented 4 years ago

Hello Nick!

Thanks for looking into the eAtlas, and I understand the issue. However, browsers implement CORS strictly (rightly so), and "web servers" are also increasingly adding restrictive HTTP headers, which means querying them from a browser is not always allowed.

This will hopefully be covered in some documentation; remember, the repo still has a "WIP" badge on it. So your contribution here is most welcome.

As for eatlas.geoplumber.com, it is just a play area for me; right now I am just adding different covid19 data samples. The current one, at 43.2M, is not suitable for any production use I believe. I think I did not successfully get Apache to GZIP application/json, though during some other checks I managed to save over 60% of the original size of the data served.
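A quick way to check whether the compression is actually on (a sketch using the httr package; if Apache's mod_deflate were handling application/json we would expect to see "gzip" below):

library(httr)

# Request the endpoint with gzip allowed and inspect the response headers.
r <- GET("http://eatlas.geoplumber.com/api/covid19",
         add_headers(`Accept-Encoding` = "gzip"))
headers(r)[["content-encoding"]]  # "gzip" if Apache compressed the JSON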

Back to the GitHub Pages example: that should be doable from the "Add data" space, which allows the use of URLs.

Does that cover the issue?

neon-ninja commented 4 years ago

Thanks for your reply. I think the CORS issue would be solved by the CDN - https://raw.githack.com/ for example adds access-control-allow-origin: * to all responses. Yes, it seems possible to use the "Add data" modal with this sample URL https://rawcdn.githack.com/datasets/geo-countries/cd9e0635901eac20294a57ee3b3ce0684d5e3f1a/data/countries.geojson - however this seems to only work with an older version of the code for some reason; it doesn't seem to work on eatlas.geoplumber.com at the moment.

Screenshot: [Screenshot from 2020-03-17 08-27-35]

For reference, this 23.2MB geojson loads in 462ms for me. So generally, I would imagine the eAtlas could support multiple possible backends - anything that can serve up geojson. Where possible I would recommend using CDNs for performance reasons. Perhaps adding some CDN examples, or documenting the possibility of alternate backends, might be a good idea?
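Both points are easy to verify from R (a sketch using httr):

library(httr)

url <- "https://rawcdn.githack.com/datasets/geo-countries/cd9e0635901eac20294a57ee3b3ce0684d5e3f1a/data/countries.geojson"

# githack adds a permissive CORS header to every response
r <- HEAD(url)
headers(r)[["access-control-allow-origin"]]  # expect "*"

# and the CDN keeps the 23.2MB download quick
system.time(GET(url))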

layik commented 4 years ago

I thought this worked fine when I last prepared a demo.

It needs an enter key after typing/pasting URLs; you are right that it used to keep checking as the user typed. Should I remove the need for the enter key? Can you run the dev frontend on its own?

The eatlas.geoplumber.com instance does not pull the latest Docker image. Apologies for any confusion there.

Thanks Nick.

layik commented 4 years ago

@neon-ninja now both http://geoplumber.com and the subdomain point to a tiny amount of data. I cannot get the UK breakdown automatically, as it has been locked away by ArcGIS. But it should look decent with the theme branch merged in.

Not quite related to the ticket, I know.

neon-ninja commented 4 years ago

Hi @layik - nice - that 30KB loads way faster (631 ms). I think requiring an enter key is fine - you don't want to be firing off 404s as the user types. Yes, I was able to run the dev frontend on its own with npm start. https://rawcdn.githack.com/datasets/geo-countries/cd9e0635901eac20294a57ee3b3ce0684d5e3f1a/data/countries.geojson seems to work on geoplumber.com and the subdomain now - one of your last few commits must have fixed the problem.

layik commented 4 years ago

Here is something you might like to hear, @neon-ninja, and for anyone else landing here: I now know that the R package data.table can beat even Redis. I will try to publish that in a blog post, but the C code in data.table is literally the fastest thing I have ever touched.

I subset 52M rows in "real" time. It is even faster than generating map tiles. Why? Because I do not need to generate multiple tiles/geojson. Do the geometry once and assemble on the client side.
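As a sketch of the kind of subset I mean (the file and column names here are made up):

library(data.table)

# Read the rows once at startup, then key the table; keyed subsets
# use a binary search rather than a full scan.
dt <- fread("rows.csv")  # hypothetical ~52M row table with lng/lat/area columns
setkey(dt, area)

leeds <- dt[.("Leeds")]  # near-instant keyed subset

# Or filter by the current viewport and let the client assemble it:
view <- dt[lng > -1.8 & lng < -1.4 & lat > 53.7 & lat < 53.9]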

More on this hopefully soon, but R is on another level thanks to data.table.

layik commented 3 years ago

Just a courtesy message to @neon-ninja: quite a lot has been experimented with using R's data.table, including subsetting substantial (large) "SimpleFeature"s without the need for even the sf package. I have also decreased the amount of data the repo sends out of the box. If any particular issue is still there, I am happy to address it.
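Roughly the shape of it, as a sketch (I am assuming geojsonsf for the read step, and the property column is made up):

library(data.table)
library(geojsonsf)  # parses GeoJSON into an sf-shaped data.frame; sf itself is never attached

feats <- geojson_sf("data.geojson")  # geometry arrives as a list column
dt <- as.data.table(feats)
leeds <- dt[name == "Leeds"]         # plain data.table subset on the attributes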