osmlab / name-suggestion-index

Canonical common brand names, operators, transit and flags for OpenStreetMap.
https://nsi.guide
BSD 3-Clause "New" or "Revised" License

File size optimization #6066

Open Dimitar5555 opened 2 years ago

Dimitar5555 commented 2 years ago

Currently, nsi-id-presets.min.json is about 8 MB, which is a lot. One possible solution is to split it per country. Some information would be duplicated across files, but each file would load much faster for all users. As a side effect, it would reduce editor loading times and speed up brand validation. Such a change would require changes to the build process and to how iD and RapiD handle brands (which is why I'm tagging @tyrasd and @bhousel). Do you support such a change, and would it be too hard to implement? Also, are there any other data consumers that will be affected? If there are, it would be nice to let both "systems" coexist and add a folder in dist where the new per-country method is used.

bhousel commented 2 years ago

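I've thought about spending time on this, but I really think it would be hard to implement, and the gains would be less than you might think. Because we fetch these files from a CDN, the download is a lot smaller and faster than people realize:

* it's actually gzipped, so the size is significantly smaller than the original JSON

* the CDN supports HTTP/3, so it can pipeline these connections

* it's available everywhere worldwide pretty fast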

👇 Here are the numbers I see in the Chrome dev tools today:

[Screenshot: 2022-01-14, 3:35:50 PM]

[Screenshot: 2022-01-14, 3:22:59 PM]

Put another way, these requests finish faster even than we can fetch a screen of imagery tiles, and actually a lot faster than we fetch a tile of OSM data.

bhousel commented 2 years ago

> Also, are there any other data consumers that will be affected? If there are, it would be nice to let both "systems" coexist and add a folder in dist where the new per-country method is used.

Tagging @bryceco too, as we've had some conversations about this. I'm definitely willing to produce data in other formats (like protobuf or SQLite, or chopped up into world regions) if it helps.

bryceco commented 2 years ago

I did write a script to dump the contents into SQLite, but there weren't any savings in total size. I’ll see if I can find the script if you want to double-check my approach.

Dimitar5555 commented 2 years ago

> I've thought about spending time on this, but I really think it would be hard to implement, and the gains would be less than you might think. Because we fetch these files from a CDN, the download is a lot smaller and faster than people realize:
>
> * it's actually gzipped, so the size is significantly smaller than the original JSON
>
> * the CDN supports HTTP/3, so it can pipeline these connections
>
> * it's available everywhere worldwide pretty fast

That's great, but how are we on the validation front? How long does it take to validate 100 objects for branding when all brands are loaded, compared with when only country-specific brands are loaded?

bryceco commented 2 years ago

SQLite script is here: https://github.com/bryceco/GoMap/blob/master/src/presets/nsiSqlite.py

bhousel commented 2 years ago

> That's great, but how are we on the validation front? How long does it take to validate 100 objects for branding when all brands are loaded, compared with when only country-specific brands are loaded?

I'm not sure how to test this, but all of the data goes into the match index and validation is nearly instant once that index is ready.

There is a roughly 1-second stutter when building the index. This occurs after the files have been fetched and the locations have been resolved. When I tested it just now, it happened about 7 seconds into my editing session.

[Screenshot: 2022-01-14, 4:15:14 PM]

(Per #5385, I'd like to find a way to move this work into the background, but it's not too bad currently.)
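
To illustrate why validation can be nearly instant once an index exists, here is a minimal Python sketch of the general idea: pay a one-time cost to build a lookup table, then answer each query with a dictionary lookup. This is not iD's actual matcher; the item fields (`id`, `tag`, `name`) and all function names are hypothetical.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Lowercase and drop non-alphanumerics so "McDonald's" matches "mcdonalds"."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def build_index(items):
    """One-time pass over all items: map (tag, normalized name) -> item ids."""
    index = defaultdict(list)
    for item in items:
        index[(item["tag"], normalize(item["name"]))].append(item["id"])
    return index


def match(index, tag, name):
    """Dictionary lookup; no scan over the full item list."""
    return index.get((tag, normalize(name)), [])


# Hypothetical sample data, not taken from the real index.
items = [
    {"id": "a1", "tag": "amenity/fast_food", "name": "McDonald's"},
    {"id": "b2", "tag": "shop/supermarket", "name": "Lidl"},
]
index = build_index(items)
print(match(index, "amenity/fast_food", "mcdonalds"))  # ['a1']
```

The build step is where a stutter like the one described would show up; each lookup afterwards is independent of how many items were indexed.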

UKChris-osm commented 2 years ago

I've thought about having smaller files and less redundant data downloaded to users for a while, as from my perspective I only ever really edit OSM within the UK, so any non-UK based brands aren't going to get used when I edit.

But in this day and age, I don't think 8 MB is a huge amount of data as-is, and as Bryan has said, the use of gzip, CDNs and HTTP/3 makes the file transfer a pretty quick download, as the compressed file is only around 1 MB.
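
The effect of gzip on this kind of file is easy to reproduce. The sketch below uses synthetic, repetitive JSON as a stand-in (the record layout is made up and is not the real NSI schema), so the exact ratio will differ from the real file.

```python
import gzip
import json

# Synthetic stand-in: many records sharing the same keys and similar values.
records = [
    {"displayName": f"Brand {i}", "tags": {"brand": f"Brand {i}", "shop": "supermarket"}}
    for i in range(5000)
]
raw = json.dumps(records, separators=(",", ":")).encode("utf-8")
packed = gzip.compress(raw)

print(f"raw: {len(raw) / 1024:.0f} KB, gzipped: {len(packed) / 1024:.0f} KB")
```

Repeated keys and tag values compress well, which is why the transferred size can be much smaller than the size on disk.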

However, having noticed the delay in being able to use the NSI data within iD (#5669 & #5790), I do think splitting the files might make the initialisation process within iD faster. But would that saving be lost to the extra processing needed to determine which file should be downloaded in the first place, such as adding user preferences within iD?

I don't really know how iD processes the NSI data and integrates it, but if iD is doing a lot of work after downloading the NSI data, is that processing something we could do on the NSI side first, so iD doesn't have to build a feature set or anything like that?

Another thing I am curious about: is the NSI data in iD one big array / object? So if I type "M", will every entry within the NSI be scanned to see if the letter "M" matches, and if I then type "Mc", would every entry within the NSI be scanned again for a match?

Dimitar5555 commented 2 years ago

> However, having noticed the delay in being able to use the NSI data within iD (#5669 & #5790), I do think splitting the files might make the initialisation process within iD faster. But would that saving be lost to the extra processing needed to determine which file should be downloaded in the first place, such as adding user preferences within iD?

That's why I started this issue (I should've clarified it in the beginning, but better late than never). The problem I'm facing is that when I choose a node on osm.org that lacks brand tagging and then click Edit (to open it in iD), it takes a good 30-40 seconds or more for the validator to detect that the element needs upgrading.

westnordost commented 2 years ago

(Copied over from duplicate ticket #6146)

The presets.json in the distribution is quickly getting too large. Just this year, it doubled in size compared to the beginning of 2021.

I have now noticed some "out of memory" issues just parsing this file on old Android devices (StreetComplete uses the presets.json from name-suggestion-index). At the rate this file is growing, it may soon no longer be worthwhile to include it at all.

I can think of the following solutions:

  1. Change the format in which the presets are distributed to, for example, a SQLite table. After all, this is structured data and not a free-form object graph. Though, at the rate this index is growing, it may only result in a temporary reduction. Also, this will not reduce the size it has in memory.
  2. Split it into several files: one large file that includes the international presets and then, for each country, an additional file with the presets found (only) in that country. Though, I have no overview of how much that would really reduce the main/international preset file. Maybe many brands that are actually only available in one or a few countries are not marked as such (yet). Maybe it would be worth writing a script that looks for those by searching through taginfo/planet.
  3. Include a count/popularity field for each preset so that an automatic build process (of StreetComplete) could take only the X most popular brands.

westnordost commented 2 years ago

On my suggestions for solutions, it seems point 1 can be safely disregarded at this point. As Bryan mentioned, zipped, it's a lot smaller and in the end, this data will reside in memory because of the various indexes that make searching through it fast.

On point 2, I think I remember once seeing some kind of automatically generated list, per country, of names that are candidates for inclusion in the index. E.g. if there are 100 places with the same name and the same main tag, it is likely that this is a brand. A similar thing could be done to find already-added presets that in reality only exist in one or a few countries but haven't been marked as such. This may greatly increase the number of presets that could be separated from the main presets.json.
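
One way to sketch that heuristic: count how often each (main tag, name) pair occurs and record the countries it was seen in. The input rows, their field order, and the threshold of 100 are assumptions for illustration; real input would have to come from something like a taginfo or planet extract.

```python
from collections import Counter, defaultdict


def brand_candidates(features, threshold=100):
    """Return {(tag, name): countries} for pairs seen at least `threshold` times."""
    counts = Counter()
    countries = defaultdict(set)
    for country, tag, name in features:
        counts[(tag, name)] += 1
        countries[(tag, name)].add(country)
    return {
        key: sorted(countries[key])
        for key, n in counts.items()
        if n >= threshold
    }


# Hypothetical rows: (country code, main tag, name).
features = (
    [("de", "shop=bakery", "Example Bakery")] * 120
    + [("at", "shop=bakery", "Example Bakery")] * 5
    + [("fr", "amenity=cafe", "Chez Exemple")] * 3
)
for key, where in brand_candidates(features).items():
    print(key, where)
```

Pairs that turn up in only one or a few countries would be the ones worth checking for a narrower locationSet.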

westnordost commented 2 years ago

I created a small script to pull the presets.json apart. A preset is moved out only if its locationSet contains nothing but include rules; all others remain in the base file.

File Size
presets-us.json 996 KB
presets-de.json 591 KB
presets.json 570 KB
presets-jp.json 416 KB
presets-gb.json 308 KB
presets-fx.json 275 KB
presets-fr.json 260 KB
presets-ca.json 256 KB
presets-cn.json 207 KB
presets-es.json 182 KB
presets-nl.json 182 KB
presets-ru.json 161 KB
presets-br.json 157 KB
presets-it.json 152 KB
presets-tw.json 149 KB
presets-at.json 145 KB
presets-ch.json 136 KB
presets-au.json 136 KB
presets-pl.json 134 KB
presets-be.json 114 KB
presets-in.json 97 KB
presets-no.json 90 KB
presets-us-or.geojson.json 88 KB
presets-se.json 78 KB
presets-cz.json 77 KB
presets-ua.json 73 KB
presets-sk.json 72 KB
presets-ie.json 71 KB
presets-nz.json 70 KB
presets-us-ca.geojson.json 66 KB
presets-by.json 62 KB
presets-mx.json 62 KB
presets-lu.json 60 KB
presets-pt.json 58 KB
presets-ph.json 57 KB
presets-my.json 57 KB
presets-us-tx.geojson.json 57 KB
presets-sa.json 56 KB
presets-fi.json 51 KB
presets-tr.json 49 KB
presets-ar.json 49 KB
presets-th.json 48 KB
presets-dk.json 48 KB
presets-ae.json 47 KB
presets-id.json 46 KB
presets-cl.json 45 KB
presets-sg.json 44 KB
presets-us-wa.geojson.json 44 KB
presets-bg.json 43 KB
presets-kr.json 43 KB
presets-hu.json 42 KB
presets-ro.json 40 KB
presets-gb-eng.json 39 KB
presets-hk.json 38 KB
presets-us-il.geojson.json 36 KB
presets-il.json 36 KB
presets-co.json 35 KB
presets-ca-bc.geojson.json 34 KB
presets-pe.json 34 KB
presets-ca-on.geojson.json 32 KB
presets-gr.json 30 KB
presets-ir.json 29 KB
presets-us-ny.geojson.json 27 KB
presets-us-fl.geojson.json 27 KB
presets-ca-qc.geojson.json 27 KB
presets-de-nw.geojson.json 26 KB
presets-za.json 25 KB
presets-us-va.geojson.json 25 KB
presets-hr.json 25 KB
presets-vn.json 24 KB
presets-de-by.geojson.json 24 KB
presets-kz.json 23 KB
presets-us-oh.geojson.json 23 KB
presets-us-pa.geojson.json 22 KB
presets-de-bw.geojson.json 22 KB
presets-ma.json 21 KB
presets-ec.json 20 KB
presets-us-mi.geojson.json 20 KB
presets-bh.json 19 KB
presets-bo.json 19 KB
presets-us-az.geojson.json 18 KB
presets-kw.json 18 KB
presets-rs.json 18 KB
presets-pk.json 18 KB
presets-tn.json 17 KB
presets-qa.json 17 KB
presets-lt.json 16 KB
presets-us-md.geojson.json 16 KB
presets-lv.json 16 KB
presets-gb-lon.geojson.json 16 KB
presets-ee.json 16 KB
presets-dz.json 15 KB
presets-us-wi.geojson.json 15 KB
presets-au-nsw.geojson.json 15 KB
presets-us-ct.geojson.json 15 KB
presets-gt.json 15 KB
presets-pa.json 15 KB
presets-ve.json 14 KB
presets-ca-ab.geojson.json 14 KB
presets-si.json 14 KB
presets-us-nj.geojson.json 14 KB
presets-us-ga.geojson.json 14 KB
presets-bd.json 14 KB
presets-cr.json 13 KB
presets-gb-sct.json 13 KB
presets-gb-nir.json 13 KB
presets-us-ma.geojson.json 13 KB
presets-us-co.geojson.json 12 KB
presets-om.json 12 KB
presets-lk.json 12 KB
presets-eg.json 12 KB
presets-uk.json 12 KB
presets-ci.json 12 KB
presets-us-mo.geojson.json 12 KB
presets-us-in.geojson.json 12 KB
presets-de-he.geojson.json 12 KB
presets-gb-wls.json 11 KB
presets-us-nv.geojson.json 11 KB
presets-de-rp.geojson.json 11 KB
presets-gh.json 11 KB
presets-us-mn.geojson.json 11 KB
presets-us-ar.geojson.json 11 KB
presets-sv.json 10 KB
presets-de-ni.geojson.json 10 KB
presets-ng.json 10 KB
presets-ba.json 10 KB
presets-us-hi.json 10 KB
presets-hn.json 10 KB
presets-us-ky.geojson.json 10 KB
presets-gb-east-midlands.geojson.json 9 KB
presets-us-de.geojson.json 9 KB
presets-us-ia.geojson.json 9 KB
presets-uy.json 9 KB
presets-cy.json 9 KB
presets-us-al.geojson.json 9 KB
presets-de-sn.geojson.json 9 KB
presets-pr.json 8 KB
presets-us-ok.geojson.json 8 KB
presets-ml.json 8 KB
presets-mo.json 8 KB
presets-new_york_city.geojson.json 8 KB
presets-au-vic.geojson.json 8 KB
presets-us-nh.geojson.json 8 KB
presets-md.json 8 KB
presets-bw.json 8 KB
presets-gb-som.geojson.json 8 KB
presets-na.json 8 KB
presets-is.json 8 KB
presets-us-wv.geojson.json 8 KB
presets-sn.json 8 KB
presets-py.json 8 KB
presets-de-bb.geojson.json 8 KB
presets-us-nc.geojson.json 8 KB
presets-ke.json 8 KB
presets-mm.json 8 KB
presets-ao.json 7 KB
presets-baltimore_and_dc.geojson.json 7 KB
presets-gb-south-west.geojson.json 7 KB
presets-150.json 7 KB
presets-de-sh.geojson.json 7 KB
presets-gb-east-england.geojson.json 7 KB
presets-gg.json 7 KB
presets-gb-south-east-coast.geojson.json 7 KB
presets-tz.json 7 KB
presets-zm.json 7 KB
presets-us-tn.geojson.json 7 KB
presets-do.json 7 KB
presets-jo.json 7 KB
presets-de-be.geojson.json 7 KB
presets-je.json 7 KB
presets-us-sc.geojson.json 7 KB
presets-us-ne.geojson.json 7 KB
presets-us-dc.geojson.json 6 KB
presets-us-id.geojson.json 6 KB
presets-cu.json 6 KB
presets-bj.json 6 KB
presets-de-mv.geojson.json 6 KB
presets-cd.json 6 KB
presets-ug.json 6 KB
presets-ca-nb.geojson.json 6 KB
presets-me.json 6 KB
presets-bf.json 6 KB
presets-au-tas.json 6 KB
presets-mz.json 6 KB
presets-ca-sk.geojson.json 6 KB
presets-us-ks.geojson.json 6 KB
presets-mk.json 6 KB
presets-us-nm.geojson.json 5 KB
presets-de-hh.geojson.json 5 KB
presets-cm.json 5 KB
presets-au-qld.geojson.json 5 KB
presets-tg.json 5 KB
presets-et.json 5 KB
presets-al.json 5 KB
presets-de-st.geojson.json 5 KB
presets-ca-mb.geojson.json 5 KB
presets-gb-west-midlands.geojson.json 5 KB
presets-kh.json 5 KB
presets-rw.json 5 KB
presets-ni.json 5 KB
presets-mt.json 5 KB
presets-am.json 5 KB
presets-ad.json 5 KB
presets-us-la.geojson.json 5 KB
presets-bn.json 5 KB
presets-li.json 5 KB
presets-us-ms.geojson.json 5 KB
presets-de-hb.geojson.json 5 KB
presets-gb-yorkshire.geojson.json 5 KB
presets-gb-north-west.geojson.json 5 KB
presets-ca-ns.geojson.json 4 KB
presets-us-ak.json 4 KB
presets-ye.json 4 KB
presets-mn.json 4 KB
presets-ge.json 4 KB
presets-tt.json 4 KB
presets-kg.json 4 KB
presets-gb-dor.geojson.json 4 KB
presets-nz-can.geojson.json 4 KB
presets-sl.json 4 KB
presets-lb.json 4 KB
presets-uz.json 4 KB
presets-im.json 4 KB
presets-gb-south-central.geojson.json 4 KB
presets-cg.json 4 KB
presets-gb-north-east.geojson.json 4 KB
presets-mu.json 4 KB
presets-de-th.geojson.json 4 KB
presets-us-nd.geojson.json 4 KB
presets-lr.json 4 KB
presets-ga.json 4 KB
presets-us-me.geojson.json 4 KB
presets-re.json 3 KB
presets-us-ri.geojson.json 3 KB
presets-us-ut.geojson.json 3 KB
presets-au-sa.geojson.json 3 KB
presets-ca-nt.geojson.json 3 KB
presets-ca-yt.geojson.json 3 KB
presets-np.json 3 KB
presets-gm.json 3 KB
presets-us-wy.geojson.json 3 KB
presets-eu.json 3 KB
presets-au-wa.geojson.json 3 KB
presets-us-mt.geojson.json 3 KB
presets-sz.json 3 KB
presets-gb-con.geojson.json 3 KB
presets-mv.json 3 KB
presets-bb.json 3 KB
presets-ne.json 3 KB
presets-gb-greater-manchester.geojson.json 3 KB
presets-gb-iow.geojson.json 3 KB
presets-iq.json 3 KB
presets-us-sd.geojson.json 3 KB
presets-mc.json 3 KB
presets-bm.json 3 KB
presets-td.json 3 KB
presets-ls.json 3 KB
presets-sd.json 2 KB
presets-az.json 2 KB
presets-gn.json 2 KB
presets-mg.json 2 KB
presets-peoples_united_bank_ct.geojson.json 2 KB
presets-ss.json 2 KB
presets-us-vt.geojson.json 2 KB
presets-nc.json 2 KB
presets-ly.json 2 KB
presets-pf.json 2 KB
presets-af.json 2 KB
presets-zw.json 2 KB
presets-mr.json 2 KB
presets-dj.json 2 KB
presets-gb-dev.geojson.json 2 KB
presets-bi.json 2 KB
presets-la.json 2 KB
presets-gi.json 2 KB
presets-aw.json 2 KB
presets-gy.json 2 KB
presets-mq.json 2 KB
presets-gp.json 2 KB
presets-southern_nevada.geojson.json 2 KB
presets-khm.json 2 KB
presets-washoe_county.geojson.json 2 KB
presets-sx.json 2 KB
presets-greater_dayton_regional_transit_authority.geojson.json 2 KB
presets-san_luis_obispo_county.geojson.json 2 KB
presets-cuyahoga_county.geojson.json 2 KB
presets-lausd_los_angeles.geojson.json 2 KB
presets-mp.json 2 KB
presets-ie-d.geojson.json 2 KB
presets-cat_hood_river.geojson.json 2 KB
presets-bz.json 2 KB
presets-crimea.json 2 KB
presets-de-sl.geojson.json 2 KB
presets-us-ak.geojson.json 1 KB
presets-mw.json 1 KB
presets-first_state_bank_ne_west.geojson.json 1 KB
presets-bs.json 1 KB
presets-first_bank_carolinas.geojson.json 1 KB
presets-sy.json 1 KB
presets-first_bank_western_us.geojson.json 1 KB
presets-metro_rta.geojson.json 1 KB
presets-gu.json 1 KB
presets-nz-ota.geojson.json 1 KB
presets-830.json 1 KB
presets-au-nt.geojson.json 1 KB
presets-nz-tas.geojson.json 1 KB
presets-tj.json 1 KB
presets-tucson.geojson.json 1 KB
presets-gd.json 1 KB
presets-nz-wgn.geojson.json 1 KB
presets-bq.json 1 KB
presets-cw.json 1 KB
presets-de-bw.json 1 KB
presets-kn.json 1 KB
presets-sc.json 1 KB
presets-nz-auk.geojson.json 1 KB
presets-first_state_bank_ne_east.geojson.json 1 KB
presets-first_state_bank_il.geojson.json 1 KB
presets-first_state_bank_mi.geojson.json 1 KB
presets-first_state_bank_tx.geojson.json 1 KB
presets-xk.json 1 KB
presets-first_state_bank_oh.geojson.json 1 KB
presets-us-ca-sanfrancisco.geojson.json 1 KB
presets-us-ca-sanjose.geojson.json 1 KB
presets-ps.json 1 KB
presets-us-ca-eastbay.geojson.json 1 KB
presets-florida_keys.geojson.json 1 KB
presets-gb-mik.geojson.json 1 KB
presets-151.json 1 KB
presets-london-cycles.geojson.json 1 KB
presets-ms.json 1 KB
presets-miltonkeynes-cycles.geojson.json 1 KB
presets-tk.json 1 KB
presets-id-jw.json 1 KB
presets-pi.json 1 KB
presets-stadtmobil-rhein-neckar.geojson.json 1 KB
presets-jm.json 1 KB
presets-stadtmobil-stuttgart.geojson.json 1 KB
presets-stadtmobil-karlsruhe.geojson.json 1 KB
presets-pg.json 1 KB
presets-stadtmobil-suedbaden.geojson.json 1 KB
presets-stadtmobil-rhein-main.geojson.json 1 KB
presets-stadtmobil-rhein-ruhr.geojson.json 1 KB
presets-fj.json 1 KB
presets-ag.json 1 KB
presets-lc.json 1 KB
presets-stadtmobil-hannover.geojson.json 1 KB
presets-ca-nl.geojson.json 1 KB
presets-ca-nu.geojson.json 1 KB
presets-ca-pe.geojson.json 1 KB
presets-stadtmobil-berlin.geojson.json 1 KB
presets-er.json 1 KB
presets-gf.json 1 KB
presets-mi.json 1 KB
presets-stadtmobil-trier.geojson.json 1 KB
presets-q3336843.json 1 KB
presets-gb-devon-cornwall.geojson.json 1 KB
presets-cv.json 1 KB
presets-gw.json 1 KB
presets-gb-abd.geojson.json 1 KB
presets-ra.json 1 KB
presets-gb-bir.geojson.json 1 KB
presets-gb-abe.geojson.json 1 KB
presets-vi.json 1 KB
presets-deu.json 1 KB
presets-fra.json 1 KB
presets-idn.json 1 KB
presets-aut.json 1 KB
presets-esp.json 1 KB
presets-konsum-leipzig.geojson.json 1 KB
presets-tl.json 1 KB
presets-konsum-dresden.geojson.json 1 KB
presets-ky.json 1 KB
presets-ja.json 1 KB
presets-ai.json 1 KB
presets-foodland_eastern_us.geojson.json 1 KB
presets-039.json 1 KB
presets-155.json 1 KB
presets-ic.json 1 KB
presets-029.json 1 KB
presets-ms,.json 1 KB
presets-so.json 1 KB
presets-fo.json 1 KB
presets-sm.json 1 KB
presets-gq.json 1 KB

In total, that is 9.30 MB, compared with 7.2 MB if everything were in the base file. So there is some repetition, but as stated earlier, these files can be compressed for distribution, so the real difference is smaller.

And most important of all, if you are in France, for example, the presets to load are only 830 KB (the 570 KB base file plus 260 KB for presets-fr.json), as opposed to 7.2 MB. So, this is a huge difference. Even for the US and Germany, for which the most presets exist, the difference is still huge.
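
The arithmetic can be checked against the table above: a client loads the base file plus its country file. The sketch below uses the listed sizes and assumes exactly those two files are loaded per country (regional and sub-country files are ignored).

```python
# Sizes in KB, copied from the table above.
BASE_KB = 570
COUNTRY_KB = {"us": 996, "de": 591, "fr": 260}
SINGLE_FILE_KB = 7.2 * 1024  # the unsplit 7.2 MB file


def load_size_kb(country):
    """Base file plus one country file."""
    return BASE_KB + COUNTRY_KB[country]


for country in COUNTRY_KB:
    size = load_size_kb(country)
    print(f"{country}: {size} KB, {size / SINGLE_FILE_KB:.0%} of the single file")
```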

Edit: The script is here. I further tweaked it to throw away all presets the osmfeature library does not support anyway (those with locationSet = ...geojson etc): https://github.com/streetcomplete/StreetComplete/blob/split-brand-presets/buildSrc/src/main/java/UpdateNsiPresetsTask.kt
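
For readers who do not want to read the Kotlin, here is a minimal Python sketch of the splitting rule described earlier. It is not a port of the linked script. It assumes each preset carries a `locationSet` object with optional `include` and `exclude` lists, treats `001` (the UN M49 code for the world) as "international", and copies a preset listed under several locations into each of their files.

```python
from collections import defaultdict

WORLD = "001"  # UN M49 code for "world"


def split_presets(presets):
    """Split {id: preset} into a base dict and per-location dicts."""
    base = {}
    by_location = defaultdict(dict)
    for preset_id, preset in presets.items():
        location_set = preset.get("locationSet", {})
        include = [str(loc).lower() for loc in location_set.get("include", [])]
        has_exclude = bool(location_set.get("exclude"))
        if not include or has_exclude or WORLD in include:
            base[preset_id] = preset  # worldwide, or not include-only
            continue
        for loc in include:
            by_location[loc][preset_id] = preset  # duplicated per location
    return base, dict(by_location)


# Hypothetical presets for illustration only.
presets = {
    "shop/a": {"locationSet": {"include": ["001"]}},
    "shop/b": {"locationSet": {"include": ["fr", "be"]}},
    "shop/c": {"locationSet": {"include": ["001"], "exclude": ["us"]}},
}
base, by_location = split_presets(presets)
print(sorted(base))         # ['shop/a', 'shop/c']
print(sorted(by_location))  # ['be', 'fr']
```

Copying a preset into every location it lists is one plausible source of the repetition in the totals above.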

westnordost commented 2 years ago

I now implemented

Loading all the presets (normal presets + localization + international brand presets + presets of the country one is in) during startup now takes 0.6s compared to 2.7s before on my phone. So roughly 4 times faster now.

bhousel commented 2 years ago

> Loading all the presets (normal presets + localization + international brand presets + presets of the country one is in) during startup now takes 0.6s compared to 2.7s before on my phone. So roughly 4 times faster now.

Is there anything for us to do on the NSI side?
I do think going forward we should trim down dist/nsi.json to only include the items that actually have wikidata tags. Currently it includes everything, which isn't really how I intended it to be.

westnordost commented 2 years ago

From my end, no, because I already wrote the script to take apart the nsi.json myself. That script contains additional steps (as outlined in my last message) that would probably not be done if the files were already separated this way in the dist folder.