pelias / whosonfirst

Importer for Who's on First gazetteer
MIT License
26 stars 42 forks source link

reduce memory consumption? #509

Open missinglink opened 4 years ago

missinglink commented 4 years ago

I was looking over the code today and noticed this block introduced in https://github.com/pelias/whosonfirst/commit/c500aaf3e77e6698a3f5966f2859e574e288e2dd which stores admin records in memory so they can later be used for computing hierarchies.

It seems as though the entire geojson record is being stored in RAM when we probably only need to store the wofRecord.hierarchies path which is used by the hierarchyFinder.js code.

@orangejulius I'm not that familiar, does that sound like a reasonable change to reduce RAM usage?

orangejulius commented 4 years ago

I don't think there are any major wins to be had here. The extractFields pipeline step runs a bit earlier, so it prunes out anything really not needed:

https://github.com/pelias/whosonfirst/blob/master/src/readStream.js#L56

Looking at one of the records as its stored in memory:

{
  id: 85633729,
  name: 'Puerto Rico',
  name_aliases: [
    'Puerto Rico',
    'Commonwealth of Puerto Rico',
    'Puerto Rican',
    'Puertorico'
  ],
  name_langs: {
    am: [ 'ፕዌርቶ ሪኮ' ],
    ar: [ 'بورتوريكو' ],
    ay: [ 'Burinkin' ],
    az: [ 'Puerto Riko' ],
    ba: [ 'Пуэрто-Рико' ],
    be: [ 'Пуэрта-Рыка' ],
    bn: [ 'পুয়ের্তো রিকো' ],
    bs: [ 'Porto Riko' ],
    bg: [ 'Пуерто Рико' ],
    cs: [ 'Portoriko' ],
    ce: [ 'Пуэрто-Рико' ],
    cv: [ 'Пуэрто-Рико' ],
    dv: [ 'ޕުއެރްތޮ ރީކޯ' ],
    el: [ 'Πουέρτο Ρίκο' ],
    en: [ 'Commonwealth of Puerto Rico', 'Puerto Rican', 'Puertorico' ],
    eo: [ 'Porto-Riko' ],
    fo: [ 'Puerto Riko' ],
    fa: [ 'پورتوریکو' ],
    fr: [ 'Porto Rico' ],
    fy: [ 'Puerto Riko' ],
    ga: [ 'Pórtó Ríce' ],
    gl: [ 'Porto Rico' ],
    gv: [ 'Yn Phurt Verçhagh' ],
    gu: [ 'પ્યુએર્ટો રિકો' ],
    ht: [ 'Pòtoriko' ],
    he: [ 'פוארטו ריקו' ],
    hi: [ 'प्युर्तो रिको', 'पोर्टो रीको' ],
    hr: [ 'Portoriko' ],
    hy: [ 'Պուերտո Ռիկո' ],
    io: [ 'Portuo Riko' ],
    ia: [ 'Porto Rico' ],
    id: [ 'Puerto Riko' ],
    is: [ 'Púertó Ríkó' ],
    it: [ 'Porto Rico', 'Portorico' ],
    jv: [ 'Puerto Riko' ],
    ja: [ 'プエルトリコ' ],
    kn: [ 'ಪೋರ್ಟೊ ರಿಕೊ' ],
    ka: [ 'პუერტო-რიკო' ],
    kk: [ 'Пуэрто-Рико' ],
    rw: [ 'Puwerito Riko' ],
    ky: [ 'Пуэрто-Рико' ],
    ko: [ '푸에르토리코' ],
    ku: [ 'Porto Rîko' ],
    la: [ 'Portus Dives' ],
    lv: [ 'Puertoriko' ],
    lt: [ 'Puerto Rikas' ],
    ml: [ 'പോർട്ടോ റിക്കോ' ],
    mr: [ 'पोर्तो रिको' ],
    mk: [ 'Порторико' ],
    mt: [ 'Porto Riku' ],
    ne: [ 'प्युर्तो रिको' ],
    os: [ 'Пуэрто-Рико' ],
    pa: [ 'ਪੁਏਰਤੋ ਰੀਕੋ' ],
    pl: [ 'Portoryko' ],
    pt: [ 'Porto Rico' ],
    qu: [ 'Burinkin' ],
    ru: [ 'Пуэрто-Рико' ],
    si: [ 'පුවටෝ රිකෝ' ],
    sk: [ 'Portoriko' ],
    sl: [ 'Portoriko' ],
    so: [ 'Buerto Riko' ],
    sq: [ 'Portorikoja' ],
    sr: [ 'Порторико' ],
    su: [ 'Puérto Riko' ],
    ta: [ 'புவேர்ட்டோ ரிக்கோ' ],
    tt: [ 'Пуэрто-Рико' ],
    te: [ 'ఫ్యూర్టో రికో' ],
    tg: [ 'Пуэрто Рико' ],
    th: [ 'ปวยร์โตรีโก' ],
    tk: [ 'Puerto-riko' ],
    tr: [ 'Porto Riko' ],
    ug: [ 'پوئېرتو رىكو' ],
    uk: [ 'Пуерто-Рико' ],
    ur: [ 'پورٹو ریکو' ],
    uz: [ 'Puerto-Riko' ],
    vo: [ 'Puertorikeäns' ],
    wa: [ 'Porto Rico' ],
    wo: [ 'Poortorikoo' ],
    yo: [ 'Púẹ́rtò Ríkò' ],
    zh: [ '波多黎各自由邦', '波多黎各' ]
  },
  abbreviation: 'PR',
  place_type: 'dependency',
  lat: 18.234668,
  lon: -66.481065,
  bounding_box: '-67.937815,17.922919,-65.244618,18.522773',
  population: 3971020,
  popularity: undefined,
  hierarchies: [
    {
      continent_id: 102191575,
      dependency_id: 85633729,
      empire_id: 136253057
    }
  ]
}

There's a lot, but its mostly names. While we could prune out some unneeded languages, we still need the names to set the parent fields later on. I think the only field we could truly remove is the bounding_box. Even popularity could possibly be useful to do things in the future like increase the popularity of neighbourhoods in popular cities.

Overall, I don't believe memory usage of this importer is much of an issue right now, and I don't see any super compelling benefits to trying to prune out stuff we don't need.

missinglink commented 4 years ago

Agh ok, that makes sense, I didn't realize the name fields were also required but makes sense once you mention it.

At some point we can probably migrate this code to use the ancestors and names tables of the SQLite db.

I'm going to mark this as help wanted, we would be happy to merge a PR to that effect :rocket: