niryariv / opentaba-client


New architecture to support multiple local municipalities #30

Closed - niryariv closed this issue 10 years ago

niryariv commented 10 years ago

@alonisser @florpor @Oreniko

While we were working on Opentaba, I used, for a certain period, a slightly different architecture than the one in use today. I was reminded of it now because it might make our lives simpler when we come to add support for more municipalities.

Basically, the idea rests on two assumptions:

  1. The data changes only once a day
  2. The data, and the queries we run on it, are relatively limited

This leads to a structure in which all the query results (which are all essentially "plans for gush X") are stored as JSON files on the client, like here:

https://github.com/niryariv/opentaba-client/tree/master/gush

The server, for its part, focuses on a once-daily scrape of the MMI (Israel Land Administration) site, from which it generates the files above and pushes them to GitHub. Working through git avoids wasting bandwidth on data that hasn't changed, allows backing up and restoring the data, and so on.
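
For concreteness, a minimal sketch of what that daily job might look like (the file layout, function names and the scraper itself are placeholders, not the existing code):

```python
import json
import subprocess
from pathlib import Path

# Hypothetical sketch of the once-daily job: scrape MMI, write one JSON
# file per gush into the client repo's gush/ directory, then commit and
# push - git only transfers the files that actually changed.

CLIENT_REPO = Path("opentaba-client")  # local clone of the client repo


def scrape_plans_for_gush(gush_id):
    """Placeholder for the real MMI scraper - returns a list of plan dicts."""
    raise NotImplementedError


def daily_update(gush_ids):
    for gush_id in gush_ids:
        plans = scrape_plans_for_gush(gush_id)
        out = CLIENT_REPO / "gush" / "{}.json".format(gush_id)  # layout is illustrative
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(json.dumps(plans, ensure_ascii=False))

    # A no-op commit (nothing changed today) simply fails harmlessly.
    subprocess.call(["git", "-C", str(CLIENT_REPO), "add", "gush"])
    subprocess.call(["git", "-C", str(CLIENT_REPO), "commit", "-m", "daily data update"])
    subprocess.call(["git", "-C", str(CLIENT_REPO), "push"])
```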

Advantages of this approach:

Disadvantages:

The code implementing this system already exists; it's not particularly clean, but it works.

florpor commented 10 years ago

Sounds good! I'm very much in favor of finding a way that doesn't require a Heroku app and a Mongo account for every city (there are currently 33). And it doesn't burden the client any more than the current implementation, since each gush has its own JSON file. Shall we talk about it tomorrow?

By the way, are we speaking Hebrew or English here?

niryariv commented 10 years ago

We usually speak English, since GitHub doesn't handle Hebrew and English in the same sentence well, but because this is a new idea and a long text, I went with Hebrew.

We'll definitely talk about it tomorrow - I wrote this today so that everyone has time to think about the weak points of this approach before we meet.

alonisser commented 10 years ago
  1. I wonder how much the massive download of the gush JSON files burdens the client - quite a lot, in my opinion. Even if it's stored on GitHub, it still downloads everything to the user's browser, which sounds like a really bad idea to me. As mentioned, we should check with YSlow or something similar and see the result.
  2. I don't think using localStorage would solve this either, both because its current browser implementations are synchronous, so it would be a fairly blocking wait, and because it has a 4MB limit per domain. In general, except in Chrome (and even that is an amusing bug in the process of being fixed), the limit applies to the top-level domain and includes its subdomains.
  3. It removes the ability to do filtering on the server (for example, applying the threshold on the number of gushim per plan), since everything is pushed straight out.
  4. FYI, there is a way to provision Heroku apps programmatically, so we could write a site setup script that includes that step. As mentioned earlier in our correspondence, there is also a way to have a large number of apps; it requires a user with a credit card on file, even if the apps themselves are free.

We'll talk..

niryariv commented 10 years ago

I didn't explain that part correctly: we don't download the files to the client; rather, the client requests the relevant file whenever it's needed - exactly like it works against the server today, except that instead of calling Heroku it calls GitHub.

In fact, all the files already exist. If I'm not mistaken, the client can be switched as-is to work against them by enabling this line: https://github.com/niryariv/opentaba-client/blob/master/app.js#L11

(The comment there is from the period when I was working on the code alone and didn't think anyone else would read it :)

florpor commented 10 years ago

There's an API on GitHub for adding/updating/whatever files, so worst case scenario we can write some code to work with it... http://developer.github.com/v3/repos/contents/
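
For reference, a minimal sketch of what updating a file through that API could look like (the repo path, branch and token here are placeholders; note the Contents API needs the current blob SHA when overwriting an existing file):

```python
import base64
import requests

# Illustrative update of a single data file via the GitHub Contents API
# (placeholder path/branch/token; no error handling).
URL = "https://api.github.com/repos/niryariv/opentaba-client/contents/gush/30035.json"
HEADERS = {"Authorization": "token <personal-access-token>"}


def put_file(new_content: bytes, message: str):
    current = requests.get(URL, headers=HEADERS).json()  # fetch the current blob sha
    payload = {
        "message": message,
        "content": base64.b64encode(new_content).decode("ascii"),
        "sha": current["sha"],
        "branch": "master",
    }
    requests.put(URL, json=payload, headers=HEADERS).raise_for_status()
```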

shevron commented 10 years ago

One issue I thought of is how well we can control the Content-Type and Expires headers sent from the server. These may have quite an effect on how well things work for some clients.

For Expires, it's mostly a server-configuration issue and will be controlled by whatever hosting environment we use (GitHub, Heroku, Amazon etc.). Expires is good, but we should try to keep it to a reasonably low value (e.g. 1 or 2 hours). Otherwise caching should be controlled by reasonable Last-Modified or ETag headers and be revalidation-based.

For Content-Type, servers usually decide that based on the file extension (which our API does not have) or on server configuration. A little bit of Googling found this: http://taylor.fausak.me/2012/04/26/serving-atom-feeds-with-github-pages/ which highlights the problem and solution pretty well for GitHub Pages.

So perhaps we should consider changing our routes (or adding alias routes) that add a file extension like .json and .atom to our APIs.

BTW the same blog post above shows an Expires header returned from GitHub of exactly 24 hours after file change. This may be suitable for us but we need to make sure it won't cause issues.
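
For what it's worth, a quick way to check what headers actually come back for one of our files (the URL below is just illustrative - substitute a real gush file):

```python
import requests

# Print the caching / content-type headers GitHub Pages returns for a data file.
resp = requests.head("http://niryariv.github.io/opentaba-client/gush/30035.json")
for name in ("Content-Type", "Expires", "Last-Modified", "ETag", "Cache-Control"):
    print(name, ":", resp.headers.get(name))
```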

niryariv commented 10 years ago

sounds good to me - it makes sense for the files to have .json/.atom extensions even when working with the server the current way.

alonisser commented 10 years ago

@shevron about the Expires date and keeping it reasonably low: from what I know, the usual performance recommendation for static data is the opposite. The question is how static the data really is.. maybe since we use static files as a 'db', and we can't follow the usual practice of changing a static file's name when it changes, we do need a low Expires. Anyway - a 24 hour cycle is fine for us.

+1 for the route change

shevron commented 10 years ago

I started playing around with this a bit (more of an experimental phase for now) and thought of a potential issue: depending on how GitHub Pages stores and serves files, having tens of thousands of files or directories (one per gush_id), as in https://github.com/niryariv/opentaba-client/tree/master/gush, may be very bad for scalability depending on the filesystem used - for example, ext3 is known to be problematic in this case while xfs is not (assuming it's a real filesystem and not some kind of virtual one). Of course this is internal to GitHub Pages and we can't know in advance without testing.

The standard solution for this issue is hashing files into structures such as:

gush/28/28046
gush/28/28047
gush/28/...
gush/29/29501
gush/29/29502
...

and so on based on the first 2 digits of the gush ID. This way we get up to 1000 files per dir, no more. However, it will change the way URLs are constructed for the API, and may be a premature optimization.
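
For illustration, deriving such a hashed path from a gush ID would be trivial (assuming 5-digit IDs, as in the examples above):

```python
def hashed_gush_path(gush_id):
    """Bucket gush files by the first two digits of the ID, so each
    directory holds at most ~1000 files (e.g. 28046 -> gush/28/28046)."""
    s = str(gush_id)
    return "gush/{}/{}".format(s[:2], s)


assert hashed_gush_path(28046) == "gush/28/28046"
```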

Thoughts?

alonisser commented 10 years ago

Since we're not sure the GitHub API even supports this setup (Mor is testing it) and we may fall back to using MongoDB, then yes, I believe this is premature..

niryariv commented 10 years ago

@florpor sounds good to me. another option would be to put the gushim in directories based on municipality, eg:

/data/gushim/jerusalem/30035
/data/gushim/jerusalem/30036
...

@alonisser i don't understand - what are we not sure about, regarding the github api?

alonisser commented 10 years ago

first, you pinged mor instead of shahar..

about your question: the github api seems to be limited, from what mor experimented with, and doesn't let us do some of the things we need in order to keep the site's gushim data updated, etc. better that @florpor update us about this


shevron commented 10 years ago

@niryariv I agree that /data/gushim/<city code>/<gush id> makes more sense than some kind of hashed number prefix. Do you have any idea of the max # of gushim per city? I suppose not more than a few hundred?

@alonisser @florpor where are the github api issues documented? (or are they not?) I assumed that in order to update our data we just need to commit changes to the gh-pages branch or so?

alonisser commented 10 years ago

@shevron - about the number of gushim per city - @florpor has the data

about the github api issues - they're not documented AFAIK; @florpor told me about them the last time we met. the problem is updating the data: fetching from github with the api and then posting back.

niryariv commented 10 years ago

@shevron in JLM we have <500 gushim. that's likely the highest # of gushim in any municipality - even if not, I doubt we'll see anything much bigger

shevron commented 10 years ago

So just to make sure, if we move forward with this, the current documented API urls:

/gush/30035 - get gush info
/gush/30035/plans - get gush plans

Will be changed to something like:

/gush/jerusalem/30035.json - get gush info
/gush/jerusalem/30035/plans.json - get gush plans

Also, we will most likely need an API to get the list of municipalities available, and the list of gushim in each one, so fetching data from us could be automated as well. something like:

/munis.json - get list of municipalities
/gush/jerusalem.json - get list of gushim in a city
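
If we go with that, automated consumption could be a simple crawl over those routes - a sketch, assuming a placeholder base URL and that each endpoint returns a plain JSON list:

```python
import requests

# Illustrative crawl over the proposed static API (base URL is a placeholder,
# and the exact shape of each JSON response is assumed, not defined yet).
BASE = "http://opentaba.info"

for muni in requests.get(BASE + "/munis.json").json():
    for gush_id in requests.get("{}/gush/{}.json".format(BASE, muni)).json():
        plans = requests.get("{}/gush/{}/{}/plans.json".format(BASE, muni, gush_id)).json()
        print(muni, gush_id, len(plans))
```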

niryariv commented 10 years ago

Sounds right.

florpor commented 10 years ago

@shevron about the list of municipalities - it is waiting in my pull request as data/cities.js - https://github.com/florpor/opentaba-client/blob/master/data/cities.js

about my attempts to work with static files stored on github - i've had some trouble acquiring all the data and i'm still working on that.. the batches should be planned (i.e. a muni per hour or something) and errors should be handled better than they are now. i'm not at my computer so i can't report exactly what's going on, but the 'biggest' issue i've had with the github api so far is that it won't let me download files over 2mb in size (which a few of our muni gushim lists are), so for those i download the raw version. i haven't even tried yet to store changes on github... that may pose a problem if the limit applies to uploading files too, because theoretically (dunno if it actually happens right now) the plan list for one gush can surpass 2mb.
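
A hedged sketch of that kind of fallback (the URLs and Accept header usage are illustrative, and it assumes the size limit applies only to the API and not to raw files):

```python
import requests

# Illustrative download with a raw-URL fallback for files the Contents API
# refuses to serve because of their size (repo/paths are placeholders).
API = "https://api.github.com/repos/niryariv/opentaba-client/contents/{path}"
RAW = "https://raw.githubusercontent.com/niryariv/opentaba-client/master/{path}"


def fetch_file(path):
    resp = requests.get(API.format(path=path),
                        headers={"Accept": "application/vnd.github.v3.raw"})
    if resp.ok:
        return resp.content
    # e.g. the ~2MB limit mentioned above - fall back to the raw version
    return requests.get(RAW.format(path=path)).content
```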

@niryariv @shevron i don't think a directory structure based on muni is good, even if currently we can't see a scenario where one muni would have more than 1000 gushim. imagine we one day want to create a concept site (haven't decided what for yet :)) for the entire Gush Dan - merging the data files is easy peasy, but the number of gushim that would have to go into one directory would probably be pretty large..

and finally - why do we even need the 'gush/30035' api? as far as i can see it only gives data that's relevant to the scraper.. or is it meant so others can scrape us more easily?

niryariv commented 10 years ago

@florpor do you use the GitHub API or simply git itself, running on the server and pushing to GitHub from the command line? I had success with the latter (might be a problem w/ Heroku though - but we could switch to some cheap host or the AWS free tier, since in this scheme the server doesn't do very much)

Regarding the muni-based structure and the gush/id API - I'm thinking of features such as drawing gushim in different colors based on the current activity there (e.g. a user could see from the homepage which gushim have plans in the "awaiting approval" stage, or which had recent changes, or we could show a popup on hover with the most recent plans, etc).

For this we'd probably want something like a /jerusalem/gushim.json API call - or maybe jerusalem.opentaba.info/gushim.json - instead of making 400 calls to get each gush, so it makes sense to group them by muni.

However this scheme sucks when one knows the gush ID but not the muni it belongs to, so it's not perfect.

Another possibility is just to leave the API as is and change it only if files-per-dir actually becomes an issue. From this it seems like the ext3 issue stems from it reading 32K of dir data at a time. There are 256 munis in the entire country, so it could take a while till we have that many files.

Let's discuss on Monday if everyone's coming?

niryariv commented 10 years ago

on 2nd thought i just found this github dir with >5000 files (they take the static API approach we're discussing here) and while it works it's a pain to even view with the browser, so not fun for devs who need to work on it..

shevron commented 10 years ago

Following the discussion between @niryariv, @alonisser and myself (you snooze you lose) at Monday's meeting, some things have come to light:

So I will try to document in the next few days a suggestion for a cleaned-up API providing only the things we need right now (and a few near term future ones) which is based as much as possible on the current implementation. I will create a new issue for it in niryariv/opentaba-server and we can discuss it there.

On the architecture side we still need to figure out what to do with the DB. Our most immediate bottleneck is storage limit on the free MongoHQ account. It was suggested to move the server + DB to a free-tier AWS micro instance - I suppose we should decide this soon but I do not see it as a bottleneck for continuing to work assuming the storage problem will be somehow solved, as it does not impact our code design.

@niryariv I suggest closing this thread and opening specific ones with action items.

niryariv commented 10 years ago

Agreed. Closing this as there's no real need for it currently - in the spirit of "the simplest thing that works". Maybe we'll revisit at a later stage..