Open sweetiepiggy opened 12 years ago
Been thinking about this for some time; this is truly the last big nontrivial problem to solve. There is an old solution involving a self-referencing database, but that is ugly because 1) it means we would have to maintain our own backend, and 2) it is hard for search engines to index.
What is currently stopping me: 1) the scraper — does this mean we scrape twice? 2) the database stores everything twice, 3) the search engine indexes everything twice. And then, how do the two versions link to each other?
And that is just the content we scrape. What about the static parts of the page? Those would need a gettext-based infrastructure, which bottle is extremely bad at.
I have some ideas how, but how to do it nicely without breaking things......
One more thing: we never actually scraped the Malay version.
I was thinking that you could just add a few columns to the existing database.
```diff
--- a/billwatcher/models.py
+++ b/billwatcher/models.py
@@ -37,6 +37,7 @@ class Bill(Mixin, Base):
     id = Column(Integer, autoincrement=True, primary_key=True)
     name = Column(String)
     long_name = Column(String)
+    ms_long_name = Column(String)
     bill_revs = relationship('BillRevision', backref='bill',
                              order_by='desc(BillRevision.year)')
@@ -46,7 +47,9 @@ class BillRevision(Mixin, Base):
     id = Column(Integer, autoincrement=True, primary_key=True)
     url = Column(String)
+    ms_url = Column(String)
     status = Column(String)
+    ms_status = Column(String)
     year = Column(Integer)
     read_by = Column(String)
     supported_by = Column(String)
```
It could be used like this:
```diff
--- a/billwatcher/pages.py
+++ b/billwatcher/pages.py
@@ -71,12 +71,15 @@ def feed():
     _title = bill.long_name
     _description = "year: %s\n" \
                    "status: %s\n" \
+                   "ms_status: %s\n" \
                    "url: %s\n" \
+                   "ms_url: %s\n" \
                    "name: %s\n" \
+                   "ms_long_name: %s\n" \
                    "read_by: %s\n" \
                    "supported_by: %s\n" \
                    "date_presented: %s" % \
-                   (_rev.year, _rev.status, _rev.url, bill.name, _rev.read_by,
+                   (_rev.year, _rev.status, _rev.ms_status, _rev.url, _rev.ms_url, bill.name, bill.ms_long_name, _rev.read_by,
                     _rev.supported_by, _rev.date_presented)
     _link = prefix + '/detail/%s/' % (_rev.id)
     _pubDate = _rev.update_date
```
Something like this to do the localization:
```python
localized_long_name = long_name if use_english else ms_long_name
# use localized_long_name wherever long_name is used currently
```
You would need to scrape twice. I was thinking you could scrape the English page once to get the initial data, then scrape the Malay page and update the existing rows with the new ms_* columns above. Not sure how difficult this would be or if I'm missing / misunderstanding anything ... I definitely don't understand the complete process, so maybe I'm oversimplifying.
```diff
--- a/billwatcher/loader.py
+++ b/billwatcher/loader.py
@@ -113,6 +113,12 @@ def load_data():
             print message % (bill.long_name, rev.year, url)
     session.commit()
+    ms_bills = load_ms_page()
+    for b in ms_bills:
+        # get bill name, ms_long_name, ms_*, etc.
+        # find existing English bill with same name
+        # update ms_* columns in existing bill
+
```
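The commented-out merge step might look roughly like this. This is only a sketch of the matching logic over plain dicts; in the real loader these would be `Bill` rows fetched through the session, and `merge_ms_bills` plus the field names it reads from the Malay scrape are assumed, not existing code:

```python
def merge_ms_bills(en_bills, ms_bills):
    """Copy Malay fields onto English bills, matched by bill name.

    en_bills / ms_bills are lists of dicts; 'name' is the shared key
    (the identifier Parliament uses for a bill in both languages).
    """
    by_name = {b['name']: b for b in en_bills}
    for ms in ms_bills:
        bill = by_name.get(ms['name'])
        if bill is None:
            continue  # Malay page lists a bill the English scrape missed
        # store the Malay values under the ms_* columns from the diff above
        bill['ms_long_name'] = ms['long_name']
        bill['ms_status'] = ms['status']
        bill['ms_url'] = ms['url']
    return en_bills
```

The same name-keyed lookup would work with a `session.query(Bill).filter_by(name=...)` in the SQLAlchemy version.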
I will not be doing it that way la. What I would do is keep BM and English as separate documents. Add a few columns: I think the name is the key that links the two, plus a new column, language.
This way, we will not change the columns in the SQLite database that much. And if the technology matures enough, we can support other languages too. Another bonus: when we search on Elasticsearch we get the ID, and with the name (which is the ID Parliament uses) we can fetch a different language by passing a language key and pointing at the language that we want.
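The name + language lookup described above could look like this. A minimal sketch using sqlite3 directly; the table and column names here are illustrative, not the real billwatcher schema:

```python
import sqlite3


def get_bill(conn, name, language='en'):
    """Fetch a bill row in the requested language, keyed by name.

    Falls back to the English row when no translation exists, so an
    unsupported language key still renders something.
    """
    row = conn.execute(
        "SELECT name, long_name, language FROM bill "
        "WHERE name = ? AND language = ?",
        (name, language),
    ).fetchone()
    if row is None and language != 'en':
        return get_bill(conn, name, 'en')
    return row
```

The same shape carries over to an Elasticsearch query: filter on the `name` field plus a `language` term.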
Another bonus of using a language key is that the scraper doesn't need to change much: just decide which run is for which language, and save that as the key.
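The "same scraper, different run" idea can be sketched like this. `scrape_run`, `fetch`, and the page URLs are all placeholders for the real scraper, assumed for illustration:

```python
# One URL per language; these are illustrative, not the real endpoints.
PAGES = {
    'en': 'http://example.invalid/bills?lang=en',
    'ms': 'http://example.invalid/bills?lang=ms',
}


def scrape_run(language, fetch):
    """Run one scrape for one language.

    fetch(url) stands in for the existing download-and-parse step and
    returns a list of bill dicts. The only change per run is stamping
    each record with the language key before saving.
    """
    records = []
    for bill in fetch(PAGES[language]):
        bill['language'] = language  # the key that ties the runs together
        records.append(bill)
    return records
```

Running it once with `'en'` and once with `'ms'` would produce two sets of rows sharing the same `name` values.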
Then we need to do the template, which is a bit hard; we'd need to manipulate the dictionary a bit....
I think a better solution is to add one more column, language, and possibly one more for the key. I don't think I need that, but the term "name" is a bit confusing.
Then add a UI switch that stores the language key in the session, which decides which row to pick.
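The UI switch might work something like this. A sketch with the session modelled as a plain dict (in bottle it would live in a cookie or a session plugin); all names here are illustrative:

```python
SUPPORTED = ('en', 'ms')


def set_language(session, lang):
    """Store the user's language key, falling back to English."""
    session['lang'] = lang if lang in SUPPORTED else 'en'


def pick_fields(session, rows):
    """Pick the template fields for the session's language.

    rows maps a language key to that language's dict of fields
    (long_name, status, url, ...); English is the fallback.
    """
    lang = session.get('lang', 'en')
    return rows.get(lang) or rows['en']
```

The template itself then stays language-agnostic: it only ever sees the dict that `pick_fields` hands it.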
The template is the part I am still thinking about; tell me if you have an idea.
The webpage being scraped is available in English and Malay. It would be nice if the RSS feed could be updated to contain malay_long_name, malay_status, etc.