infolab-csail / wikithingsdb

A DB of Synonyms, Paraphrases, and Hypernyms for all Wiki Things (Articles)

Add optional kwarg to limit number of articles returned by articles_of_class #16

Closed alvaromorales closed 8 years ago

alvaromorales commented 8 years ago

We can use WikithingsDB to get a list of articles with a certain class. For example:

from wikithingsdb.fetch import articles_of_class
articles_of_class('bridge')

This query returns every matching article and may take a long time. It would be nice to add an optional kwarg that limits the number of articles returned. For example:

articles_of_class('bridge', limit=10)

cc @TheRealAkhil

michaelsilver commented 8 years ago

Sure, why not, that could be added. But, for the record, articles_of_class used to be a fast function (see issue #11).

alvaromorales commented 8 years ago

It takes ~4 minutes to fetch the 39,768 articles with class officeholder.

I don't think this is a performance issue anymore – it's just that Wikipedia is huge. If I remember correctly, WikithingsDB used to be fast when it only had 678 articles.

michaelsilver commented 8 years ago

Looks like I might need to rebuild the database after adding lazy='dynamic' to the relationships if I want to limit the number of pages returned through WikiClass.page. I'm trying to do something like this:

        result = session.query(WikiClass)\
                        .filter_by(class_name=w_class)\
                        .one()\
                        .page\
                        .limit(limit)

See http://stackoverflow.com/a/19233187 and http://stackoverflow.com/a/11579347. Otherwise, I get an error like: 'InstrumentedList' object has no attribute 'limit'.
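
For reference, here is a minimal, self-contained sketch of the lazy='dynamic' idea against an in-memory SQLite database. The models, table names, the one-to-many layout, and the session argument are assumptions made up for illustration, not the actual WikithingsDB schema. As far as I can tell, lazy='dynamic' only changes how the ORM loads the relationship, so .page becomes a query object that accepts .limit() instead of an InstrumentedList.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import relationship, sessionmaker

try:  # SQLAlchemy 1.4+
    from sqlalchemy.orm import declarative_base
except ImportError:  # older releases
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class WikiClass(Base):
    # Hypothetical stand-in for the real WikiClass model.
    __tablename__ = "classes"
    id = Column(Integer, primary_key=True)
    class_name = Column(String, unique=True)
    # lazy="dynamic" makes .page a query object rather than a loaded list.
    page = relationship("Page", lazy="dynamic")


class Page(Base):
    # Hypothetical stand-in for the article/page model.
    __tablename__ = "pages"
    id = Column(Integer, primary_key=True)
    title = Column(String)
    class_id = Column(Integer, ForeignKey("classes.id"))


def articles_of_class(session, w_class, limit=None):
    """Return page titles for a class, optionally capped at `limit` rows."""
    pages = session.query(WikiClass).filter_by(class_name=w_class).one().page
    pages = pages.order_by(Page.id)
    if limit is not None:
        pages = pages.limit(limit)  # emitted as LIMIT in the SQL
    return [p.title for p in pages]


# Build a throwaway database with 25 pages of class "bridge".
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
bridge = WikiClass(class_name="bridge")
session.add(bridge)
session.flush()  # assigns bridge.id
session.add_all(
    [Page(title="Bridge %d" % i, class_id=bridge.id) for i in range(25)]
)
session.commit()

limited = articles_of_class(session, "bridge", limit=10)
everything = articles_of_class(session, "bridge")
```

With limit=None the function behaves as before and returns every page, so existing callers would not need to change.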

@alvaromorales do you see any obvious ways around this?

alvaromorales commented 8 years ago

Using the limit function is a clean way to do this. You shouldn't need to rebuild the database, you just need to run a migration (schema change). Alembic seems to be the tool of choice.

You're using a list comprehension to return articles in articles_of_class. I thought we could just use enumerate in a for loop to limit the number of articles returned. But SQLAlchemy still executes the full query against the database, fetches every matching article (e.g. all presidents), and only then is the list truncated to 10. It's still slow -- we need limit.
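
To illustrate the difference, here is a small sketch that uses Python's built-in sqlite3 module rather than SQLAlchemy, so it runs without our database. The table, column, and function names are made up for the example. Both functions return 10 titles, but the first one pulls every matching row out of the database before truncating, while the second lets the database stop after 10 rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, class_name TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [("President %d" % i, "president") for i in range(1000)],
)


def first_n_truncate(conn, w_class, n):
    """Fetch all matching rows, then truncate in Python."""
    rows = conn.execute(
        "SELECT title FROM articles WHERE class_name = ?", (w_class,)
    ).fetchall()
    titles = [title for (title,) in rows][:n]
    return titles, len(rows)  # len(rows) = rows actually fetched


def first_n_limit(conn, w_class, n):
    """Let the database apply the limit."""
    rows = conn.execute(
        "SELECT title FROM articles WHERE class_name = ? LIMIT ?",
        (w_class, n),
    ).fetchall()
    titles = [title for (title,) in rows]
    return titles, len(rows)


truncated, fetched_truncate = first_n_truncate(conn, "president", 10)
limited, fetched_limit = first_n_limit(conn, "president", 10)
print(fetched_truncate, fetched_limit)
```

The second number printed is the one we want articles_of_class to pay for when a limit is given.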