infolab-csail / wikithingsdb

A DB of Synonyms, Paraphrases, and Hypernyms for all Wiki Things (Articles)

Fetching is very slow #11

Open michaelsilver opened 8 years ago

michaelsilver commented 8 years ago

Unclear whether this is related to PR #10, but fetching takes a really long time. The test suite takes 1703.845s (~30 min) to complete, which is prohibitively slow.

Oddly, the behavior is the opposite of what it was before: fetch.redirects_of_article() now appears to run the fastest, and all other fetches take much longer.

Alvaro further notices:

This query is super slow:

mysql> select page_title from page where page_title="Bill Clinton";

The page_title column is not indexed. We basically need to add a bunch of indexes, or try to use ids as filters. More work for later!
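To illustrate the effect of indexing page_title, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the project's MySQL database, and the table contents and the index name idx_page_title are assumptions made for the example, not the real schema. It compares the query plan for the title lookup before and after the index is created.

```python
import sqlite3

# In-memory SQLite stands in for the project's MySQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT)")
conn.executemany(
    "INSERT INTO page (page_title) VALUES (?)",
    [("Bill Clinton",), ("Brooklyn Bridge",), ("Malta",)],
)

QUERY = "SELECT page_title FROM page WHERE page_title = ?"


def query_plan(connection, sql, params):
    """Return SQLite's plan description for a query as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is the human-readable detail.
    return " | ".join(row[-1] for row in rows)


# Without an index, the lookup has to scan the whole table.
plan_before = query_plan(conn, QUERY, ("Bill Clinton",))

# Hypothetical index name; the real schema may call for a different one.
conn.execute("CREATE INDEX idx_page_title ON page (page_title)")

# With the index, the lookup can search the index instead.
plan_after = query_plan(conn, QUERY, ("Bill Clinton",))

print(plan_before)
print(plan_after)
```

The exact plan wording depends on the SQLite version, but the first plan should report a table scan and the second should mention the index. In MySQL the same check would be done with EXPLAIN on the query.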

alvaromorales commented 8 years ago

To fix this:

michaelsilver commented 8 years ago

Thanks for taking this up! One additional potentially relevant diagnostic tidbit is that the order of fetched hypernyms is now different since PR #10. For example:

>>> from wikithingsdb import fetch
>>> fetch.hypernyms_of_article('Brooklyn Bridge')
{'bridge': ['thing', 'infrastructure', 'architectural structure', 'place', 'route of transportation', 'bridge'], 'nrhp': ['thing', 'architectural structure', 'place', 'building']}
>>> fetch.hypernyms_of_class('bridge')
['thing', 'infrastructure', 'architectural structure', 'place', 'route of transportation', 'bridge']

Previously, hypernyms were returned in ascending order (which is preferable), the same order in which defexpand returns them:

>>> from defexpand import infoclass
>>> ontology = infoclass.get_info_ontology()
>>> ontology.classes_above_infobox('bridge')
['Bridge', 'RouteOfTransportation', 'Infrastructure', 'ArchitecturalStructure', 'Place', 'owl:Thing']

Any idea why this changed?

EDIT: Is this related to the changes in #10? I can keep that PR open just in case. Let me know.

alvaromorales commented 8 years ago

Rows in a database are not stored in order. There are no guarantees about the sort order of a query. Maybe you could add a sort_classes() function to defexpand.
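A rough sketch of what such a sort_classes() function could look like, assuming it is given the ontology order that classes_above_infobox() returns. The normalize() helper and the name-matching rule (drop the owl: prefix, split CamelCase, lowercase) are assumptions based only on the outputs shown earlier in this thread, not on defexpand's actual code.

```python
import re


def normalize(class_name):
    """Map an ontology name such as 'RouteOfTransportation' or 'owl:Thing'
    to the lowercase, space-separated form seen in the fetch output."""
    name = class_name.split(":")[-1]  # drop a prefix such as 'owl:'
    name = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)  # split CamelCase
    return name.lower()


def sort_classes(hypernyms, ontology_order):
    """Sort hypernyms by their position in ontology_order.

    Names missing from ontology_order keep their relative order at the end.
    """
    rank = {normalize(name): i for i, name in enumerate(ontology_order)}
    return sorted(hypernyms, key=lambda h: rank.get(normalize(h), len(rank)))


# Values copied from the Brooklyn Bridge example above.
fetched = ['thing', 'infrastructure', 'architectural structure', 'place',
           'route of transportation', 'bridge']
ontology_order = ['Bridge', 'RouteOfTransportation', 'Infrastructure',
                  'ArchitecturalStructure', 'Place', 'owl:Thing']

print(sort_classes(fetched, ontology_order))
# ['bridge', 'route of transportation', 'infrastructure',
#  'architectural structure', 'place', 'thing']
```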

michaelsilver commented 8 years ago

Hmm... were they stored in order before? If you try on Malta, with the old, incomplete database, all queries seem to return in order:

Malta:

>>> from wikithingsdb import fetch
>>> fetch.hypernyms_of_class("organization")
['organisation', 'agent', 'thing']
>>> fetch.hypernyms_of_class("basketball-club")
['basketball team', 'sports team', 'organisation', 'agent', 'thing']

Nauru:

>>> from wikithingsdb import fetch
>>> fetch.hypernyms_of_class("organization")
['agent', 'thing', 'organisation']
>>> fetch.hypernyms_of_class("basketball-club")
['agent', 'thing', 'organisation', 'sports team', 'basketball team']

This suggests to me that something about the changes in schemata or insertion method made it such that they are no longer ordered. Maybe I'm wrong, but go ahead and try comparing any of the following classes that are contained in Malta's database:

>>> fetch.classes_of_hypernym("owl:thing")
['organization', 'film', 'person', 'figure-skater', 'gaa-player', 'swimmer', 'scientist', 'glacier', 'football-biography', 'locomotive', 'london-station', 'football-club', 'college-coach', 'album', 'settlement', 'automobile', 'nfl-player', 'television-episode', 'newspaper', 'officeholder', 'historic-site', 'church', 'alpine-ski-racer', 'stadium', 'royalty', 'military-person', 'airport', 'artist', 'basketball-biography', 'musical-artist', 'video-game', 'nrhp', 'politician', 'horseracing-personality', 'martial-artist', 'school', 'mountain', 'company', 'aircraft-begin', 'aircraft-type', 'software', 'rugby-biography', 'christian-leader', 'station', 'book', 'military-conflict', 'basketball-club', 'rugby-union-biography', 'cricketer', 'election', 'military-unit', 'football-tournament-season', 'university', 'football-league-season', 'single', 'writer', 'racing-driver', 'hospital', 'sports-league', 'radio-station', 'ambassador', 'boxer', 'political-party', 'government-agency', 'noble', 'short-story', 'ncaa-team-season', 'college-football-player', 'song', 'road', 'economist', 'sportsperson', 'football-club-season']

I'm not sure what the point would be of writing a function in defexpand to sort the hypernyms, then. We might as well have defexpand then return the hypernyms and not fetch them from the database at all.

alvaromorales commented 8 years ago

Ah, we're using bulk inserts now which might explain the bad ordering. Do you need the hypernyms to be sorted?
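If sorted results do turn out to be needed, one option is to store an explicit position with each row and sort on it at query time, so the result no longer depends on insertion order. A minimal sketch, using sqlite3 as a stand-in for MySQL; the table layout, the depth column, and the function name ordered_hypernyms_of_class are all hypothetical and not the project's actual schema or API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: 'depth' records how far each hypernym is above the class.
conn.execute(
    "CREATE TABLE hypernyms (class_name TEXT, hypernym TEXT, depth INTEGER)"
)

# Rows deliberately inserted out of order, as a bulk insert might do.
rows = [
    ("basketball-club", "agent", 3),
    ("basketball-club", "thing", 4),
    ("basketball-club", "organisation", 2),
    ("basketball-club", "sports team", 1),
    ("basketball-club", "basketball team", 0),
]
conn.executemany("INSERT INTO hypernyms VALUES (?, ?, ?)", rows)


def ordered_hypernyms_of_class(connection, class_name):
    """Return hypernyms from most specific to most general via ORDER BY."""
    cursor = connection.execute(
        "SELECT hypernym FROM hypernyms WHERE class_name = ? ORDER BY depth",
        (class_name,),
    )
    return [row[0] for row in cursor]


ordered = ordered_hypernyms_of_class(conn, "basketball-club")
print(ordered)
# ['basketball team', 'sports team', 'organisation', 'agent', 'thing']
```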

"We might as well have defexpand then return the hypernyms and not fetch them from the database at all."

Why don't we do this? It seems like a simpler solution.

Please merge PR #10 for now; we can open a separate issue for this.