Open michaelsilver opened 9 years ago
To fix this, on the page table, issue:
delete from page where page_namespace != 0;
create index title_index on page (page_title);
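For anyone who wants to see the effect of those two statements without touching the real database, here is a minimal sketch using Python's built-in sqlite3 as a stand-in. The three-column `page` schema and the sample rows are assumptions made up for illustration (the real table has more columns and may live in a different database engine), and the `plan` helper is hypothetical.

```python
import sqlite3

# In-memory stand-in for the real database; schema and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page ("
    "page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT)"
)
conn.executemany(
    "INSERT INTO page (page_namespace, page_title) VALUES (?, ?)",
    [
        (0, "Brooklyn_Bridge"),
        (1, "Brooklyn_Bridge"),   # non-article namespace
        (0, "Malta"),
        (10, "Infobox_bridge"),   # non-article namespace
    ],
)


def plan(conn):
    """Return SQLite's query plan text for a lookup by page_title."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT page_id FROM page WHERE page_title = ?",
        ("Malta",),
    ).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " ".join(row[-1] for row in rows)


plan_before = plan(conn)  # full table scan, no index available yet

# The two statements from the comment above.
conn.execute("DELETE FROM page WHERE page_namespace != 0")
conn.execute("CREATE INDEX title_index ON page (page_title)")

plan_after = plan(conn)   # lookup now goes through title_index
remaining = conn.execute("SELECT COUNT(*) FROM page").fetchone()[0]

print(plan_before)
print(plan_after)
print(remaining)
```

The exact plan wording differs between SQLite versions, so the sketch only relies on whether `title_index` appears in it.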
Thanks for taking this up! One additional potentially relevant diagnostic tidbit is that the order of fetched hypernyms is now different since PR #10. For example:
>>> from wikithingsdb import fetch
>>> fetch.hypernyms_of_article('Brooklyn Bridge')
{'bridge': ['thing', 'infrastructure', 'architectural structure', 'place', 'route of transportation', 'bridge'], 'nrhp': ['thing', 'architectural structure', 'place', 'building']}
>>> fetch.hypernyms_of_class('bridge')
['thing', 'infrastructure', 'architectural structure', 'place', 'route of transportation', 'bridge']
whereas before, hypernyms were returned in ascending order (which is preferable), the same order in which defexpand returns them:
>>> from defexpand import infoclass
>>> ontology = infoclass.get_info_ontology()
>>> ontology.classes_above_infobox('bridge')
['Bridge', 'RouteOfTransportation', 'Infrastructure', 'ArchitecturalStructure', 'Place', 'owl:Thing']
Any idea why this changed?
EDIT: Is this related to the changes in #10? I can keep that PR open just in case. Let me know.
Rows in a database are not stored in order. There are no guarantees about the sort order of a query. Maybe you could add a sort_classes() function to defexpand.
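A rough sketch of what such a sort_classes() could look like, assuming the ontology can be reduced to a child-to-parent map. The `PARENT` dictionary and the `depth` helper are hypothetical; the chain in `PARENT` is simply read off the classes_above_infobox('bridge') output above, not taken from defexpand's actual data structures.

```python
# Hypothetical child -> parent map, for illustration only.
PARENT = {
    "bridge": "route of transportation",
    "route of transportation": "infrastructure",
    "infrastructure": "architectural structure",
    "architectural structure": "place",
    "place": "thing",
    "thing": None,
}


def depth(cls, parent=PARENT):
    """Count the steps from cls up to the root of the ontology."""
    steps = 0
    while parent[cls] is not None:
        cls = parent[cls]
        steps += 1
    return steps


def sort_classes(classes, parent=PARENT):
    """Order classes from most specific (deepest) to most general (root)."""
    return sorted(classes, key=lambda c: depth(c, parent), reverse=True)


# The order fetch.hypernyms_of_class('bridge') currently returns.
unordered = ['thing', 'infrastructure', 'architectural structure',
             'place', 'route of transportation', 'bridge']
ordered = sort_classes(unordered)
print(ordered)
```

Sorting by depth only gives a well-defined order when the classes lie on a single chain, as they do in this example.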
Hmm... were they stored in order before? If you try on Malta, with the old, incomplete database, all queries seem to return in order:
Malta:
>>> from wikithingsdb import fetch
>>> fetch.hypernyms_of_class("organization")
['organisation', 'agent', 'thing']
>>> fetch.hypernyms_of_class("basketball-club")
['basketball team', 'sports team', 'organisation', 'agent', 'thing']
Nauru:
>>> from wikithingsdb import fetch
>>> fetch.hypernyms_of_class("organization")
['agent', 'thing', 'organisation']
>>> fetch.hypernyms_of_class("basketball-club")
['agent', 'thing', 'organisation', 'sports team', 'basketball team']
This suggests to me that something about the changes in schemata or insertion method made it such that they are no longer ordered. Maybe I'm wrong, but go ahead and try comparing any of the following classes that are contained in Malta's database:
>>> fetch.classes_of_hypernym("owl:thing")
['organization', 'film', 'person', 'figure-skater', 'gaa-player', 'swimmer', 'scientist', 'glacier', 'football-biography', 'locomotive', 'london-station', 'football-club', 'college-coach', 'album', 'settlement', 'automobile', 'nfl-player', 'television-episode', 'newspaper', 'officeholder', 'historic-site', 'church', 'alpine-ski-racer', 'stadium', 'royalty', 'military-person', 'airport', 'artist', 'basketball-biography', 'musical-artist', 'video-game', 'nrhp', 'politician', 'horseracing-personality', 'martial-artist', 'school', 'mountain', 'company', 'aircraft-begin', 'aircraft-type', 'software', 'rugby-biography', 'christian-leader', 'station', 'book', 'military-conflict', 'basketball-club', 'rugby-union-biography', 'cricketer', 'election', 'military-unit', 'football-tournament-season', 'university', 'football-league-season', 'single', 'writer', 'racing-driver', 'hospital', 'sports-league', 'radio-station', 'ambassador', 'boxer', 'political-party', 'government-agency', 'noble', 'short-story', 'ncaa-team-season', 'college-football-player', 'song', 'road', 'economist', 'sportsperson', 'football-club-season']
I'm not sure what the point would be then to make a function in defexpand to sort the hypernyms. We might as well have defexpand then return the hypernyms and not fetch them from the database at all.
Ah, we're using bulk inserts now which might explain the bad ordering. Do you need the hypernyms to be sorted?
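If sorted results are needed, one option is to store the order explicitly and ask for it in the query, so the result no longer depends on how the rows were inserted. The sketch below uses sqlite3 as a stand-in; the `hypernyms` table, its `position` column, and this version of `hypernyms_of_class` are assumptions for illustration, not the actual wikithingsdb schema or API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: position records the rank of each hypernym in its chain.
conn.execute(
    "CREATE TABLE hypernyms (class_name TEXT, hypernym TEXT, position INTEGER)"
)

chain = ['basketball team', 'sports team', 'organisation', 'agent', 'thing']
rows = [("basketball-club", hypernym, i) for i, hypernym in enumerate(chain)]

# Bulk insert in reversed order to mimic rows arriving in an arbitrary order.
conn.executemany("INSERT INTO hypernyms VALUES (?, ?, ?)", reversed(rows))


def hypernyms_of_class(conn, class_name):
    """Fetch hypernyms in chain order, relying on ORDER BY rather than storage order."""
    cur = conn.execute(
        "SELECT hypernym FROM hypernyms WHERE class_name = ? ORDER BY position",
        (class_name,),
    )
    return [row[0] for row in cur]


result = hypernyms_of_class(conn, "basketball-club")
print(result)
```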
"We might as well have defexpand then return the hypernyms and not fetch them from the database at all."
Why don't we do this? It seems like a simpler solution.
Please merge PR #10 for now, we can open a separate issue for this.
Unclear whether this is related to PR #10, but fetching takes a really long time. The test suite takes 1703.845s (~30min) to complete. This is prohibitively slow.
Weirdly, the behavior is the opposite of what it was before. fetch.redirects_of_article() appears to run the fastest -- all other fetches take much longer.
Alvaro further notices: