openai / deeptype

Code for the paper "DeepType: Multilingual Entity Linking by Neural Type System Evolution"
https://arxiv.org/abs/1802.01021
Other
645 stars 148 forks source link

Getting coverage for only 6 million wikidata items in evolved type #44

Closed heisenbugfix closed 5 years ago

heisenbugfix commented 5 years ago

I ran the evolved type system code and noted down the number of unique wikidata items which are classified as true in all the type axis combined. I found that it covers only 6 million wikidata ids out of ~40 million. Is there a workaround for this? @JonathanRaiman

JonathanRaiman commented 5 years ago

Many of the 40Million are either for non English wikipedias (so unless you make the evolution using a training set that includes other languages, they won't show), or purely data fields (e.g. proteins/enzymes which have unique Wikidata fields)

heisenbugfix commented 5 years ago

Thanks for clarifying it. I trained only on English. There is one more problem though. I found that in my evolved type system, the 6 million covered wikidata items are solely under 1 type axis. So let's say I have 210 type axis from evolved system then for 98% of the axis very few items are classified under them . Only for one or two axis , all of the items are classified as true. This highly affects the training. I am getting 99% accuracy but its actually not useful because for almost all the type axis the items are classified as false. Is there a way to cope this problem if I only want to train in English Language?

Concretely the following are my type axis along with number of items(total 6692308) classified as true: ('Q25233764', 'Category:Action (philosophy)', 6691548) ('Q4557', 'Category:History of Jordan', 6691548) ('Q5837719', 'Category:Hebrew Bible', 6691548) ('Q6327591', 'Category:Metropolitan areas', 6691548) ('Q7033664', 'Category:Spiral galaxies', 6691548) ('Q7012005', 'Category:Human resource management', 215730) ('Q7033172', 'Category:Rock music', 176839) ('Q5550263', 'Category:African-American culture', 144887) ('Q8399800', 'Category:Economy of Texas', 38927) ('Q8589728', 'Category:Lieutenancy areas of Scotland', 32726) ('Q7517245', 'Category:Cycling', 31795) ('Q8970424', 'Category:Pnictogens', 22562) ('Q7032111', 'Category:Proprietary software', 20098) ('Q13226090', 'Category:Science books', 18585) ('Q8520139', 'Category:History of the National Football League', 16984) ('Q7011471', 'Category:Rape', 10562) ('Q6383703', 'Category:CONCACAF', 9422) ('Q6940900', 'Category:Buddhism by country', 8928) ('Q8212228', 'Category:Videotelephony', 8703) ('Q8297968', 'Category:Berlin State Museums', 6554) ('Q9888388', 'Category:Official languages', 6516) ('Q7009202', 'Category:United States Department of Homeland Security', 6005) ('Q7207445', 'Category:Dutch nobility', 4594) ('Q8221533', 'Category:Administrative divisions of regions of Ukraine', 4032) ('Q6788715', 'Category:1783 in the United States', 3696) ('Q8504112', 'Category:Halides', 3609) ('Q9004945', 'Category:Congregation of Holy Cross', 3282) ('Q7372371', 'Category:Nimruz Province', 2951) ('Q7429332', 'Category:New York Giants', 2190) ('Q8138425', 'Category:1939 in Poland', 2002) ('Q28130974', 'Category:United States in the Korean War', 1838) ('Q6783836', 'Category:2010 in the United States', 1801) ('Q8134180', 'Category:Victoria Cross', 1503) ('Q9135', 'Operating system', 1382) ('Q7163454', 'Category:Cayman Islands', 1303) ('Q8350038', 'Category:Catskills', 1278) ('Q8518548', 'Category:History of Lincolnshire', 1235) ('Q6948351', 'Category:History of Haiti', 1149) ('Q18524342', 'Category:Librarianship and human rights', 1088) ('Q8506566', 'Category:Hazardous waste', 978) ('Q8191070', 'Category:2006 FIFA World Cup', 968) ('Q13277426', 'Category:Nine Network', 881) ('Q9756661', 'Category:Confidence tricks', 798) ('Q879146', 'Christian denomination', 725) ('Q6449270', 'Category:Oviedo', 549) ('Q8280686', 'Category:Avesta', 493) ('Q8686669', 'Category:Rugby league in Scotland', 479) ('Q7106653', 'Category:Iranian nationalism', 446) ('Q8159843', 'Category:Water spirits', 437) ('Q6726367', 'Category:Economy of Fiji', 435) ('Q8690398', 'Category:Nursing ethics', 421) ('Q8717294', 'Category:Science centers', 367) ('Q7485139', 'Category:Articulated vehicles', 360) ('Q8417291', 'Category:English language in England', 359) ('Q25218323', 'Category:Structural basins', 337) ('Q7215695', 'Category:Michael Jackson', 315) ('Q6583867', 'Category:Sengoku period', 260) ('Q6596924', 'Category:Bank robbery', 258) ('Q8719028', 'Category:Scottish Enlightenment', 258) ('Q8579473', 'Category:Labor rights', 235) ('Q7809740', 'Category:Yeasts', 221) ('Q8614236', 'Category:Quantum phases', 221) ('Q25209133', "Category:Incorporation of Tibet into the People's Republic of China", 204) ('Q8153798', 'Category:Watergate scandal', 195) ('Q8553920', 'Category:Irish Free State', 184) ('Q15318548', 'Category:Criticism of science', 169) ('Q185363', 'Chronicle', 168) ('Q8855695', 'Category:Toonami', 152) ('Q24896203', 'Category:Tulu Nadu', 147) ('Q8983024', 'Category:Addition reactions', 121) ('Q7930819', 'Category:Scombridae', 113) ('Q8317183', 'Category:Buchanan County, Iowa', 112) ('Q8317182', 'Category:Buchanan County, Virginia', 102) ('Q8547723', 'Category:Dwight D. Eisenhower School for National Security and Resource Strategy', 100) ('Q68', 'Computer', 96) ('Q8300782', 'Category:Biology of bipolar disorder', 95) ('Q7008098', 'Category:Heinrich Himmler', 88) ('Q25251114', 'Category:Land grants', 72) ('Q8689953', 'Category:Russell family', 66) ('Q15297587', 'Category:Public holidays in Ghana', 60) ('Q8579173', 'Category:LaSalle, Quebec', 58) ('Q20857142', 'Category:RC Kouba', 54) ('Q8364816', 'Category:De Gaulle family', 52) ('Q7142528', 'Category:1996 in spaceflight', 48) ('Q8220201', 'Category:Acid anhydrides', 47) ('Q20503015', 'Category:Downtown Dallas', 38) ('Q7010700', 'Category:Bureau of Alcohol, Tobacco, Firearms and Explosives', 36) ('Q164142', 'Protectorate', 33) ('Q8607767', 'Category:Maragatería', 33) ('Q8567757', 'Category:Johnston Atoll', 32) ('Q8575839', 'Category:Korean Air', 32) ('Q8639126', 'Category:Monohydroxybenzoic acids', 32) ('Q6433036', 'Category:Shia days of remembrance', 30) ('Q6800356', 'Category:Gapyeong County', 29) ('Q26047334', 'Category:Catholic ecclesiology', 28) ('Q7485144', 'Category:Turing machine', 26) ('Q7140976', 'Category:Corrosion inhibitors', 25) ('Q47069', 'Metamorphic rock', 23) ('Q36784', 'Regions of France', 20) ('Q8257688', 'Category:Appomattox Court House National Historical Park', 18) ('Q24070276', 'Category:The Pickwick Papers', 15) ('Q7329743', 'Category:Jacques Chirac', 14) ('Q7353766', 'Category:Pedro II of Brazil', 11) ('Q8172643', 'Category:Wakashan languages', 11) ('Q8432590', 'Category:Cubana de Aviación', 10) ('Q8659804', 'Category:Retrospective diagnosis', 10) ('Q19639743', 'Category:Teterboro, New Jersey', 8) ('Q221911', 'Parliamentary procedure', 8) ('Q5535082', 'Geographical database', 7) ('Q6633516', 'Category:451', 7) ('Q8805274', 'Category:Public holidays in Afghanistan', 7) ('Q160598', 'Heresy', 5) ('Q19788934', 'Category:Fiat Chrysler Automobiles', 5) ('Q3125101', 'Trial court', 5) ('Q505821', 'Greco-Roman mysteries', 5) ('Q208511', 'Global city', 4) ('Q16933701', 'Principal city', 2)

JonathanRaiman commented 5 years ago

@heisenbugfix You hit the nail on the head. The best metric to use here is F1 or some other F-measure. I've used this to balance off 'accuracy' with actual performance due to massive class imbalance. I recommend either weighing (dynamically or otherwise) the negative class down and the positive class up to counter this. The other strategy is to have multi-class types, e.g. find areas of mutual exclusivity and merge them.

heisenbugfix commented 5 years ago

Thanks a lot :)