the-Fish2 / Optimizing-GloVe


Need to determine most effective clustering algo! #13

Closed the-Fish2 closed 2 years ago

the-Fish2 commented 2 years ago

Plan: compare the outputs of the candidate clustering algorithms (also using code, haha)
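One possible way to do the "using code" part, as a rough sketch only: score a candidate clustering by how well each hand-picked group is matched by its best-overlapping cluster. The names `jaccard` and `best_match_score` and the toy word lists are made up for illustration and are not taken from the repo.

```python
def jaccard(a, b):
    """Jaccard similarity of two word collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match_score(ideal, candidate):
    """Average, over ideal groups, of the best Jaccard match among candidate clusters."""
    if not ideal or not candidate:
        return 0.0
    return sum(max(jaccard(g, c) for c in candidate) for g in ideal) / len(ideal)


# Toy data for illustration, not real output from any of the algorithms.
ideal = [["he", "him", "his"], ["two", "three", "four"]]
candidate = [["he", "him"], ["two", "three", "four", "five"], ["game"]]
score = best_match_score(ideal, candidate)
print(f"best-match score: {score:.3f}")
```

A score of 1.0 means every ideal group is reproduced exactly by some cluster. The score does not penalize extra clusters in the candidate, so it is best read alongside a look at the cluster sizes.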

the-Fish2 commented 2 years ago

Ideal list (bash):

[['', 'our', 'on', 'no', 'company', 'By', 'we', 'us', 'your', 'my'], ['in', 'where', 'the', 'In', 'during', 'at', 'when', 'here', 'then', 'what'], ['for', 'in', 'For', 'as', 'but', 'the', 'where', 'In', 'during', 'at'], ['that', 'it', 'not', 'if', 'but', 'what', 'just', 'It', 'really', 'do'], ['is', 'was', 'are', "'re", "'m", 'now', 'had', 'were', 'been', 'came'], ['##', '5', '3', '2', '###', 'six', '1', '#,###', '##,###', '#.#'], ['The', 'This', 'That', 'A', 'It', 'which', 'But', 'And', 'If', 'that'], ['with', 'between', 'in', 'had', 'while', 'by', 'over', 'both', 'through', 'where'], ['said', 'says', 'told', 'added', 'according', 'think', 'does', 'know', 'also', 'had'], ['be', 'being', 'are', 'have', 'should', 'been', 'was', 'were', "'re", 'these'], ['from', 'in', 'after', 'where', 'through', 'the', 'In', 'during', 'at', 'before'], ['I', "'m", 'my', 'me', 'we', 'you', "'re", 'am', 'really', 'your'], ['he', 'He', 'him', 'his', 'she', 'I', 'She', 'They', 'But', 'but'], ['will', 'can', 'would', 'should', 'could', 'may', 'want', 'did', 'do', 'need'], ['has', 'had', 'been', 'have', "'ve", 'since', 'was', 'were', 'they', 'we'], ['####', 'since', '##', '1', 'year', 'last', 'been', 'has', 'after', '5'], ['an', 'another', 'this', 'the', 'A', 'was', 'one', 'next', 'every', 'that'], ['or', 'any', 'your', 'can', 'you', 'if', 'no', 'not', 'never', 'my'], ['their', 'they', 'them', 'our', 'its', 'your', 'we', 'They', 'do', 'us'], ['who', 'He', 'also', 'he', 'former', 'him', 'She', 'They', 'But', 'but'], ['$', 'million', '###,###', '#.##', '#.#', '##,###', 'billion', '#,###', '###', '##.#'], ['more', 'than', 'most', 'little', 'better', 'about', 'much', 'even', 'one', 'very'], ['up', 'down', 'out', 'off', 'around', 'back', 'into', 'on', 'over', 'where'], ['all', 'these', 'those', 'other', 'some', 'both', 'many', 'are', 'including', 'such'], ['two', 'three', 'four', 'five', 'six', 'few', '##', 'some', 'many', '5'], ['first', 'second', 'third', 'last', '##th', 'next', 'half', 
'three', 'six', 'ago'], ['time', 'day', 'days', 'when', 'months', 'year', 'week', 'month', 'night', 'years'], ['We', 'They', 'we', 'You', 'If', 'I', 'But', 'they', 'And', 'our'], ['new', 'next', 'will', 'another', 'the', 'own', 'last', 'first', 'this', 'can'], ['her', 'she', 'his', 'She', 'my', 'him', 'he', 'I', 'He', 'their'], ['people', 'children', 'those', 'them', 'us', 'just', 'family', 'school', 'these', 'other'], ['there', 'There', 'no', 'here', 'going', 'we', 'But', 'And', 'It', 'If'], ['U.S.', 'American', 'world', 'country', 'billion', 'government', 'AP', 'our', 'most', 'one'], ['so', 'too', 'but', 'because', 'really', 'very', 'not', 'But', 'It', 'if'], ['like', 'really', 'think', 'just', 'do', 'want', 'so', 'I', 'very', 'know'], ['only', 'one', 'just', 'not', 'but', 'even', 'two', 'three', 'five', 'four'], ['percent', '%', '#.#', '##.#', 'million', 'billion', '#.##', '###,###', '###', '##'], ['get', 'got', 'go', 'come', 'do', 'just', 'came', 'went', 'had', "'ve"], ['game', 'games', 'play', 'season', 'players', 'team', 'points', 'win', 'go', 'year'], ['against', 'game', 'win', 'in', '#-#', 'case', 'games', 'play', 'season', 'players'], ['made', 'make', 'came', 'had', 'no', 'did', 'get', 'do', 'if', 'come'], ['state', 'State', 'government', 'country', 'city', 'local', 'U.S.', 'Saturday', 'former', 'officials'], ['well', 'as', 'good', 'much', 'such', 'better', 'As', 'so', 'but', 'great'], ['home', 'family', 'back', 'game', 'off', 'when', 'children', 'life', 'own', 'people'], ['way', 'how', 'going', 'it', 'really', 'so', 'what', 'if', 'want', 'do'], ['work', 'do', 'go', 'get', 'come', 'so', 'want', 'know', 'not', 'did'], ['take', 'took', 'go', 'put', 'get', 'come', 'went', 'came', 'got', 'had'], ['high', 'top', 'down', 'school', 'well', 'up', 'best', 'second', 'third', 'one'], ['still', 'now', 'but', 'even', 'just', 'only', 'right', 'is', "'re", 'because'], ['old', 'man', 'ago', 'year', 'last', 'who', 'him', 'he', 'police', 'his'], ['see', 'know', 'think', 
'do', 'get', 'say', 'really', 'want', 'not', 'did'], ['business', 'company', 'companies', 'market', 'sales', 'services', 'its', 'government', 'quarter', 'percent'], ['under', 'put', 'on', 'into', 'without', 'the', 'come', 'go', 'take', 'get'], ['help', 'need', 'support', 'better', 'can', 'do', 'want', 'should', 'services', 'money'], ['end', 'start', 'until', 'point', 'next', 'back', 'go', 'early', 'run', 'before'], ['long', 'many', 'well', 'much', 'just', 'few', 'some', 'those', 'these', 'all'], ['information', 'services', 'report', 'money', 'For', 'any', 'service', 'companies', 'business', 'support'], ['part', 'this', 'because', 'the', 'also', 'really', 'another', 'that', 'last', 'it'], ['based', 'company', 'its', 'in', 'from', 'according', 'companies', 'business', 'market', 'sales'], ['left', 'right', 'went', 'took', 'came', 'back', 'now', 'just', 'going', 'do'], ['use', 'used', 'need', 'do', 'take', 'help', 'can', 'could', 'called', 'want'], ['today', 'Thursday', 'Wednesday', 'Monday', 'Tuesday', 'Friday', 'Saturday', 'Sunday'], ['same', 'every', 'each', 'this', 'the', 'only', 'another', 'one', 'all', 'other'], ['public', 'government', 'local', 'people', 'state', 'city', 'country', 'billion', 'companies', 'area'], ['set', 'put', 'come', 'start', 'the', 'for', 'go', 'take', 'get', 'came'], ['place', 'where', 'time', 'in', 'lead', 'the', 'when', 'here', 'then', 'what'], ['group', 'team', 'who', 'members', 'company', 'program', 'players', 'game', 'season', 'games'], ['found', 'were', 'according', 'say', 'see', 'was', 'are', 'had', 'been', 'have'], ['lot', 'some', 'little', 'really', 'much', 'great', 'many', 'few', 'those', 'these'], ['number', 'many', 'two', 'these', 'those', 'four', 'some', 'few', 'all', 'three'], ['##-##', '#-#', '##', '3', '##.#', '5', 'win', 'game', 'points', 'second'], ['show', 'see', 'say', 'know', 'so', 'program', 'think', 'do', 'get', 'not'], ['past', 'over', 'last', 'ago', 'years', 'have', 'around', 'between', 'off', 'first'], ['expected', 
'will', 'could', 'next', 'would', 'going', 'can', 'should', 'may', 'last'], ['hit', 'off', 'left', 'run', 'went', 'came', 'down', 'out', 'back', 'up'], ['system', 'program', 'service', 'services', 'state', 'way', 'group', 'show', 'business', 'work'], ['pm', 'am', 'Saturday', 'Sunday', 'Friday', 'Thursday', "'m", 'I', "'re", 'is'], ['big', 'great', 'good', 'lot', 'really', 'little', 'best', 'better', 'some', 'much'], ['per', '#.##', '##.#', '#.#', '$', '###', '5', 'percent', '##', '%'], ['&', 'By', '#.##', '3', 'A', '2', 'by', 'AP', 'In', 'For']]

the-Fish2 commented 2 years ago

Birch list. Problems with Birch: each point is assigned to exactly one cluster, which means the Semantle-style game, where a word can be part of multiple clusters, doesn't work. It also uses Euclidean distance, not cosine.
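Both limitations can be illustrated with a small sketch, which is not what the repo does: build one group per word from its top-k cosine neighbours, so a word can show up in several groups. The function names and the 2-D toy vectors are assumptions standing in for real GloVe embeddings.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def overlapping_groups(vectors, k=2):
    """For each word, return [word] + its k nearest words by cosine similarity."""
    unit = {w: normalize(v) for w, v in vectors.items()}
    groups = []
    for word, u in unit.items():
        # On unit vectors the dot product equals the cosine similarity.
        sims = [
            (sum(a * b for a, b in zip(u, v)), other)
            for other, v in unit.items()
            if other != word
        ]
        sims.sort(key=lambda pair: pair[0], reverse=True)
        groups.append([word] + [other for _, other in sims[:k]])
    return groups


# Toy 2-D vectors standing in for GloVe embeddings (made up for illustration).
toy = {
    "he": [1.0, 0.1],
    "him": [0.9, 0.2],
    "she": [0.8, 0.3],
    "two": [0.1, 1.0],
    "three": [0.2, 0.9],
}
groups = overlapping_groups(toy, k=2)
for group in groups:
    print(group)
```

For unit-length vectors, squared Euclidean distance equals 2 - 2 * cosine similarity, so normalizing the embeddings first makes a Euclidean-only method rank neighbours the same way cosine would. That helps with the distance issue but not with the one-cluster-per-word issue.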

image

the-Fish2 commented 2 years ago

Agglomerative:

Allows cosine distance, but the results are pretty bad. Words still can't be part of multiple clusters, it is slower than Birch, and a lot of somewhat isolated words end up completely alone while other groups are massive.
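The "isolated words plus massive groups" pattern can be put into numbers with a small size summary. This is a sketch under the assumption that each algorithm yields one integer label per word; `size_summary` and the toy labels are hypothetical.

```python
from collections import Counter


def size_summary(labels):
    """Summarize cluster sizes given one cluster label per word."""
    sizes = Counter(labels)
    total = len(labels)
    largest = max(sizes.values()) if sizes else 0
    return {
        "clusters": len(sizes),
        "singletons": sum(1 for n in sizes.values() if n == 1),
        "largest": largest,
        "largest_share": largest / total if total else 0.0,
    }


# Toy labels, not real output: one massive cluster plus isolated words.
labels = [0, 0, 0, 0, 0, 0, 1, 2, 3, 3]
summary = size_summary(labels)
print(summary)
```

Running the same summary on each algorithm's labels gives a quick side-by-side view of how many words are left alone and how dominant the biggest cluster is.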

image

the-Fish2 commented 2 years ago

DB Clustering: Allows cosine distance and is a bit faster than Birch. However, while some clusters are good, a lot of them are huge :(

image

the-Fish2 commented 2 years ago

Selected: Birch