uhh-lt / cam

The Comparative Argument Machine
http://ltdemos.informatik.uni-hamburg.de/cam/
MIT License

try to generate candidates for comparison #86

Closed alexanderpanchenko closed 4 years ago

alexanderpanchenko commented 6 years ago

Given a word like 'python', generate a list of comparison candidates, as in Google's 'python vs ...' autocomplete.

  1. Get all sentences containing the target word (python).

  2. Classify them (first word = python, second word = last / first noun in the sentence, text = input sentence), OR classify them iterating over all nouns in the sentence (first word = python, second word = i-th noun in the sentence, text = input sentence).

  3. Rank the nouns found in the sentences by the total number of comparative sentences (with a high threshold). No normalization is needed; just take the raw sentence counts.
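The three steps could be sketched roughly as follows; everything here (sentence retrieval, noun extraction, and the comparative classifier) is a hypothetical stub standing in for the actual CAM components, so only the shape of the pipeline is shown:

```python
from collections import Counter

def generate_candidates(target, sentences, extract_nouns, is_comparative, min_count=2):
    """Rank comparison candidates for `target` by raw comparative-sentence counts.

    sentences      -- sentences already known to contain `target` (step 1)
    extract_nouns  -- stand-in for a POS tagger returning the nouns of a sentence
    is_comparative -- stand-in for the pairwise classifier from step 2
    min_count      -- the "high threshold" from step 3
    """
    counts = Counter()
    for sentence in sentences:
        for noun in extract_nouns(sentence):
            if noun != target and is_comparative(target, noun, sentence):
                counts[noun] += 1
    # Step 3: rank by raw sentence counts; no normalization, per the issue description.
    return [(n, c) for n, c in counts.most_common() if c >= min_count]

# Toy demonstration with stubbed components:
sents = ["python vs perl speed", "python is better than perl", "python vs java"]
nouns = lambda s: [w for w in s.split() if w in {"perl", "java", "speed"}]
comparative = lambda t, n, s: "vs" in s or "better" in s
print(generate_candidates("python", sents, nouns, comparative, min_count=1))
```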

alexanderpanchenko commented 6 years ago

A caching mechanism is important so that we do not re-compute everything from scratch every time.

mschildw commented 6 years ago

I now tried a different approach:

  1. I query sentences with the following query: "text:(\<object> AND vs)", where \<object> is, for example, "python".

  2. I take the nouns (NN) where the following pattern matches: ( (vs|vs.) candidate | candidate (vs|vs.) ). These two steps alone deliver quite good results for comparison candidates (for python): [('perl', 40), ('java', 23), ('ruby', 22), ('php', 19), ('boa', 16), ('alligator', 15), ('julia', 14), ('net', 9), ('c++', 6), ('visual', 5), ('javascript', 4), ('gatoroid', 2), ('crocodile', 2), ('ruby ruby', 2), ('matlab gc', 2), ('brython', 2), ('cat', 2), ('lua', 2), ('qml', 2), ('jython', 1), ('lisp', 1), ('arc', 1), ('tiger', 1), ('rhinoscript', 1), ("print 'weave", 1), ('matlab/eeglab', 1), ('node', 1), ('python programs', 1), ('aqueon', 1), ('africanized honeybee', 1), ('gator', 1), ('gql', 1), ('profiling pypy', 1), ('scheme', 1), ('alligator watch', 1), ('deer', 1), ('octave', 1), ('nspr', 1), ('stones', 1), ('jlizard', 1), ('thinking upside down ruby', 1), ('ruby deathmatch', 1), ('kruger', 1), ('ruby performance', 1), ('cockatoo photos', 1), ('python-novaclient', 1), ('prothon', 1), ('film boa', 1), ('cython', 1), ('sas', 1), ("print 'f2py", 1), ('pycuda', 1)]
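For illustration, the pattern in step 2 could be approximated with a plain regular expression over the retrieved sentences. This stands in for the actual Elasticsearch query and NN tagging; the token pattern here is an assumption, not the CAM code:

```python
import re
from collections import Counter

def vs_candidates(target, sentences):
    """Count candidates matching '<candidate> vs <target>' or '<target> vs <candidate>'."""
    counts = Counter()
    # The character class is a rough stand-in for "noun-like token" (covers c++, node.js, ...).
    pattern = re.compile(
        r"(\w[\w+#./-]*)\s+vs\.?\s+{0}\b|{0}\s+vs\.?\s+(\w[\w+#./-]*)".format(re.escape(target)),
        re.IGNORECASE,
    )
    for s in sentences:
        for m in pattern.finditer(s):
            candidate = (m.group(1) or m.group(2)).lower()
            counts[candidate] += 1
    return counts.most_common()

print(vs_candidates("python", [
    "Python vs Perl: which is faster?",
    "I benchmarked perl vs python yesterday.",
    "python vs. ruby performance",
]))
```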

  3. I still need to filter out some candidates like "kruger". Do you think common hypernyms (WordNet) could be helpful for this? Are there standard functions to get common hypernyms for words? I only found one for synsets...

  4. Another approach could be to query the sentences containing the object and a candidate and count the sentences classified as "BETTER" or "WORSE", but that is very costly.

What do you think about the first 2 steps?

mschildw commented 6 years ago

WordNet seems not to be useful for the python example. After filtering out all candidates that do not share a common hypernym with python, only these are left: [('java', 23), ('ruby', 22), ('boa', 16), ('alligator', 15), ('net', 9), ('cat', 2), ('crocodile', 2), ('sas', 1), ('tiger', 1), ('lisp', 1), ('arc', 1), ('node', 1), ('stones', 1), ('octave', 1), ('deer', 1), ('gator', 1), ('scheme', 1)]

For many candidates there was no hypernym at all (e.g. perl, lua, c++), as can also be seen here: http://wordnetweb.princeton.edu/perl/webwn
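On the "standard functions" question: NLTK does expose common hypernyms only on the synset level (`wordnet.synsets(word)` plus `Synset.common_hypernyms(...)`), so for words one has to iterate over the synsets of both words and union the results. To keep this sketch dependency-free, a tiny hand-built hypernym map stands in for WordNet here; the filtering logic is the same:

```python
# Toy stand-in for WordNet: maps a word to the set of all its hypernyms.
# (With NLTK one would collect these from wn.synsets(w) via the hypernym closure
# or Synset.common_hypernyms; many tech terms simply have no entry, as noted above.)
HYPERNYMS = {
    "python":    {"snake", "reptile", "animal"},
    "boa":       {"snake", "reptile", "animal"},
    "alligator": {"reptile", "animal"},
    "kruger":    set(),   # no usable entry -> no hypernyms
    "perl":      set(),   # missing from WordNet entirely
}

def shares_hypernym(word, target, hypernyms=HYPERNYMS):
    """True if `word` and `target` have at least one hypernym in common."""
    return bool(hypernyms.get(word, set()) & hypernyms.get(target, set()))

candidates = ["boa", "alligator", "kruger", "perl"]
kept = [c for c in candidates if shares_hypernym(c, "python")]
print(kept)  # -> ['boa', 'alligator']
```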

alexanderpanchenko commented 6 years ago

I would expect that WordNet is not useful: its coverage is quite limited.

However, distributional models can be useful.

Here you will find the word similarities (AKA the distributional thesaurus, JoBimText) computed exactly from our corpus:

http://ltdata1.informatik.uni-hamburg.de/depcc/distributional-models/dependency_lemz-true_cooc-false_mxln-110_semf-true_sign-LMI_wpf-1000_fpw-1000_minw-5_minf-5_minwf-2_minsign-0.0_nnn-200/SimPruned/

You can try to use them to generate more candidates.
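Assuming the DT files expand to tab-separated word1/word2/similarity triples (the exact format should be checked against the download), using them to prune candidates in step 3 could look like this:

```python
import csv
import io

# A few fake DT lines in the assumed word1<TAB>word2<TAB>score format.
dt_data = "python\tperl\t0.81\npython\truby\t0.74\npython\tkruger\t0.02\n"

def load_neighbours(fh, min_sim=0.1):
    """Read similarity triples into {word1: {word2, ...}}, dropping weak pairs."""
    neighbours = {}
    for w1, w2, sim in csv.reader(fh, delimiter="\t"):
        if float(sim) >= min_sim:
            neighbours.setdefault(w1, set()).add(w2)
    return neighbours

dt = load_neighbours(io.StringIO(dt_data))
candidates = ["perl", "kruger", "ruby"]
filtered = [c for c in candidates if c in dt.get("python", set())]
print(filtered)  # -> ['perl', 'ruby']
```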

On Oct 29, 2018, at 1:48 PM, Matthias Schildwächter notifications@github.com wrote:

> WordNet seems not to be useful for the python example. After filtering out all candidates that do not share a common hypernym with python, only these are left:
>
> [('java', 13), ('alligator', 4), ('ruby', 4), ('scheme', 1), ('crocodile', 1), ('tiger', 1), ('cat', 1), ('gator', 1), ('kc', 1), ('deer', 1)]
>
> For many candidates there was no hypernym at all (e.g. perl, lua, c++), as can also be seen here: http://wordnetweb.princeton.edu/perl/webwn

mschildw commented 6 years ago

For the second filter approach the following comparison candidates were selected: ['lisp', 'lua', 'scheme', 'perl', 'visual', 'jython', 'net', 'cython', 'ruby', 'java', 'javascript', 'php', 'node', 'boa', 'julia', 'alligator', 'qml', 'python programs', 'cat', 'deer', 'crocodile', 'octave', 'tiger', 'arc', 'sas', 'gator', 'aqueon', 'prothon', 'ruby ruby', 'stones', 'brython', 'ruby performance', 'gql', 'nspr', 'pycuda']

They are sorted by the number of comparative sentences found for python and the candidate. If only candidates with more than 40 comparative sentences are shown, probably the best ones get presented: ['lisp', 'lua', 'scheme', 'perl', 'visual', 'jython', 'net', 'cython', 'ruby', 'java', 'javascript', 'php', 'node', 'boa', 'julia']

Comparing with Google: (screenshot of Google's "python vs" autocomplete suggestions)

Only r, c++, matlab and go are not found, so 60% are covered, and in addition some more candidates are found which could also be interesting.

alexanderpanchenko commented 6 years ago

I find this approach very interesting. It would be really great to show more examples of these…

Use the DT JoBimText to filter in step 3 (see my other mail for details).

On Oct 29, 2018, at 11:43 AM, Matthias Schildwächter notifications@github.com wrote:

> I now tried a different approach:
>
> I query sentences with the following query: "text:(\<object> AND vs)", where \<object> is, for example, "python".
>
> I take the nouns (NN) where the following pattern matches: ( (vs|vs.) candidate | candidate (vs|vs.) ). These two steps alone deliver quite good results for comparison candidates (for python): [('perl', 22), ('java', 15), ('php', 13), ('ruby', 9), ('alligator', 7), ('c', 6), ('lua', 4), ('r', 3), ('julia', 2), ('c++', 2), ('haskell', 2), ('crocodile', 1), ('tiger', 1), ('cat', 1), ('deer', 1), ('kruger', 1), ('gator', 1), ('qml', 1), ('ptrace', 1), ('jlizard', 1), ('visual', 1), ('dog', 1), ('kc', 1), ('scheme', 1), ('javascript', 1)]
>
> I still need to filter out some candidates like "kruger". Do you think common hypernyms (WordNet) could be helpful for this? Are there standard functions to get common hypernyms for words? I only found one for synsets...
>
> Another approach could be to query the sentences containing the object and a candidate and count the sentences classified as "BETTER" or "WORSE", but that is very costly.
>
> What do you think about the first 2 steps?

alexanderpanchenko commented 6 years ago

Looks quite good. Actually I think that it is already worth deploying it (to see how it works more realistically and to be able to play with it…)

On Oct 29, 2018, at 5:06 PM, Matthias Schildwächter notifications@github.com wrote:

> They are sorted by the number of comparative sentences found for python and the candidate. If only candidates with more than 40 comparative sentences are shown, probably the best…

mschildw commented 6 years ago

Thanks for the hint with JoBimText, I hope it is easy to figure out how to use it.

About deploying: at the moment it is not really operating in real time; it takes about 15 seconds to process steps 1 and 2. The filtering (step 3) using the BoW classifier feature set takes minutes.

Maybe it is something we have to do beforehand: taking seed words from different domains and searching for their comparison candidates, then continuing with the candidates found, and so on. We could then save the results to a DB or to the file system.
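Such a precomputation could be cached very simply, e.g. as a JSON file keyed by seed word; the names here are illustrative, not the actual CAM code:

```python
import json
import os
import tempfile

def cached_candidates(seed, compute, cache_path):
    """Return candidates for `seed`, running the expensive pipeline only once."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            cache = json.load(fh)
    if seed not in cache:
        cache[seed] = compute(seed)          # the ~15s+ steps 1-3 would go here
        with open(cache_path, "w") as fh:
            json.dump(cache, fh)
    return cache[seed]

path = os.path.join(tempfile.mkdtemp(), "candidates.json")
calls = []
compute = lambda s: (calls.append(s) or ["perl", "ruby", "java"])
first = cached_candidates("python", compute, path)
second = cached_candidates("python", compute, path)  # served from the cache
print(first == second, len(calls))  # -> True 1
```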

alexanderpanchenko commented 6 years ago

On Oct 29, 2018, at 5:29 PM, Matthias Schildwächter notifications@github.com wrote:

> Thanks for the hint with JoBimText, I hope it is easy to figure out how to use it.

All the files are big, but you can trim them considerably by sorting all the values by their scores, keeping some 20% of the top entries, and removing the remaining 80% of the word pairs.

> About deploying: at the moment it is not really operating in real time; it takes about 15 seconds to process steps 1 and 2. The filtering (step 3) using the BoW classifier feature set takes minutes.

Ok, maybe later then.

> Maybe it is something we have to do beforehand: taking seed words from different domains and searching for their comparison candidates, then continuing with the candidates found, and so on. We could then save the results to a DB or to the file system.
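The trimming suggested above (sort by score, keep only the top fraction of word pairs) could be sketched as follows, again assuming (word1, word2, score) triples:

```python
def prune_top_fraction(triples, keep=0.2):
    """Keep the highest-scoring `keep` fraction of (word1, word2, score) triples."""
    ranked = sorted(triples, key=lambda t: t[2], reverse=True)
    return ranked[: max(1, int(len(ranked) * keep))]

triples = [("python", "perl", 0.81), ("python", "ruby", 0.74),
           ("python", "kruger", 0.02), ("python", "stone", 0.01),
           ("python", "boa", 0.66)]
print(prune_top_fraction(triples, keep=0.4))
# -> [('python', 'perl', 0.81), ('python', 'ruby', 0.74)]
```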

mschildw commented 6 years ago

Do you maybe have an example of how to use the JoBimText DT? That would be great, since that is the last part of the thesis I need to write about (and create content to write about), and I would like to send you the first draft including this part at the end of this week.

I have to set up a local database to use it, right? The trim operation can be achieved using this: http://ltmaggie.informatik.uni-hamburg.de/jobimtext/documentation/pruning/ , right?

alexanderpanchenko commented 6 years ago

No, in your case just download the files I gave the link to (a bunch of archives). You will get a huge set of triples word1:word2:similarity. I would index them using Elasticsearch and use them at stage 3. The JoBimText model includes many more parts that you do not need; the part you need is called the DT.

Sent from my iPhone

On 29. Oct 2018, at 17:50, Matthias Schildwächter notifications@github.com wrote:

> Do you maybe have an example of how to use the JoBimText DT? That would be great, since that is the last part of the thesis I need to write about (and create content to write about), and I would like to send you the first draft including this part at the end of this week.
>
> I have to set up a local database to use it, right? The trim operation can be achieved using this: http://ltmaggie.informatik.uni-hamburg.de/jobimtext/documentation/pruning/ , right?

mschildw commented 6 years ago

Alright, thank you very much for the clarification. I thought I would have to understand how to set up and use JoBimText now. I will have a look at how well that works for filtering the candidates, thanks!