Given a word like 'python', generate the list of comparison candidates, as in Google's 'python vs ...' autocompletion:
1. Get all sentences containing the target word (python).
2. Classify them (first word = python, second word = last/first noun in the sentence, text = input sentence), OR classify them iterating over all nouns in the sentence (first word = python, second word = i-th noun in the sentence, text = input sentence).
3. Rank the nouns found in the sentences by the total number of comparative sentences (with a high threshold). No normalization is needed - just take the raw sentence counts (see the sketch below).
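A minimal sketch of the classify-and-count idea in steps 2-3; `is_comparative` is a stand-in for whatever sentence classifier is actually used, and `min_count` plays the role of the threshold (both names are my own, not from this thread):

```python
from collections import Counter

def rank_candidates(tagged_sentences, is_comparative, target="python", min_count=2):
    """Count, per noun, how many comparative sentences pair it with the target.

    tagged_sentences: iterable of (sentence_text, [nouns_in_sentence]) pairs.
    is_comparative:   stand-in for the real sentence classifier.
    """
    counts = Counter()
    for text, nouns in tagged_sentences:
        for noun in nouns:
            if noun != target and is_comparative(target, noun, text):
                counts[noun] += 1
    # raw counts, no normalization; min_count acts as the "high threshold"
    return [(n, c) for n, c in counts.most_common() if c >= min_count]
```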
I query sentences with the following query: "text:(<object> AND vs)", where <object> is e.g. "python".
I take the nouns (NN) where the following pattern matches: ( (vs|vs.) candidate | candidate (vs|vs.) )
These two steps alone deliver quite good results for comparison candidates (for python):
[('perl', 22), ('java', 15), ('php', 13), ('ruby', 9), ('alligator', 7), ('c', 6), ('lua', 4), ('r', 3), ('julia', 2), ('c++', 2), ('haskell', 2), ('crocodile', 1), ('tiger', 1), ('cat', 1), ('deer', 1), ('kruger', 1), ('gator', 1), ('qml', 1), ('ptrace', 1), ('jlizard', 1), ('visual', 1), ('dog', 1), ('kc', 1), ('scheme', 1), ('javascript', 1)]
I still need to filter out some candidates like "kruger". Do you think common hypernyms (WordNet) could be helpful for this? Are there standard functions to get common hypernyms for words? I only found one for synsets...
Another approach could be to query the sentences containing both the object and a candidate and count the sentences classified as "BETTER" or "WORSE", but that is very costly.
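A sketch of the query-plus-pattern extraction above, assuming the sentences have already been retrieved from the index and using NLTK for POS tagging (the tagger and tokenizer models are assumed to be downloaded):

```python
import re
from collections import Counter

import nltk  # assumes 'punkt' and 'averaged_perceptron_tagger' are downloaded

VS = re.compile(r"^vs\.?$", re.IGNORECASE)

def extract_candidates(sentences, target="python"):
    """Collect nouns directly left or right of 'vs'/'vs.' and count them."""
    counts = Counter()
    for sent in sentences:
        tagged = nltk.pos_tag(nltk.word_tokenize(sent))
        for i, (token, _) in enumerate(tagged):
            if not VS.match(token):
                continue
            # pattern: candidate (vs|vs.)  |  (vs|vs.) candidate
            for j in (i - 1, i + 1):
                if 0 <= j < len(tagged):
                    word, tag = tagged[j]
                    if tag.startswith("NN") and word.lower() != target:
                        counts[word.lower()] += 1
    return counts.most_common()
```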
WordNet seems not to be useful for the python example: after filtering out all candidates that do not share a common hypernym with python, only these are left:
[('java', 23), ('ruby', 22), ('boa', 16), ('alligator', 15), ('net', 9), ('cat', 2), ('crocodile', 2), ('sas', 1), ('tiger', 1), ('lisp', 1), ('arc', 1), ('node', 1), ('stones', 1), ('octave', 1), ('deer', 1), ('gator', 1), ('scheme', 1)]
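For reference, NLTK indeed only provides lowest_common_hypernyms() on synsets, so a word-level check has to iterate over all noun senses of both words. A minimal sketch (the min_depth cutoff is my own arbitrary addition, meant to exclude trivial hypernyms like entity.n.01):

```python
from nltk.corpus import wordnet as wn  # assumes the WordNet corpus is downloaded

def share_common_hypernym(word1, word2, min_depth=6):
    """True if any pair of noun senses shares a sufficiently specific hypernym."""
    for s1 in wn.synsets(word1, pos=wn.NOUN):
        for s2 in wn.synsets(word2, pos=wn.NOUN):
            for hyp in s1.lowest_common_hypernyms(s2):
                # very shallow hypernyms (entity, object, ...) match almost anything
                if hyp.min_depth() >= min_depth:
                    return True
    return False
```

Checking all sense pairs also explains why animals like 'boa', 'alligator', or 'tiger' survive the filter: they share specific hypernyms with the snake sense of 'python'.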
For the second filter approach, the following comparison candidates were selected:
['lisp', 'lua', 'scheme', 'perl', 'visual', 'jython', 'net', 'cython', 'ruby', 'java', 'javascript', 'php', 'node', 'boa', 'julia', 'alligator', 'qml', 'python programs', 'cat', 'deer', 'crocodile', 'octave', 'tiger', 'arc', 'sas', 'gator', 'aqueon', 'prothon', 'ruby ruby', 'stones', 'brython', 'ruby performance', 'gql', 'nspr', 'pycuda']
They are sorted by the number of comparative sentences found for python and the candidate. If only candidates with more than 40 comparative sentences are shown, probably the best ones get presented:
['lisp', 'lua', 'scheme', 'perl', 'visual', 'jython', 'net', 'cython', 'ruby', 'java', 'javascript', 'php', 'node', 'boa', 'julia']
Comparing with Google: only r, c++, matlab, and go are not found, so 60% of Google's suggestions are covered, and in addition some further candidates are found which could also be interesting.
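A sketch of this second, more expensive filter, which counts the sentences classified as "BETTER"/"WORSE" per pair; the index name 'sentences', the field name 'text', and classify() (standing in for the BoW classifier) are all assumptions, and the search call follows the elasticsearch-py 8 style:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local instance

def comparative_sentence_count(target, candidate, classify, index="sentences"):
    """Count sentences mentioning both words that the classifier labels BETTER/WORSE."""
    query = {"query_string": {"query": f'text:("{target}" AND "{candidate}")'}}
    hits = es.search(index=index, query=query, size=1000)["hits"]["hits"]
    return sum(
        1
        for hit in hits
        if classify(target, candidate, hit["_source"]["text"]) in ("BETTER", "WORSE")
    )
```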
Thanks for the hint about JoBimText, I hope it is easy to figure out how to use it.
About deploying: at the moment it is not really operating in real time; it takes about 15 seconds to process steps 1 and 2. The filtering (step 3) using the BoW classifier feature set takes minutes.
Maybe it is something we have to do beforehand: take seed words from different domains and search for their comparison candidates, then continue with the found candidates, and so on. We could then save the results to a DB or file system.
The files are all big, but you can trim them considerably by sorting all the values by their scores, keeping some 20% of the top entries, and removing the remaining 80% of the word pairs.
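A sketch of such a trimming pass, assuming tab-separated word1/word2/score triples (the actual DT file format should be checked, and for files too large for memory an external sort would be needed):

```python
def trim_dt_file(path_in, path_out, keep_fraction=0.2):
    """Keep only the top-scoring fraction of word1<TAB>word2<TAB>score triples."""
    with open(path_in, encoding="utf-8") as f:
        rows = [line.rstrip("\n").split("\t") for line in f]
    rows = [r for r in rows if len(r) == 3]
    rows.sort(key=lambda r: float(r[2]), reverse=True)  # highest similarity first
    keep = rows[: int(len(rows) * keep_fraction)]
    with open(path_out, "w", encoding="utf-8") as f:
        for r in keep:
            f.write("\t".join(r) + "\n")
```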
OK, maybe later then.
Do you maybe have an example of how to use the JoBimText DT?
That would be great, since that is the last part of the thesis I need to write about (and create content to write about), and I would like to send you the first draft including this part at the end of this week.
No, in your case just download the files I gave the link to (a bunch of archives). You will get a huge set of triples word1:word2:similarity. I would index them using Elasticsearch and use them at stage 3. The JoBimText model includes many more parts which you do not need; the part you need is called DT.
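A sketch of that setup with the Python Elasticsearch client; the index name 'dt', the field names, the local host, and the tab-separated triple format are all assumptions, and the search call follows the elasticsearch-py 8 style:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local instance

def index_dt(triples_path, index="dt"):
    """Bulk-index word1<TAB>word2<TAB>similarity triples."""
    def actions():
        with open(triples_path, encoding="utf-8") as f:
            for line in f:
                w1, w2, sim = line.rstrip("\n").split("\t")
                yield {"_index": index,
                       "_source": {"w1": w1, "w2": w2, "sim": float(sim)}}
    helpers.bulk(es, actions())

def dt_similarity(target, candidate, index="dt"):
    """Stage-3 filter: return the DT similarity of the pair, or 0.0 if absent."""
    query = {"bool": {"must": [{"match": {"w1": target}},
                               {"match": {"w2": candidate}}]}}
    hits = es.search(index=index, query=query, size=1)["hits"]["hits"]
    return hits[0]["_source"]["sim"] if hits else 0.0
```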
Do you maybe have an example of how to use the DT JoBIM?
That would be great, since that is the last part of the thesis I need to write about (and create content to write about) and I would like to send the first draft including this part to you at the end of this week.
Alright, thank you very much for the clarification; I thought I had to understand how to set up and use the full JoBimText pipeline.
I will have a look at how well that works for filtering the candidates, thanks!