Open wejradford opened 10 years ago
/data1/gigacluster/clustering-0.4-0.4/*.clusters.stem_subsets
now holds an attempt at pulling some more useful pairs for this task. With the following heuristics, the maximum yield from the clusters processed so far is ~24000 pairs:
Examples from 1994 include:
Berringer threw two interceptions in the scrimmage .
Berringer , who threw two interceptions in the scrimmage , took the decision in stride .
An international spokesman for Sephardic Jews , he was a world-renowned scholar on their history and interpretation of Jewish law .
Goan was an international spokesman for Sephardic Jews -- descendants of those who fled the Spanish Inquisition in 1492 -- and a world-renowned scholar on their history and interpretation of Jewish law .
WDYT, @dominickng?
(one likely change is to use absolute sentence length difference, not relative)
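For concreteness, a minimal sketch of that change, with illustrative thresholds only (the actual heuristic and its factor live in the clustering code, not here):

```python
def lengths_differ_relative(short_len, long_len, factor=1.5):
    # Current style of heuristic: the longer sentence must be at least
    # `factor` times the length of the shorter one.
    return long_len >= factor * short_len

def lengths_differ_absolute(short_len, long_len, min_extra=5):
    # Suggested change: bound the difference in token counts directly,
    # so very short sentences aren't paired with near-duplicates.
    return long_len - short_len >= min_extra
```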
Here are a few notes:
- Some pairs are exact copies of one another (forbid exact matching?)
- Some pairs are exact copies of quotes, with varying permutations:
  - one in double quotes (") and the other in two opening backtick quotes and two closing single quotes (`` .... '')
  - one with surrounding quote marks, and the other without
- Some matches are the city/location dateline lines (LOS ANGELES AFP)
- I tried a very simple length filter and got:
  - pairs with one sentence of 10 tokens or fewer: 523
  - pairs with one sentence of 15 tokens or fewer: 1003

I need to do a more specific analysis of sentence pairs to see if there's anything useful in there.
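A rough sketch of the filters those notes suggest; the function name, the dateline regex, and the length cut-off are all assumptions for illustration, not anything currently in gigacluster:

```python
import re

# Wire-service dateline lines like "LOS ANGELES AFP" (pattern is a guess).
DATELINE_RE = re.compile(r'^[A-Z][A-Z .,-]*\b(AFP|AP|REUTERS)\b')

def keep_pair(short_toks, long_toks, min_short_len=10):
    """Basic filters over a (short, long) sentence pair; thresholds illustrative."""
    # Forbid exact copies of one another.
    if short_toks == long_toks:
        return False
    # Drop city/location dateline matches.
    if DATELINE_RE.match(' '.join(short_toks)):
        return False
    # Assume pairs whose short side is under min_short_len tokens are too thin to use.
    if len(short_toks) < min_short_len:
        return False
    return True
```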
Exact copies are surprising, given the double-length heuristic...
(Not double, sorry, but times some factor)
Some examples are:
`` The questions cannot only be solved by computers .
" The questions cannot only be solved by computers ...\n
` You should n't talk about retirement 10 minutes after the game . ''
" You should n\'t talk about retirement 10 minutes after the game . "\n
`` In fact , she is the daughter of one of our generals .
In fact , she is the daughter of one of our generals .\n
`` This man is a hero .
" This man is a hero .\n
Right now the problem for me is that the pairs mostly seem to fall into these categories:
- no useful differences between the pairs (duplication)
- no syntactic variation between the pairs (i.e. the shorter fragment has exactly the same analysis as the corresponding part of the longer pair)
- too much difference between the pairs, which might require synonym/etc. machinery to distinguish, e.g.

Shares of Disney rose 75 cents , to $ 43.25 , on the New York Stock Exchange on Wednesday .
Disney stock finished up 75 cents at 43.25 dollars on the New York Stock Exchange
I haven't actually found a pair yet with a useful syntactic distinction and complete overlap - though this is a painfully manual checking process that's going very slowly. I imagine that even when I find a pair, it may be difficult to apply the constraints to the longer sentence without entities to ground the links.
I'm wondering whether this clustering process could be applied to Clueweb '09 with the FACC annotations. If we cluster Clueweb '09 first by entity buckets, then apply this clustering over a bucket, we might get useful looking relations between entities overlapping between multiple sentences.
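If that direction is worth exploring, a very rough sketch of the bucketing step; the FACC input shape assumed here (document id, Freebase mid pairs) is hypothetical, only the grouping idea comes from the comment above:

```python
from collections import defaultdict

def bucket_by_entity(doc_entities):
    """Group ClueWeb document ids by annotated entity.

    `doc_entities` is assumed to be an iterable of (doc_id, freebase_mid)
    pairs derived from the FACC annotations; each document joins one bucket
    per entity it mentions.
    """
    buckets = defaultdict(set)
    for doc_id, mid in doc_entities:
        buckets[mid].add(doc_id)
    return buckets

# The existing clustering would then run within each bucket, so candidate
# sentence pairs only come from documents that share at least one entity.
```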
Are you sure you're looking at the right data? None of the examples you've cited should match these criteria:

`grep 'Disney stock finished up 75 cents' /data1/gigacluster/clustering-0.4-0.4/*.stem_subsets`

matches nothing...
I see, I'm looking at Will's original data set. I'll redo the analysis.
-Dom (on iPhone)
too much difference between the pairs that might require synonym/etc. machinery to distinguish,
Well, if that's the path you need to go down, there are probably ways to do it, as long as we are able to learn with constraints that aren't fully lexicalised.
I see, I'm looking at will's original data set. I'll redo the analysis
Not that you won't see similarly useless/futile pairs!
Having finally got a Py3k virtualenv up with the relevant dependencies, I've committed this script in b1e885f.
We need to adjust and calibrate the overlap measure: it takes short sentences, stems their tokens, and looks for longer sentences that contain as many of those tokens as possible. We may consider IDF weighting, or upweighting capitalised sequences.
This will hopefully identify useful pairs for @dominickng's work.
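As a sketch of what that matching might look like (NLTK's Porter stemmer and the optional IDF dict are stand-ins here; the committed script is the reference):

```python
from collections import Counter
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_coverage(short_sent, long_sent, idf=None):
    """Weighted fraction of the short sentence's stems found in the long sentence.

    `idf` is an optional {stem: weight} dict; without it every stem counts equally.
    """
    short_stems = Counter(stemmer.stem(t.lower()) for t in short_sent)
    long_stems = Counter(stemmer.stem(t.lower()) for t in long_sent)
    weight = (lambda s: idf.get(s, 1.0)) if idf else (lambda s: 1.0)
    covered = sum(min(c, long_stems[s]) * weight(s) for s, c in short_stems.items())
    total = sum(c * weight(s) for s, c in short_stems.items())
    return covered / total if total else 0.0

# Pairs whose coverage clears a threshold (to be calibrated) would be kept as
# candidate short/long sentence pairs; capitalised sequences could be upweighted
# via the idf-style dict.
```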