Open wejradford opened 10 years ago
/data1/gigacluster/clustering-0.4-0.4/*.clusters.stem_subsets
now holds an attempt at pulling some more useful pairs for this task. With the following heuristics, the maximum yield from the clusters processed so far is ~24000 pairs:
Examples from 1994 include:
Berringer threw two interceptions in the scrimmage .
Berringer , who threw two interceptions in the scrimmage , took the decision in stride .
An international spokesman for Sephardic Jews , he was a world-renowned scholar on their history and interpretation of Jewish law .
Goan was an international spokesman for Sephardic Jews -- descendants of those who fled the Spanish Inquisition in 1492 -- and a world-renowned scholar on their history and interpretation of Jewish law .
WDYT, @dominickng?
(one likely change is to use absolute sentence length difference, not relative)
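For concreteness, a minimal sketch of that change, with illustrative thresholds only (the actual heuristic and its factor live in the clustering code, not here):

```python
def lengths_differ_relative(short_len, long_len, factor=1.5):
    # Current style of heuristic: the longer sentence must be at least
    # `factor` times the length of the shorter one.
    return long_len >= factor * short_len

def lengths_differ_absolute(short_len, long_len, min_extra=5):
    # Suggested change: bound the difference in token counts directly,
    # so very short sentences aren't paired with near-duplicates.
    return long_len - short_len >= min_extra
```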
Here are a few notes:
- Some pairs are exact copies of one another (forbid exact matching?)
- Some pairs are exact copies of quotes, with varying permutations:
  - one in double quotes (") and the other in two opening backtick quotes and two closing single quotes (`` .... '')
  - one with surrounding quote marks, and the other without
- Some matches are the city/location dateline lines (LOS ANGELES AFP)
- I tried a very simple length filter and got:
  - pairs with one sentence of 10 tokens or fewer: 523
  - pairs with one sentence of 15 tokens or fewer: 1003

I need to do a more specific analysis of sentence pairs to see if there's anything useful in there.
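A rough sketch of the filters those notes suggest; the function name, the dateline regex, and the length cut-off are all assumptions for illustration, not anything currently in gigacluster:

```python
import re

# Wire-service dateline lines like "LOS ANGELES AFP" (pattern is a guess).
DATELINE_RE = re.compile(r'^[A-Z][A-Z .,-]*\b(AFP|AP|REUTERS)\b')

def keep_pair(short_toks, long_toks, min_short_len=10):
    """Basic filters over a (short, long) sentence pair; thresholds illustrative."""
    # Forbid exact copies of one another.
    if short_toks == long_toks:
        return False
    # Drop city/location dateline matches.
    if DATELINE_RE.match(' '.join(short_toks)):
        return False
    # Assume pairs whose short side is under min_short_len tokens are too thin to use.
    if len(short_toks) < min_short_len:
        return False
    return True
```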
Exact copies are surprising, given the double-length heuristic...
(Not double, sorry, but times some factor)
Some examples are:
`` The questions cannot only be solved by computers .
" The questions cannot only be solved by computers ...\n
` You should n't talk about retirement 10 minutes after the game . ''
" You should n\'t talk about retirement 10 minutes after the game . "\n
`` In fact , she is the daughter of one of our generals .
In fact , she is the daughter of one of our generals .\n
`` This man is a hero .
" This man is a hero .\n
Right now the problem for me is that the pairs mostly seem to fall into these categories:
- no useful differences between the pairs (duplication)
- no syntactic variation between the pairs (i.e. the shorter fragment has exactly the same analysis as the corresponding part of the longer pair)
- too much difference between the pairs, which might require synonym/etc. machinery to distinguish, e.g.

Shares of Disney rose 75 cents , to $ 43.25 , on the New York Stock Exchange on Wednesday .
Disney stock finished up 75 cents at 43.25 dollars on the New York Stock Exchange
I haven't actually found a pair yet with a useful syntactic distinction and complete overlap - though this is a painfully manual checking process that's going very slowly. I imagine that even when I find a pair, it may be difficult to apply the constraints to the longer sentence without entities to ground the links.
I'm wondering whether this clustering process could be applied to Clueweb '09 with the FACC annotations. If we cluster Clueweb '09 first by entity buckets, then apply this clustering over a bucket, we might get useful looking relations between entities overlapping between multiple sentences.
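If that direction is worth exploring, a very rough sketch of the bucketing step; the FACC input shape assumed here (document id, Freebase mid pairs) is hypothetical, only the grouping idea comes from the comment above:

```python
from collections import defaultdict

def bucket_by_entity(doc_entities):
    """Group ClueWeb document ids by annotated entity.

    `doc_entities` is assumed to be an iterable of (doc_id, freebase_mid)
    pairs derived from the FACC annotations; each document joins one bucket
    per entity it mentions.
    """
    buckets = defaultdict(set)
    for doc_id, mid in doc_entities:
        buckets[mid].add(doc_id)
    return buckets

# The existing clustering would then run within each bucket, so candidate
# sentence pairs only come from documents that share at least one entity.
```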
Are you sure you're looking at the right data? None of the examples you've cited should match these criteria:

`grep 'Disney stock finished up 75 cents' /data1/gigacluster/clustering-0.4-0.4/*.stem_subsets`

matches nothing...
I see, I'm looking at Will's original data set. I'll redo the analysis.
-Dom (on iPhone)
too much difference between the pairs that might require synonym/etc. machinery to distinguish,
Well, if that's the path you need to go down, there are probably ways to do it, as long as we are able to learn with constraints that aren't fully lexicalised.
I see, I'm looking at will's original data set. I'll redo the analysis
Not that you won't see similarly useless/futile pairs!
Having finally got a Py3k virtualenv up with the relevant dependencies, I've committed this script in b1e885f.
We need to adjust and calibrate the overlap measure: it takes short sentences, stems their tokens, and looks for longer sentences that contain as many of those tokens as possible. We may consider IDF weighting, or upweighting capitalised sequences.
This will hopefully identify useful pairs for @dominickng's work.
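As a sketch of what that matching might look like (NLTK's Porter stemmer and the optional IDF dict are stand-ins here; the committed script is the reference):

```python
from collections import Counter
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_coverage(short_sent, long_sent, idf=None):
    """Weighted fraction of the short sentence's stems found in the long sentence.

    `idf` is an optional {stem: weight} dict; without it every stem counts equally.
    """
    short_stems = Counter(stemmer.stem(t.lower()) for t in short_sent)
    long_stems = Counter(stemmer.stem(t.lower()) for t in long_sent)
    weight = (lambda s: idf.get(s, 1.0)) if idf else (lambda s: 1.0)
    covered = sum(min(c, long_stems[s]) * weight(s) for s, c in short_stems.items())
    total = sum(c * weight(s) for s, c in short_stems.items())
    return covered / total if total else 0.0

# Pairs whose coverage clears a threshold (to be calibrated) would be kept as
# candidate short/long sentence pairs; capitalised sequences could be upweighted
# via the idf-style dict.
```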