uclnlp / simpleNumericalFactChecker

Fact checker for simple claims about statistical properties

locationTokenIDs and numberTokenIDs being duplicated in certain cases #14

Closed dhruvghulati-zz closed 8 years ago

dhruvghulati-zz commented 8 years ago

When creating a new version of my parsed sentences with NUMBER_SLOT and LOCATION_SLOT filled in, I see two main issues:

1) Millions and billions are not handled: the number and its multiplier each get their own NUMBER_SLOT, e.g.

    "parsedSentence": "Trade with LOCATION_SLOT has increased NUMBER_SLOT % since 2004 -LRB- TIFA -RRB- up t0 NUMBER_SLOT NUMBER_SLOT dollars from NUMBER_SLOT NUMBER_SLOT in 2004 .", 
    "sentence": "Trade with America has increased 1333 % since 2004 -LRB- TIFA -RRB- up t0 2.2 billion dollars from 150 million in 2004 ."

2) Multi-token locations are not collapsed into one slot, e.g. Ivory Coast, United Kingdom, Spanish Netherlands:

    "parsedSentence": "The LOCATION_SLOT LOCATION_SLOT in 1600 had NUMBER_SLOT NUMBER_SLOT ; in 1650 , NUMBER_SLOT NUMBER_SLOT .", 
    "sentence": "The Spanish Netherlands in 1600 had 1.5 million ; in 1650 , 1.9 million ."

    "parsedSentence": "While LOCATION_SLOT produces and exports heavy crude , it imports NUMBER_SLOT NUMBER_SLOT CFA francs of light crude oil -LRB- which is suitable for its refinery -RRB- from LOCATION_SLOT , LOCATION_SLOT LOCATION_SLOT , LOCATION_SLOT LOCATION_SLOT , LOCATION_SLOT , and LOCATION_SLOT .", 
    "sentence": "While Cameroon produces and exports heavy crude , it imports 73.3 billion CFA francs of light crude oil -LRB- which is suitable for its refinery -RRB- from Nigeria , Equatorial Guinea , Ivory Coast , Angola , and Italy ."

    "parsedSentence": "Migration from Saint Lucia is primarily to Anglophone countries , with the LOCATION_SLOT LOCATION_SLOT -LRB- see Saint Lucian British -RRB- having almost NUMBER_SLOT Saint Lucian-born citizens , and over NUMBER_SLOT of Saint Lucian heritage .", 
    "sentence": "Migration from Saint Lucia is primarily to Anglophone countries , with the United Kingdom -LRB- see Saint Lucian British -RRB- having almost 10,000 Saint Lucian-born citizens , and over 30,000 of Saint Lucian heritage ."

andreasvlachos commented 8 years ago

For the millions, this function is supposed to deal with it. Maybe you can reuse it?

Similarly for the issue with multi-token country names, I think this function might help.

dhruvghulati-zz commented 8 years ago

Hmm, strange. I actually do use both those functions to return the indices of the tokens for each sentence, but it doesn't seem to solve things.

Perhaps my code isn't solving the issue because your functions only look at patterns in between LOCATION_SLOT/NUMBER_SLOT, whereas I look at the whole sentence sample? Otherwise I use exactly those functions - see here.

andreasvlachos commented 8 years ago

Had a look at your code. I think there are some differences between how you use them and how I use them, e.g. here. Maybe this is what breaks it? The functions are applied to the whole sentence.

dhruvghulati-zz commented 8 years ago

I also still use the getNumbers() and getLocations() functions per sentence, and define wordsInSentence = [] and the combined sentence only after the conditions are met, e.g. len(sentence["tokens"]) < 120. I will have another think later; if you think of anything else, do let me know.

dhruvghulati-zz commented 8 years ago

I think I know where this is going wrong. You have code like this:

    if sentence["tokens"][idx+1]["word"].startswith("trillion"):
        number = number * 1000000000000
        ids.append(idx+1)

You append one more index to the token IDs in tokenIDs2number that mark which tokens to replace with a NUMBER_SLOT. Because I reuse the token IDs you have marked to fill in the NUMBER_SLOT, I naturally fill each token with its own slot instead of treating those two indices as one slot. I'm not sure how I could adapt this, maybe by adding a marker to say those two tokens should be one slot.
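One way to treat grouped indices as a single slot is to emit the marker only at the first index of each group and drop the rest. The sketch below assumes the token IDs are available as tuples of indices, one tuple per number or location; the function name `fill_slots`, the `span_to_slot` dict, and the indices in the usage example are hypothetical and not taken from the repository.

```python
def fill_slots(tokens, span_to_slot):
    """Return a copy of tokens with each span replaced by one slot marker.

    tokens: list of word strings for one sentence.
    span_to_slot: dict mapping a tuple of token indices to a marker string
        (hypothetical shape, e.g. {(14, 15): "NUMBER_SLOT"}).
    """
    first = {}    # first index of each span -> marker to emit there
    skip = set()  # remaining indices of each span, dropped from the output
    for span, slot in span_to_slot.items():
        ordered = sorted(span)
        first[ordered[0]] = slot
        skip.update(ordered[1:])

    filled = []
    for idx, word in enumerate(tokens):
        if idx in first:
            filled.append(first[idx])
        elif idx not in skip:
            filled.append(word)
    return filled


sentence = ("Trade with America has increased 1333 % since 2004 -LRB- TIFA -RRB- "
            "up t0 2.2 billion dollars from 150 million in 2004 .")
spans = {
    (2,): "LOCATION_SLOT",    # America
    (5,): "NUMBER_SLOT",      # 1333
    (14, 15): "NUMBER_SLOT",  # 2.2 billion
    (18, 19): "NUMBER_SLOT",  # 150 million
}
parsed = " ".join(fill_slots(sentence.split(), spans))
print(parsed)
```

The output is shorter than the input whenever a span covers more than one token, so indices into the filled sentence no longer line up with the original token IDs.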

andreasvlachos commented 8 years ago

You can easily replace multiple NUMBER/LOCATION_SLOT with a single one in the output, I think. But bear in mind that you would effectively be changing the token numbering, which might create issues if you are using other elements of the sentence, such as the dependency parse, which rely on the original token IDs.
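A minimal sketch of that post-processing step, assuming the parsed sentence is a list of token strings: it merges runs of the same slot marker and also returns an old-to-new index map, so that anything keyed on the original token IDs (for example dependency arcs) can be remapped. The names `collapse_slots` and `SLOTS` are hypothetical. Note that this would also merge two genuinely separate entities if they happen to be adjacent with no token between them.

```python
SLOTS = {"NUMBER_SLOT", "LOCATION_SLOT"}


def collapse_slots(tokens):
    """Merge runs of identical slot markers into one token.

    Returns (collapsed_tokens, old_to_new), where old_to_new[i] is the
    position in collapsed_tokens that original token i ended up at.
    """
    collapsed = []
    old_to_new = []
    for word in tokens:
        # Only start a new output token if this is not a repeated marker.
        if not (word in SLOTS and collapsed and collapsed[-1] == word):
            collapsed.append(word)
        old_to_new.append(len(collapsed) - 1)
    return collapsed, old_to_new


tokens = ("The LOCATION_SLOT LOCATION_SLOT in 1600 had NUMBER_SLOT NUMBER_SLOT ; "
          "in 1650 , NUMBER_SLOT NUMBER_SLOT .").split()
collapsed, old_to_new = collapse_slots(tokens)
print(" ".join(collapsed))
print(old_to_new)
```

With the map in hand, an arc between original tokens `i` and `j` becomes an arc between `old_to_new[i]` and `old_to_new[j]`; arcs whose two ends map to the same position fall inside a merged slot and can be dropped.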