Closed dhruvghulati-zz closed 8 years ago
For the millions, this function is supposed to deal with it. Maybe you can reuse it?
Similarly for the issue with multi-token country names, I think this function might help.
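As a rough sketch of what such a magnitude-handling function might look like (the name `get_scaled_number`, the `MAGNITUDES` table, and the exact token layout are my assumptions, based on the `sentence["tokens"][idx]["word"]` access pattern quoted later in this thread):

```python
# Hypothetical sketch of a magnitude-aware number extractor, assuming
# sentences are stored as {"tokens": [{"word": ...}, ...]} as in the
# snippets discussed in this thread.
MAGNITUDES = {"thousand": 10**3, "million": 10**6,
              "billion": 10**9, "trillion": 10**12}

def get_scaled_number(tokens, idx):
    """Return (value, token_ids) for the number starting at idx."""
    value = float(tokens[idx]["word"].replace(",", ""))
    ids = [idx]
    # If the next token is a magnitude word, fold it into the value
    # and record its index so it can share the same NUMBER_SLOT.
    if idx + 1 < len(tokens):
        word = tokens[idx + 1]["word"].lower()
        if word in MAGNITUDES:
            value *= MAGNITUDES[word]
            ids.append(idx + 1)
    return value, ids
```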
Hmm, strange. I actually do use both those functions to return the indices of the tokens for each sentence, but it doesn't seem to solve things. Perhaps my code isn't solving the issue because your functions only look at patterns in between LOCATION_SLOT/NUMBER_SLOT, whereas I look at the whole sentence sample? Otherwise I do use exactly those functions - see here.
Had a look at your code. I see there are some differences in how you use the functions compared to how I use them, e.g. here. Maybe this is what breaks it? The functions are applied to the whole sentence.
I still also use the `getNumbers()` and `getLocations()` functions per sentence, and define `wordsInSentence = []` and the combined sentence only after the conditions are met, e.g. `len(sentence["tokens"]) < 120`.
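A hedged sketch of that per-sentence flow, assuming `getNumbers()`/`getLocations()` return `(value, token_ids)` pairs (their real signatures may differ; only the `len(sentence["tokens"]) < 120` condition comes from the actual code):

```python
# Hypothetical sketch of the per-sentence slot-filling flow described
# above. getNumbers/getLocations are stand-ins returning lists of
# (value, token_ids) pairs; the real functions may look different.
def fill_slots(sentence, getNumbers, getLocations):
    if not len(sentence["tokens"]) < 120:  # condition mentioned above
        return None
    number_ids = {i for _, ids in getNumbers(sentence) for i in ids}
    location_ids = {i for _, ids in getLocations(sentence) for i in ids}
    wordsInSentence = []
    for idx, token in enumerate(sentence["tokens"]):
        if idx in number_ids:
            wordsInSentence.append("NUMBER_SLOT")
        elif idx in location_ids:
            wordsInSentence.append("LOCATION_SLOT")
        else:
            wordsInSentence.append(token["word"])
    return " ".join(wordsInSentence)
```

Note that filling every marked token id produces one slot per token, which reproduces exactly the repeated-slot behaviour discussed in this thread.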
Will have another think later, if you think of anything else do let me know.
I think I know where this is going wrong. You have code like this:
```python
if sentence["tokens"][idx+1]["word"].startswith("trillion"):
    number = number * 1000000000000
    ids.append(idx+1)
```
You append one more index to the `tokenIDs2number` tuple of which token to replace with a NUMBER_SLOT. Because I am reusing the token ids you have marked to fill in the NUMBER_SLOT, naturally I fill each slot separately instead of treating those two indices as one slot. Not sure how I could adapt your code to put a marker saying those two tokens should count as one slot.
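One way to mark that adjacent token ids belong to one slot might be to group consecutive indices before substitution. A minimal sketch (assuming the ids are plain integer token indices, as in the snippet above):

```python
def group_consecutive(ids):
    """Group sorted token ids into runs of consecutive indices, so that
    e.g. the ids for "5" and "trillion" become one group = one slot."""
    groups = []
    for i in sorted(ids):
        if groups and i == groups[-1][-1] + 1:
            groups[-1].append(i)  # extend the current run
        else:
            groups.append([i])    # start a new run / new slot
    return groups
```

Each group could then be replaced by a single NUMBER_SLOT rather than one slot per token id.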
You can easily replace multiple NUMBER/LOCATION_SLOT occurrences with a single one in the output, I think. But bear in mind that you would effectively be changing the token numbering, which might create issues if you are using other elements of the sentence, such as the dependency parse, that rely on the original token ids.
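Collapsing repeated slots in the final output string could be done with a simple regex pass, for instance (a sketch only, and subject to the token-renumbering caveat above):

```python
import re

def collapse_slots(text):
    # Replace runs of the same slot token with a single occurrence.
    # Caveat: this changes the token count, so dependency-parse ids
    # computed on the original tokens will no longer line up.
    text = re.sub(r"NUMBER_SLOT(\s+NUMBER_SLOT)+", "NUMBER_SLOT", text)
    text = re.sub(r"LOCATION_SLOT(\s+LOCATION_SLOT)+", "LOCATION_SLOT", text)
    return text
```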
When creating a new version of my parsed sentences with NUMBER_SLOT and LOCATION_SLOT filled in, I see two main issues:
1) Not dealing with millions and billions, e.g.
2) Not dealing with multiple locations, collapsing them to one, e.g. Ivory Coast, United Kingdom, Spanish Netherlands: