hpzhao / SummaRuNNer

The PyTorch Implementation of SummaRuNNer
https://arxiv.org/pdf/1611.04230.pdf
MIT License

embedding.pkl #7

Closed jiangix01 closed 6 years ago

jiangix01 commented 6 years ago

Hi, sorry to bother you. How was embedding.pkl produced? Was it trained with word2vec? If so, what was the word2vec training data: all of train.pkl + validation.pkl + test.pkl? Thanks a lot.

hpzhao commented 6 years ago

Yep. The embedding was trained with word2vec on all the data (train, dev and test), with a minimum word count of 5. @jiangix01
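To make the effect of the minimum count concrete, here is a minimal sketch of the vocabulary cut-off, assuming the usual word2vec meaning of that setting (words seen fewer than 5 times in the whole corpus are dropped before training). The function name `build_vocab` and the toy corpus are illustrative only and are not taken from the repository's preprocessing code.

```python
from collections import Counter


def build_vocab(sentences, min_count=5):
    """Return the set of tokens that occur at least `min_count` times.

    `sentences` is an iterable of whitespace-tokenized strings, for example
    every sentence from the train, dev and test splits combined.
    """
    counts = Counter(token for sentence in sentences for token in sentence.split())
    return {token for token, count in counts.items() if count >= min_count}


# Toy corpus (hypothetical): "the" appears 7 times, "cat" and "sat" 5 times,
# "dog" and "ran" only twice, so the last two fall below the threshold.
corpus = ["the cat sat"] * 5 + ["the dog ran"] * 2
vocab = build_vocab(corpus)
```

Tokens outside this vocabulary get no embedding of their own, so a model that loads the embedding would typically need an unknown-word fallback for them.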

jiangix01 commented 6 years ago

Yeah, Thanks a lot

sai-prasanna commented 6 years ago

@hpzhao Doesn't the CNN/Daily Mail dataset have proper nouns replaced with arbitrary entity ids such as @entity1, @entity2, etc.?

In the pickled data I found some entities replaced with things like RAF1, RAF2 or Daily Mail1, with separate embeddings for Daily and Mail1. Some of them were not replaced.

Is this intended?

hpzhao commented 6 years ago

Hi. In the source data https://docs.google.com/uc?id=0B0Obe9L1qtsnSXZEd0JCenIyejg&export=download , entities have been replaced by ids, such as @entity175:F1. In my pickled data and embedding I replaced each id with its real entity name.

sai-prasanna commented 6 years ago

Oh, thanks for the clarification.

I downloaded the pkl files, and some of the entities are not replaced properly.

For example:

Original Document:


http://web.archive.org/web/20130311004512id_/http://www.dailymail.co.uk:80/news/article-2288068/Brian-Cashman-Yankees-general-manager-breaks-leg-skydiving-charity.html

by @entity0 published : 16:17 est , 4 march 2013 updated : 17:40 est , 4 march 2013 @entity2 general manager @entity1 today broke his leg and dislocated his ankle after a fall - from 12,500 feet in the air           1
he will now join several of his players on the disabled list after the misstep while skydiving for charity today            2
the incident occurred at the @entity11 near @entity12 , @entity13 , as @entity1 did his second jump with the @entity16 ’s @entity15         2
thrill : @entity1 broke his fibula and dislocated his ankle during his second skydive attempt with the @entity16 ¿ s @entity15 in @entity12 , @entity13 one for the money : @entity1 suffered the injuries on monday while skydiving at @entity11 after a successful jump earlier this morning , @entity1 reportedly found the experience so exhilarating that wanted to go up again - a big mistake            1
during the landing , his foot became caught in the ground , breaking his right fibula and dislocating the right ankle , the @entity2 said           1
@entity1 told the @entity30 ' i heard a pop in my ankle ' as he made the landing            2
he was scheduled to undergo surgery later today to fix the broken bone          1
@entity1 took the leap out of the plane to raise awareness about the @entity40          1
the @entity41 reported that the event was the first times he had attempted skydiving , but despite his injuries , it may not be the last            0
broken : @entity1 , pictured right with @entity2 manager @entity46 , will now join some of the players on the disabled list , as he says he requires surgery despite the injury , @entity1 texted reporters on the way to the hospital to say that the leap was ' an awesome experience         1
' and @entity1 is no stranger to extreme sports to benefit charity , as he has rappelled down the 22 - story @entity58 in @entity60 , @entity61 , during the holidays in the past few years         0
his injury comes about two months after it was revealed that he was leading a ' triple life ' - accused of cheating on his wife with multiple women         0
@entity67 , the mother of @entity68 - one of @entity1 's alleged mistresses - filed an explosive lawsuit in @entity71 in january , accusing @entity1 of being a ' manchild ' who conspired to have his ex-lover committed so the affair would never be revealed         0
the explosive suit also accuses @entity1 of using scare tactics to force @entity67 into helping him , his lawyer and her daughter 's therapist - who are referred to as ' the gang ' - and to turn against @entity83 with the sole purpose of discrediting her          0
for a good cause : @entity1 , pictured third from left , is no stranger to extreme sports to benefit charity , as he has rappelled down the 22 - story @entity58 in @entity60 , @entity61 , in the past few @entity90 seasons           0

@entity2 @entity92 made the jump to raise awareness for the @entity40
breaks right fibula and dislocates right ankle as his leg got *snagged* on the ground
scheduled to undergo surgery later on monday

@entity2:Yankees
@entity1:Cashman
@entity0:Thomas Durante
@entity13:Florida
@entity12:Miami
@entity11:Homestead Air Force Base
@entity16:U.S. Army
@entity61:Connecticut
@entity15:Golden Knights
@entity46:Joe Girardi
@entity83:Neathway
@entity40:Wounded Warrior Project
@entity41:YES Network
@entity68:Louise Neathway
@entity67:Meanwell
@entity30:New York Daily News
@entity58:Landmark Building
@entity71:Manhattan Supreme Court
@entity92:GM
@entity60:Stamford
@entity90:Christmas

Pickled content -


['by Thomas Durante published : 16:17 est , 4 march 2013 updated : 17:40 est , 4 march 2013 Yankees general manager Cashman today broke his leg and dislocated his ankle after a fall - from 12,500 feet in the air',
 'he will now join several of his players on the disabled list after the misstep while skydiving for charity today',
 'the incident occurred at the Cashman1 near Cashman2 , Cashman3 , as Cashman did his second jump with the Cashman6 ’s Cashman5',
 'thrill : Cashman broke his fibula and dislocated his ankle during his second skydive attempt with the Cashman6 ¿ s Cashman5 in Cashman2 , Cashman3 one for the money : Cashman suffered the injuries on monday while skydiving at Cashman1 after a successful jump earlier this morning , Cashman reportedly',
 'during the landing , his foot became caught in the ground , breaking his right fibula and dislocating the right ankle , the Yankees said',
 "Cashman told the New York Daily News ' i heard a pop in my ankle ' as he made the landing",
 'he was scheduled to undergo surgery later today to fix the broken bone',
 'Cashman took the leap out of the plane to raise awareness about the Wounded Warrior Project',
 'the YES Network reported that the event was the first times he had attempted skydiving , but despite his injuries , it may not be the last',
 "broken : Cashman , pictured right with Yankees manager Joe Girardi , will now join some of the players on the disabled list , as he says he requires surgery despite the injury , Cashman texted reporters on the way to the hospital to say that the leap was '",
 "' and Cashman is no stranger to extreme sports to benefit charity , as he has rappelled down the 22 - story Landmark Building in Stamford , Connecticut , during the holidays in the past few years",
 "his injury comes about two months after it was revealed that he was leading a ' triple life ' - accused of cheating on his wife with multiple women",
 "Meanwell , the mother of Louise Neathway - one of Cashman 's alleged mistresses - filed an explosive lawsuit in Manhattan Supreme Court in january , accusing Cashman of being a ' manchild ' who conspired to have his ex-lover committed so the affair would never be revealed",
 "the explosive suit also accuses Cashman of using scare tactics to force Meanwell into helping him , his lawyer and her daughter 's therapist - who are referred to as ' the gang ' - and to turn against Neathway with the sole purpose of discrediting her",
 'for a good cause : Cashman , pictured third from left , is no stranger to extreme sports to benefit charity , as he has rappelled down the 22 - story Landmark Building in Stamford , Connecticut , in the past few Christmas seasons']

I guess replacing @entity1 with Cashman also turned @entity12 into Cashman2.
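The suspected cause can be reproduced with a small sketch. This is only a guess at what the preprocessing does, not the repository's actual code: the function names are hypothetical, and the mapping is a subset of the entity list above. A plain substring replacement of `@entity1` also matches the prefix of `@entity11`, `@entity12` and `@entity13`, while matching whole ids with a regular expression avoids that.

```python
import re

# Subset of the entity mapping from the example document above.
mapping = {
    "@entity1": "Cashman",
    "@entity11": "Homestead Air Force Base",
    "@entity12": "Miami",
    "@entity13": "Florida",
}

sentence = (
    "the incident occurred at the @entity11 near @entity12 , @entity13 , "
    "as @entity1 did his second jump"
)


def restore_entities_naive(text, mapping):
    """Suspected buggy version: substring replacement, one id at a time.

    "@entity1" is a prefix of "@entity12", so it gets replaced inside the
    longer id and leaves the trailing digit behind.
    """
    for entity_id, name in mapping.items():
        text = text.replace(entity_id, name)
    return text


def restore_entities(text, mapping):
    """Replace whole ids only: "@entity" followed by all of its digits."""
    return re.sub(
        r"@entity\d+",
        lambda match: mapping.get(match.group(0), match.group(0)),
        text,
    )


broken = restore_entities_naive(sentence, mapping)
fixed = restore_entities(sentence, mapping)
```

With this mapping, `broken` contains `Cashman1 near Cashman2 , Cashman3`, matching the pickled content, and `fixed` contains `Homestead Air Force Base near Miami , Florida`. Ids missing from the mapping are left unchanged by `restore_entities`.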

hpzhao commented 6 years ago

Thanks a lot for the bug report! I found a bug in my preprocessing file.

sai-prasanna commented 6 years ago

@hpzhao No problem.

I have a few questions.

  1. Will entity anonymization help when transferring the model across domains?

  2. With entity anonymization, if entity ids are assigned randomly in each document, how does the embedding capture anything meaningful? Or should the embeddings for these randomized entity ids be made non-trainable?

  3. Any idea which library was used for entity anonymization and coreference resolution in https://github.com/deepmind/rc-data (the Hermann paper)?

hpzhao commented 6 years ago

Interesting! I have thought about this too. I think random entity-id embeddings may not be a good idea. We could use a single embedding shared by all named entities, or a few embeddings for different categories such as person names, places, etc.
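One way to picture this idea is to map every entity id to a category token before building the vocabulary, so that each category ends up with one embedding. The sketch below is only an illustration under assumptions: the `ENTITY_TYPES` lookup and the token names are hypothetical, and in practice the categories would have to come from something like an NER tagger. Falling back to a single default token corresponds to the "one embedding for all named entities" variant.

```python
import re

# Hypothetical lookup from entity id to category; not part of the dataset.
ENTITY_TYPES = {
    "@entity1": "PERSON",
    "@entity12": "PLACE",
}


def to_category_tokens(text, entity_types, default="ENTITY"):
    """Replace each whole entity id with a category token such as <PERSON>.

    Ids without a known category fall back to the shared <ENTITY> token.
    """
    return re.sub(
        r"@entity\d+",
        lambda match: "<" + entity_types.get(match.group(0), default) + ">",
        text,
    )


example = to_category_tokens(
    "@entity1 did his second jump near @entity12 with the @entity16",
    ENTITY_TYPES,
)
```

Here `example` becomes `<PERSON> did his second jump near <PLACE> with the <ENTITY>`, so the vocabulary holds a handful of entity tokens instead of one per document-specific id.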

hpzhao commented 6 years ago

BTW, I think the embeddings should be trained in both cases.

AlJohri commented 6 years ago

This code also produces non-anonymized versions of the dataset in case it's helpful: https://github.com/abisee/cnn-dailymail

They also have a full download of the non-anonymized data available.

hpzhao commented 6 years ago

Thanks, it helps.
