harvardnlp / data2text


Possible bug in extractor.lua #3

Closed: Harold-Lee-PKU closed this issue 6 years ago

Harold-Lee-PKU commented 6 years ago

1. When I ran extractor.lua, it told me it can't find "MarginalNLLCriterion". I assume that it should be "onmt.modules.MarginalNLLCriterion"?

2. There is a problem with loading the pre-trained model. It says: "extractor.lua:549: bad argument #1 to 'copy' (sizes do not match)". I think it's because the downloaded model's parameter sizes are inconsistent with the parameters specified in extractor.lua?

swiseman commented 6 years ago
  1. Yes, sorry about that; that's the same MarginalNLLCriterion in onmt.modules; it may be easiest to just symlink to it.

  2. Hmm, I just reran and didn't get a problem. Someone emailed me with a similar issue, but they just needed to pull again. Assuming that's not the issue for you, what are the two incompatible sizes you're getting?

Harold-Lee-PKU commented 6 years ago

Thank you for your quick response! As for the second issue, I printed out the "mod" created at line 546; it is:

    nn.Sequential {
      [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> output]
      (1): nn.ParallelTable {
        input
          |-> (1): nn.LookupTable
          |-> (2): nn.LookupTable
           -> (3): nn.LookupTable
           ... -> output
      }
      (2): nn.JoinTable
      (3): nn.ConcatTable {
        input
          |-> (1): nn.Sequential {
          |     [input -> (1) -> (2) -> (3) -> output]
          |     (1): cudnn.TemporalConvolution
          |     (2): nn.ReLU
          |     (3): nn.Max
          |   }
          |-> (2): nn.Sequential {
          |     [input -> (1) -> (2) -> (3) -> output]
          |     (1): cudnn.TemporalConvolution
          |     (2): nn.ReLU
          |     (3): nn.Max
          |   }
           -> (3): nn.Sequential {
                [input -> (1) -> (2) -> (3) -> output]
                (1): cudnn.TemporalConvolution
                (2): nn.ReLU
                (3): nn.Max
              }
           ... -> output
      }
      (4): nn.JoinTable
      (5): nn.Dropout(0.500000)
      (6): nn.Linear(600 -> 500)
      (7): nn.ReLU
      (8): nn.Dropout(0.500000)
      (9): nn.Linear(500 -> 33)
      (10): nn.SoftMax
    }

But I can't see the parameter sizes of "saved_p". Is there a way to print those out? Sorry, I'm not very familiar with Torch.

swiseman commented 6 years ago

After line 547 (in the current extractor.lua) can you add:

print("num conv params:", p:size(1))

and after line 548 (in the current extractor.lua) can you add:

print("num loaded params:", saved_p:size(1))

That will tell us how big your conv model is and how big the parameter vector you're trying to load is.

wanghechong commented 6 years ago

Hi swiseman, I am sorry for emailing you; I didn't know the answer was here. I hit the same mistake as above, and it happens twice: when building the conv model and when building the LSTM model. The size of the conv model we build is 2133533 while the downloaded model's size is 2141733, and the size of the LSTM model we build is 4166133 while the downloaded model's size is 4174333. The difference is the same in both cases, 8200, so I suspect some argument must be wrong, but I don't know where. I have tried many times.

Harold-Lee-PKU commented 6 years ago

Sorry for the delay. I edited the code as you said. I also commented out the "copy" line and added two print lines in the LSTM loading part. Here is what I got:

    num conv params: 2133533
    num loaded 2141733
    num conv params: 2133533
    num loaded 2141733
    num conv params: 2133533
    num loaded 2141733
    num lstm params: 4166133
    num lstm loaded: 4174333
    num lstm params: 4166133
    num lstm loaded: 4174333
    num lstm params: 4166133
    num lstm loaded: 4174333

Is the code I am using out-of-date or are the models I downloaded corrupted? Thank you!

swiseman commented 6 years ago

Ok thanks, both, for the info. I get 2141733 for both "num conv params" and "num loaded," so at least the models you've downloaded are correct. I'm beginning to think maybe something is different in your hdf5 files. Can you also put

print(vocab_sizes)

after line 146 in extractor.lua?

wanghechong commented 6 years ago

I also printed it; it is {1: 4892, 2: 183, 3: 192}. The number of labels is 33, but I see the paper says it is 39. After checking roto-ie.labels, its contents are:

    {PLAYER-FG_PCT 1 TEAM-PTS_QTR2 2 TEAM-TOV 3 TEAM-FT_PCT 4 PLAYER-FG3M 5 PLAYER-TO 6 NONE 7 TEAM-PTS 8
     PLAYER-BLK 9 TEAM-REB 10 TEAM-PTS_QTR3 11 PLAYER-AST 12 TEAM-LOSSES 13 PLAYER-FGM 14 PLAYER-OREB 15 PLAYER-PF 16
     PLAYER-REB 17 TEAM-PTS_QTR1 18 TEAM-AST 19 TEAM-FG3_PCT 20 PLAYER-MIN 21 PLAYER-DREB 22 TEAM-PTS_QTR4 23 PLAYER-FG3_PCT 24
     PLAYER-FTA 25 PLAYER-FGA 26 PLAYER-PTS 27 TEAM-WINS 28 PLAYER-FG3A 29 PLAYER-STL 30 TEAM-FG_PCT 31 PLAYER-FT_PCT 32 PLAYER-FTM 33}

Comparing the bs_key and ls_key, there are 6 labels missing, and if I manually change the number of labels to 39, the model we build has 2136539 parameters. The h5py version I use is 2.7.1, and nltk is 3.2.5.

Harold-Lee-PKU commented 6 years ago

I also got the same number. Maybe the number of labels is supposed to be less than 39? I think some labels are not used when creating the triples in preprocess.lua, but I'm not very sure.

swiseman commented 6 years ago

The issue seems to be that I have more words in my vocab (I get 4933 instead of 4892). I'm not sure what is causing this discrepancy yet but I'll post back when I figure it out.

Harold-Lee-PKU commented 6 years ago

Could it be because in "box_preprocess.lua", at line 483, when creating the vocabulary, we only used jsondat_train rather than the whole dataset?

wanghechong commented 6 years ago

I don't think the mistake is in box_preprocess.lua; I think it is in data_utils.py. I added prints to the save_full_sent_data function in data_utils.py and found that len(datasets[0]) = 28842. Before we delete words whose frequency is less than 2, len(word_counter) = 7536; after we delete them, the number is 4891 (plus one, 4891 becomes 4892). So I checked the get_datasets function and the append_candidate_rels function, but I have failed to find the mistake so far.
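
For reference, here is a minimal sketch of the counting I mean (this is not the actual data_utils.py code; the function name and the single reserved extra index are just my illustration):

    from collections import Counter

    def count_vocab(sentences, min_count=2):
        # count whitespace-separated tokens over all extractor sentences
        word_counter = Counter(tok for sent in sentences for tok in sent.split())
        # keep words seen at least min_count times, then add one reserved
        # index (e.g., UNK), which matches the 4891 -> 4892 step above
        kept = [w for w, c in word_counter.items() if c >= min_count]
        return len(word_counter), len(kept) + 1

This mirrors the 7536 before pruning and 4891 + 1 = 4892 after pruning from the prints above.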

swiseman commented 6 years ago

Ok, very sorry about all of this; the issue was that the tokenization in data_utils.py (to prepare sentences for the extractor) was supposed to only split on whitespace, whereas it was calling nltk's word_tokenize. (I suspect this was erroneously merged in from a different version of the code). This gives a different vocabulary size, and that's why the parameter sizes didn't match. In any case, I think if you pull now and recreate the hdf5 files you'll be all set. Sorry about all the confusion, and let me know if you end up running into any more issues.
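
To see why the two tokenizations give different vocabularies, here is a toy example (not a sentence from the dataset; it just shows the typical behavior of each scheme, and running it requires nltk's punkt data):

    from nltk import word_tokenize  # needs nltk's 'punkt' tokenizer models

    sent = "Miami's lead grew to 10-2 (45.5 percent)."

    print(sent.split())
    # ["Miami's", 'lead', 'grew', 'to', '10-2', '(45.5', 'percent).']

    print(word_tokenize(sent))
    # ['Miami', "'s", 'lead', 'grew', 'to', '10-2', '(', '45.5', 'percent', ')', '.']

Tokens like "Miami's" and "(45.5" only exist under whitespace splitting, so the two schemes build different vocabularies, and therefore differently sized embedding tables.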

Harold-Lee-PKU commented 6 years ago

Thank you for your patience! I actually got another error. It's an index-out-of-range error in extractor.lua at line 444. I was using output from my own model rather than the provided pre-trained model. So I am wondering if there is any requirement regarding the format of the "roto_cc-beam5_gens.txt". Should the generated summary be formatted in a certain way? The output file I used has been attached. output_by_model.txt

swiseman commented 6 years ago

@wanghechong : The recent changes should not have changed anything regarding the ignoreIdx, and if the ignoreIdx is different, the pretrained models won't work. Can you check and try again? (I recently reran from scratch and everything seems ok).

@Harold-Lee-PKU : Yes, good catch; the information extractor uses distances between entities and numbers in each sentence as features, and I didn't clamp distances extracted from new, generated text to ensure they are no bigger than those seen at training time (as they were in your attached predictions). I just pushed a fix, and I was able to run on your attached predictions (though note that there seems to be a missing newline at the end of the file).
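
In case it helps to see the idea, the gist of the fix is just to cap the distance features; a rough sketch only (the real change is in the Lua extractor code, and the signed handling and names here are assumptions):

    def clamp_dist(dist, max_train_dist):
        # cap a signed entity-number distance at the largest magnitude seen
        # at training time, so distances from newly generated text can't
        # index past the trained distance embeddings
        return max(-max_train_dist, min(dist, max_train_dist))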

@wanghechong and @Harold-Lee-PKU : since this thread seems to be diverging in topic now, I'm going to close this issue. If you continue to run into problems, please either email me (swiseman at seas dot harvard dot edu) or open a new issue on this repo.