Closed: ZhaoyueSun closed this issue 5 years ago
It seems not to be `for i, link in enumerate(all_links):`.

However, when I print `all_links[:10]`, it shows:

```
u'https://gameknot.com/annotation.pl/yet-more-traxler-wilkes-barre-variation?gm=22156', u'https://gameknot.com/annotation.pl/tsscl-metros-chess-league-round-two?gm=15117', u'https://gameknot.com/annotation.pl/attacking-early-with-the-queen?gm=30258', u'https://gameknot.com/annotation.pl/alas-no-plan-for-the-piece-to-sacrifice?gm=52101', u'https://gameknot.com/annotation.pl/challenge-from-davidwh86?gm=31037', u'https://gameknot.com/annotation.pl/d-pawn-war-keeping-queens?gm=38150', u'https://gameknot.com/annotation.pl/team-match?gm=26487', u'https://gameknot.com/annotation.pl/league-division-d2?gm=52432', u'https://gameknot.com/annotation.pl/casual-game-http-www-itsyourturn-com?gm=60620', u'https://gameknot.com/annotation.pl/yahoo-kibitz?gm=12197'
```

Is this right, and what's the problem?
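For reference, the list above does look like valid annotation URLs. As a quick sanity check (a sketch of my own, not code from the repo), each link should carry a `gm=` game id that can be parsed out:

```python
from urllib.parse import urlparse, parse_qs

def game_id(link):
    """Extract the gm= game id from a GameKnot annotation URL."""
    return parse_qs(urlparse(link).query).get("gm", [None])[0]

links = [
    "https://gameknot.com/annotation.pl/yet-more-traxler-wilkes-barre-variation?gm=22156",
    "https://gameknot.com/annotation.pl/tsscl-metros-chess-league-round-two?gm=15117",
]
print([game_id(l) for l in links])  # expected: ['22156', '15117']
```

If every entry yields a numeric id, the links themselves are fine and the problem is downstream.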
Hi, Thanks for showing interest in our work. It seems we might have added some stale pickle files - we will update them in a couple of days.
Thank you for your reply and your interesting work. I'm looking forward to the update.
Hi, Apologies for the delays. Could you please check now? Thanks!
Hi, Thank you for your update. However, it seems that things are still not completely right. First, in preprocess.py, line 119:

```python
for pageNo in range(pageLength):
    if pageNo==0:
        pageObjName="./outputs/saved"+str(i)+".obj"
    else:
        pageObjName="./outputs/saved"+str(i)+"_"+str(pageNo)+".obj"
```
this code should probably be:
```python
for pageNo in range(pageLength):
    if pageNo==0:
        pageObjName="./outputs/saved"+str(i)+"_1.obj"
    else:
        pageObjName="./outputs/saved"+str(i)+"_"+str(pageNo+1)+".obj"
```
This is because the HTML file names I got in the saved_files folder all have a suffix starting from 1, such as "saved2_1.html", "saved7_1.html", "saved7_2.html", etc. With your code, the program finishes quickly and nothing is stored in the .che and .en files.
Second, with the code modified as above, it successfully parses some files. But I noticed that many file names appearing in train_links.p, valid_links.p, and test_links.p do not show up in the saved file list. Those files are simply skipped, and as a result I only got 30k pairs for the train-single setting and 50k pairs for the train-multi setting, far fewer than the 298k reported in your paper.
I'm not sure why this occurs. Could you share the dataset you downloaded via a cloud drive?
Hi, I'll look at this later tonight. Meanwhile, if you provide your email id, I can share the dataset via Google Drive. Thanks!
Also, I did a quick check, and I think the preprocess code is correct; it seems you somehow didn't get all the files. See the code here:
https://github.com/harsh19/ChessCommentaryGeneration/blob/master/Data/crawler/save_rendered_webpage.py#L69
For the default page (page num = 0), the file is saved without an '_' suffix.
The issue is probably here: https://github.com/harsh19/ChessCommentaryGeneration/blob/master/Data/crawler/run_all.py#L22 where we begin the loop with value 1 instead of 0. (This is just an artefact of how we ended up collecting our data: we first worked only with the default pages and downloaded the remaining pages later.) I have updated the code, so you should get the remaining files as well.
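The off-by-one can be seen in a two-line sketch (illustrative only, not the actual run_all.py code):

```python
page_length = 3

# Loop starting at 1, as originally written: never visits the default page.
pages_old = list(range(1, page_length))   # [1, 2]

# Loop starting at 0: covers the default page as well.
pages_new = list(range(0, page_length))   # [0, 1, 2]

print(pages_old, pages_new)
```

With the old loop, no suffix-free `saved{i}` default page is ever fetched, which is exactly why preprocess.py found nothing to parse.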
Meanwhile feel free to provide your email id to get processed dataset directly.
Thanks again. I rechecked the code and it seems that you're right. I'm confused about why I missed so many files, and some of the downloaded files are empty; I guess it might be a network problem. So I'd be very grateful if you could share the dataset with me directly. My email is: sunzhaoyue@seu.edu.cn
Emailed you the dataset. Let us know if you face any more issues. Thanks!
When I run `python preprocess.py train`, it says:

Do you mean `for i, link in enumerate(all_links):` actually? Or did I make some mistakes when I downloaded the dataset? Thank you very much.