Closed: ZhaoyueSun closed this issue 5 years ago
It seems not to be `for i, link in enumerate(all_links):`.

However, when I print `all_links[:10]`, it shows:

```
u'https://gameknot.com/annotation.pl/yet-more-traxler-wilkes-barre-variation?gm=22156', u'https://gameknot.com/annotation.pl/tsscl-metros-chess-league-round-two?gm=15117', u'https://gameknot.com/annotation.pl/attacking-early-with-the-queen?gm=30258', u'https://gameknot.com/annotation.pl/alas-no-plan-for-the-piece-to-sacrifice?gm=52101', u'https://gameknot.com/annotation.pl/challenge-from-davidwh86?gm=31037', u'https://gameknot.com/annotation.pl/d-pawn-war-keeping-queens?gm=38150', u'https://gameknot.com/annotation.pl/team-match?gm=26487', u'https://gameknot.com/annotation.pl/league-division-d2?gm=52432', u'https://gameknot.com/annotation.pl/casual-game-http-www-itsyourturn-com?gm=60620', u'https://gameknot.com/annotation.pl/yahoo-kibitz?gm=12197'
```

Is this right, and what's the problem?
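For reference, the list above does look like valid annotation URLs. As a quick sanity check (a sketch of my own, not code from the repo), each link should carry a `gm=` game id that can be parsed out:

```python
from urllib.parse import urlparse, parse_qs

def game_id(link):
    """Extract the gm= game id from a GameKnot annotation URL."""
    return parse_qs(urlparse(link).query).get("gm", [None])[0]

links = [
    "https://gameknot.com/annotation.pl/yet-more-traxler-wilkes-barre-variation?gm=22156",
    "https://gameknot.com/annotation.pl/tsscl-metros-chess-league-round-two?gm=15117",
]
print([game_id(l) for l in links])  # expected: ['22156', '15117']
```

If every entry yields a numeric id, the links themselves are fine and the problem is downstream.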
Hi, Thanks for showing interest in our work. It seems we might have added some stale pickle files - we will update them in a couple of days.
Thank you for your reply and your interesting work. I'm looking forward to the update.
Hi, Apologies for the delays. Could you please check now? Thanks!
Hi, Thank you for your update. However, it seems that things are still not completely right. First, in preprocess.py, line 119:

```python
for pageNo in range(pageLength):
    if pageNo==0:
        pageObjName="./outputs/saved"+str(i)+".obj"
    else:
        pageObjName="./outputs/saved"+str(i)+"_"+str(pageNo)+".obj"
```
this code should probably be:
```python
for pageNo in range(pageLength):
    if pageNo==0:
        pageObjName="./outputs/saved"+str(i)+"_1.obj"
    else:
        pageObjName="./outputs/saved"+str(i)+"_"+str(pageNo+1)+".obj"
```
This is because the HTML file names I got in the saved_files folder all have a suffix starting from 1, such as "saved2_1.html", "saved7_1.html", "saved7_2.html", etc. With your code, the program finishes quickly and nothing is stored in the .che and .en files.
Second, with the code modified as above, it successfully parses some files. But I noticed that many file names appearing in train_links.p, valid_links.p, and test_links.p do not show up in the saved file list. Those files are simply skipped, and as a result I only got 30k pairs for the train-single setting and 50k pairs for the train-multi setting, far fewer than the 298k reported in your paper.
I'm not sure why this occurs. Could you share the dataset you downloaded via a cloud drive?
Hi, I'll look at this later tonight. Meanwhile, if you provide your email id, I can share the dataset via Google Drive. Thanks!
Also, I did a quick check, and I think the preprocess code is correct; it seems you somehow didn't get all the files. See the code here:
https://github.com/harsh19/ChessCommentaryGeneration/blob/master/Data/crawler/save_rendered_webpage.py#L69
For the default page (page num = 0), the file is saved without an '_' suffix.
The issue is probably here: https://github.com/harsh19/ChessCommentaryGeneration/blob/master/Data/crawler/run_all.py#L22 where we begin the loop with value 1 instead of 0. (This is just an artefact of how we ended up collecting our data: we first worked only with the default pages and downloaded the remaining pages later.) I have updated the code, so you should get the remaining files as well.
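The off-by-one can be seen in a two-line sketch (illustrative only, not the actual run_all.py code):

```python
page_length = 3

# Loop starting at 1, as originally written: never visits the default page.
pages_old = list(range(1, page_length))   # [1, 2]

# Loop starting at 0: covers the default page as well.
pages_new = list(range(0, page_length))   # [0, 1, 2]

print(pages_old, pages_new)
```

With the old loop, no suffix-free `saved{i}` default page is ever fetched, which is exactly why preprocess.py found nothing to parse.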
Meanwhile feel free to provide your email id to get processed dataset directly.
Thanks again. I rechecked the code and it seems that you're right. I'm confused about why I missed so many files, and some of the downloaded files are empty; I guess it might be a network problem. So I'd be very grateful if you could share the dataset with me directly. My email is: sunzhaoyue@seu.edu.cn
Emailed you the dataset. Let us know if you face any more issues. Thanks!
When I run `python preprocess.py train`, it says:

Do you mean `for i, link in enumerate(all_links):` actually? Or did I make some mistakes when I downloaded the dataset? Thank you very much.