microsoft / DialoGPT

Large-scale pretraining for dialogue
MIT License

Problem for downloading data of reddit #4

Open SparkJiao opened 4 years ago

SparkJiao commented 4 years ago

Hi, many thanks for your contribution!

I tried to use python demo.py --data full to download the Reddit data. Since I don't want to train the model right now, I didn't use Docker. I found that the data is linked here: https://convaisharables.blob.core.windows.net/lsp/keys-full.tar but I can't open that link, even with a proxy. Do you have another link to the Reddit data?

Sorry to bother you. Thank you very much!

intersun commented 4 years ago

I just checked, and the link works on my side. Can you double-check it?

SparkJiao commented 4 years ago

@intersun Hi, thanks for your reply. Indeed, the link works and I was able to download keys-full.tar. But I have run into other problems.

  1. I think the path where keys-full.tar is saved is wrong. In the makefile, it is saved under ./reddit_extractor/, but the make command looks for it under ./reddit_extractor/data/.
  2. I moved keys-full.tar to the directory ./reddit_extractor/data/, commented out the wget command, re-ran demo.py, and got the following error report. Is this because keys-full.tar was damaged during the download, or is there another reason?
    
    11/05/2019 22:20:46 - INFO - __main__ -   Downloading and Extracting Data...
    make: *** [data/reddit/RC_2011-02.bz2] Error 4
    make: *** Waiting for unfinished jobs....
    11/06/2019 01:46:10 - INFO - __main__ -   Preparing Data...
    prepro.py --corpus ./data/train.tsv --max_seq_len 128
    11/06/2019 01:48:21 - INFO - __main__ -   Done!

    11/06/2019 01:48:21 - INFO - __main__ -   Generating training CMD!


Also, the file `./data/train.tsv` doesn't exist.
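To rule out a damaged download, the archive can be checked before re-running demo.py. The following is only a minimal sketch and not part of the DialoGPT repository: `archive_is_readable` is a hypothetical helper that uses Python's standard `tarfile` module, streams every regular file in the archive, and compares the number of bytes read with the size recorded in each header. It cannot detect an archive that was cut off exactly at a member boundary.

```python
import tarfile


def archive_is_readable(path, chunk_size=1 << 20):
    """Return True if every regular file in the tar archive can be read in full.

    Hypothetical helper for illustration; not part of the DialoGPT repository.
    """
    try:
        with tarfile.open(path) as tar:  # mode "r" auto-detects compression
            for member in tar:
                if not member.isfile():
                    continue  # directories, links, etc. carry no data to check
                stream = tar.extractfile(member)
                read = 0
                while True:
                    chunk = stream.read(chunk_size)
                    if not chunk:
                        break
                    read += len(chunk)
                if read != member.size:
                    return False  # data ended before the size in the header
        return True
    except (tarfile.TarError, EOFError, OSError):
        # Not a tar file, truncated data, or the file could not be opened.
        return False
```

If `archive_is_readable("reddit_extractor/data/keys-full.tar")` returns False, a bad download is the more likely cause. If it returns True, the archive itself is probably fine and the failure lies elsewhere.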

Thank you very much for your help!

kinoc commented 4 years ago

I had a similar problem, but it appeared to make progress after I re-cloned the repository. I think the process does not like running "--data full" after "--data small".

createmomo commented 3 years ago

I have the same problem (the RC_2011-02.bz2 error), even though I am using the latest version of the repository. Did you solve it?
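A note on reading the log: in `make: *** [target] Error N`, N is the exit status of the command that failed while building that target. The sketch below pulls the failed targets out of a make log and, assuming the recipe for the target is a wget call (the thread mentions a wget command in the makefile, but does not confirm it is used for this target), looks the status up in the exit codes listed in the GNU Wget manual. The names `WGET_EXIT_STATUS` and `failed_targets` are hypothetical and not part of the repository.

```python
import re

# Exit statuses as listed in the GNU Wget manual ("Exit Status" section).
WGET_EXIT_STATUS = {
    0: "no problems occurred",
    1: "generic error",
    2: "parse error",
    3: "file I/O error",
    4: "network failure",
    5: "SSL verification failure",
    6: "username/password authentication failure",
    7: "protocol error",
    8: "server issued an error response",
}

# Matches both "make: *** [target] Error 4" and the newer
# "make: *** [Makefile:12: target] Error 4" form.
MAKE_ERROR = re.compile(
    r"^make(?:\[\d+\])?: \*\*\* "
    r"\[(?:[^\]:]+:\d+: )?(?P<target>[^\]]+)\] Error (?P<code>\d+)$"
)


def failed_targets(log_text):
    """Return (target, exit_status) pairs for each failed recipe in a make log."""
    failures = []
    for line in log_text.splitlines():
        match = MAKE_ERROR.match(line.strip())
        if match:
            failures.append((match.group("target"), int(match.group("code"))))
    return failures
```

Under that assumption, the `Error 4` on `data/reddit/RC_2011-02.bz2` would correspond to a network failure while fetching that file, rather than a problem with keys-full.tar.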