lil-lab / newsroom

Tools for downloading and analyzing summaries and evaluating summarization systems. https://summari.es/

failed to scrape #23

Closed · FYYFU closed 4 years ago

FYYFU commented 4 years ago

Hi, I tried to build the Newsroom dataset from scratch. After running `newsroom-scrape --thin thin/dev.jsonl.gz --archive dev.archive`, the `dev.archive` file never appeared. The command took 10 days to run and left nothing behind. Is there something wrong with scraping from scratch?
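For reference, the thin file is gzipped JSON lines (one record per line), so a quick way to see how many pages a run will attempt before committing to a multi-day scrape is just to count lines — a sketch, assuming standard gzip tooling:

```
zcat thin/dev.jsonl.gz | wc -l   # number of records the scraper will try to fetch
```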

And the final result I get is:

108800 pages need re-downloading later.
108810 pages need re-downloading later.
108820 pages need re-downloading later.
108830 pages need re-downloading later.

Rerun the script: 108837 pages failed to download.
- Try running with a lower --workers count (default = 16).
- Check which URLs are left with the --diff flag.
- Last resort: --exactness X to truncate dates to X digits.
  (e.g., --exactness 4 will download the closest year.)

Downloading Summaries: 100%|██████████| 108837/108837 [249:09:03<00:00,  8.24s/it]
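The hints printed by the script suggest a retry sequence along these lines — a sketch reusing the same paths as the original command; the worker count of 4 is just an example value, and the flag meanings are as printed in the output above:

```
# Retry with fewer parallel workers (default is 16)
newsroom-scrape --thin thin/dev.jsonl.gz --archive dev.archive --workers 4

# Check which URLs are still missing from the archive
newsroom-scrape --thin thin/dev.jsonl.gz --archive dev.archive --diff

# Last resort: truncate dates to 4 digits, i.e. match the closest year
newsroom-scrape --thin thin/dev.jsonl.gz --archive dev.archive --exactness 4
```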
yoavartzi commented 4 years ago

I am not sure what the issue is, and I am not sure the person who wrote this code is available to debug. I recommend requesting the dataset through our form: https://cornell.qualtrics.com/jfe/form/SV_6YA3HQ2p75XH4IR

FYYFU commented 4 years ago

Thank you very much. I have downloaded the dataset and got the information I needed from the Newsroom website.