keras-team / keras-hub

Pretrained model hub for Keras 3
Apache License 2.0

Problems of running BERT example: BookCorpus cannot be downloaded #37

Closed chenmoneygithub closed 2 years ago

chenmoneygithub commented 2 years ago

Describe the bug

Downloading BookCorpus via the [repo mentioned in the BERT instructions](https://github.com/soskek/bookcorpus/blob/master/README.md) hits an error:

HTTPError: HTTP Error 503: Service Temporarily Unavailable
Failed to open https://www.smashwords.com/books/download/459173/6/latest/0/0/imperfect-chemistry.txt

This might be transient, since a 503 status means the server is temporarily unavailable, but we need to check further.
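One way to check whether the 503 is transient is to retry the request with a growing delay. The following is a minimal sketch using only the Python standard library; `fetch_with_retry` and its parameters are hypothetical names for illustration and are not part of the soskek/bookcorpus scripts.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, retries=3, base_delay=1.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the response body, retrying only on HTTP 503.

    `opener` and `sleep` are injectable so the retry logic can be
    exercised without network access (hypothetical helper, for illustration).
    """
    for attempt in range(retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Only 503 is treated as transient. Any other status is re-raised,
            # and so is a 503 once the retry budget is spent.
            if err.code != 503 or attempt == retries:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

If the same URL still returns 503 after several spaced-out attempts, the failure is probably not transient (for example, the host may be rejecting automated downloads), and another source for the data is the safer choice.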

To Reproduce

    git clone https://github.com/soskek/bookcorpus.git
    cd bookcorpus
    python download_files.py --list url_list.jsonl --out out_txts --trash-bad-count

Expected behavior

Should be able to download the BookCorpus dataset.

mattdangerw commented 2 years ago

I would suggest using the "file by Shawn Presser" at the top of the README. That skips running the download code to recreate the dataset.

Another option would be Hugging Face Datasets, though some work would be needed to convert the data out of its format.
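As a rough idea of what that conversion could look like, here is a minimal sketch that writes records to a plain-text file, one per line. `write_text_records` is a hypothetical helper, and the assumption that each record is a dict with a `text` field is mine, not something confirmed in this thread.

```python
from pathlib import Path


def write_text_records(records, out_path, text_key="text"):
    """Write one record per line to a plain-text file; return the line count.

    `records` is any iterable of dicts. The `text_key` default is an
    assumption about the dataset schema (hypothetical helper, for illustration).
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out_path.open("w", encoding="utf-8") as f:
        for record in records:
            text = record[text_key].strip()
            if not text:
                continue  # skip empty rows rather than writing blank lines
            f.write(text + "\n")
            count += 1
    return count
```

With the Hugging Face `datasets` library, the records might come from something like `load_dataset("bookcorpus", split="train")`; the dataset name and its column layout are assumptions here and should be checked against the dataset card before use.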

I'll update the README to include a few different sources. We probably shouldn't try to list just one, as there's no official source anymore.