pgcorpus / gutenberg

Pipeline to generate the Standardized Project Gutenberg Corpus
https://zenodo.org/record/2422561
GNU General Public License v3.0

Not windows-friendly things #37

Open fontclos opened 4 years ago

fontclos commented 4 years ago

There are a couple of things that are not 'Windows friendly'. This rsync error is the first you'll hit:

  1. Subprocess calls to rsync, wget and ln. Windows needs cwRsync and wget installed and added to PATH, and mklink used instead of ln.
  2. "\" vs "/" as a hardcoded file separator. Search for "/" and replace it with os.path.sep.
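The two points above could be handled in Python roughly as follows. This is only a sketch under stated assumptions, not the repository's actual code: the helper names `missing_tools`, `build_path` and `link_or_copy` are hypothetical, and the copy fallback is just one possible way to cope with Windows refusing to create symlinks without extra privileges.

```python
import os
import shutil
from pathlib import Path


def missing_tools(tools=("rsync", "wget")):
    """Return the external tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def build_path(base_dir, *parts):
    """Join path components without hardcoding "/" or "\\"."""
    return Path(base_dir).joinpath(*parts)


def link_or_copy(src, dst):
    """Create a symlink at dst pointing to src; fall back to copying.

    Hypothetical stand-in for a subprocess call to `ln -s`. On Windows,
    os.symlink can raise OSError when the process lacks the privilege
    to create symlinks, so a plain copy is used in that case.
    """
    src = Path(src).resolve()
    dst = Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.symlink(src, dst)
    except OSError:
        shutil.copy2(src, dst)
```

Calling `missing_tools()` before starting a download would report up front whether rsync or wget still needs to be installed, rather than failing partway through a subprocess call.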

A working Windows version is a WIP - I will push it when/if I have it working. Alternatively (Win10), run it in ubuntu_on_windows.

Originally posted by @aegis1980 in https://github.com/pgcorpus/gutenberg/issues/30#issuecomment-671015671

TessDejaeghere commented 2 years ago

rsync: failed to connect to aleph.gutenberg.org (65.50.255.20): Connection refused (111)
rsync: failed to connect to aleph.gutenberg.org (2604:3200:0:3:1618:77ff:fe49:8a7): Network is unreachable (101)
rsync error: error in socket IO (code 10) at clientserver.c(127) [Receiver=3.1.3]

Hi! Any way I can fix this? I'm running it in Ubuntu on Windows. :)
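One way to narrow down an error like this is to check whether the rsync port is reachable at all from the machine, independently of the pipeline. The sketch below assumes the server listens on the default rsync daemon port 873; the helper name `can_connect` is hypothetical. It only distinguishes "reachable" from "not reachable" and does not fix anything by itself.

```python
import socket


def can_connect(host, port=873, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    Port 873 is the default rsync daemon port (assumed here).
    Connection refused, unreachable network, DNS failure and timeout
    all surface as OSError and are reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage (performs a real network connection when uncommented):
# print(can_connect("aleph.gutenberg.org"))
```

If this returns False while other sites are reachable, the problem is more likely on the server or network side than in the local Python or WSL setup.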

d-kleine commented 2 months ago

For Windows, the easiest way is to use WSL with Ubuntu and install the following packages on Ubuntu:

sudo apt-get update && \
sudo apt-get upgrade -y && \
sudo apt-get install -y python3-pip && \
sudo apt-get install -y python-is-python3 && \
sudo apt-get install -y rsync

and then run python get_data.py. This should work (it did for me, at least).