sgraaf / Replicate-Toronto-BookCorpus

This repository contains code to replicate the no-longer publicly available Toronto BookCorpus dataset
GNU General Public License v3.0
48 stars 12 forks

update get_book_urls.py - for URI issue on Mac #4

Closed ghost closed 4 years ago

ghost commented 4 years ago

The book download URLs written to file were missing the fully qualified domain name (FQDN). As taken from the page source on Smashwords, each link is written to file as:

/books/download/481848/6/latest/0/0/someone-to-love-me.txt

Adding the FQDN as a variable solves this.
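A minimal sketch of that kind of fix, not the actual change to get_book_urls.py. The `BASE_URL` value and the `to_absolute` helper are assumptions made for illustration; the comment above only says the domain is added as a variable.

```python
from urllib.parse import urljoin

# Assumption: the site's base URL, kept in one variable so it can be changed
# in a single place. The name BASE_URL is hypothetical.
BASE_URL = "https://www.smashwords.com"


def to_absolute(href: str, base_url: str = BASE_URL) -> str:
    """Return an absolute URL for a download link that may be relative.

    urljoin leaves links that are already absolute unchanged, so the helper
    is safe to apply to every link that was scraped.
    """
    return urljoin(base_url, href)


# The relative path quoted in the comment above.
relative = "/books/download/481848/6/latest/0/0/someone-to-love-me.txt"
absolute = to_absolute(relative)
print(absolute)
```

Using `urljoin` rather than plain string concatenation avoids doubled or missing slashes and does not prepend the domain a second time if a link already includes it.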