Complete Scraping of Gutenberg

neokd / DataStorehouse

DataStoreHouse is an open-source project that aims to create a collaborative platform for gathering and sharing a wide variety of datasets. It provides a centralised repository where individuals and organisations can contribute, discover, and collaborate on diverse datasets for various domains.

https://datash.vercel.app

MIT License

18 stars, 22 forks

Complete Scraping of Gutenberg #78

Closed. VigneshRamanathan101 closed this issue 1 year ago

VigneshRamanathan101 commented 1 year ago

Description

The Gutenberg website currently has around 72K books. The goal of this issue is to scrape all of them.

Expected Behavior

gutenberg_bibliographic_records.json should have around 72K records.

Current Behavior

Today, gutenberg_bibliographic_records.json has only around 13K records.
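
The gap between the current and expected counts can be checked with a short script. This is only a minimal sketch: it assumes the file holds a single top-level JSON array of record objects, which this issue does not confirm, and the helper names `count_records` and `progress` are hypothetical. The 72,000 target is the approximate figure quoted above, not an exact count.

```python
import json
from pathlib import Path

# Approximate figure quoted in this issue, not an exact catalogue size.
EXPECTED_TOTAL = 72_000


def count_records(path):
    """Return the number of records, assuming the file is one JSON array."""
    with Path(path).open(encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array of records")
    return len(data)


def progress(path, expected=EXPECTED_TOTAL):
    """Return (scraped, remaining) relative to the expected total."""
    scraped = count_records(path)
    return scraped, max(expected - scraped, 0)
```

Calling `progress("gutenberg_bibliographic_records.json")` would return the number of records scraped so far and roughly how many remain, provided the file really has that layout.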

Possible Solution (optional)

Run the Gutenberg scraper until all the records have been scraped.
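
Because a full run may be long, it can help to make the run resumable so that an interrupted scrape does not start over. The following is a sketch under stated assumptions, not the project's actual scraper: `fetch_record` is a hypothetical callable supplied by the caller that returns a record dict (or `None` if a book is unavailable), each record is assumed to carry an `"id"` key, and the output file is assumed to be a JSON array.

```python
import json
from pathlib import Path


def load_records(path):
    """Load previously scraped records, or an empty list if none exist yet."""
    p = Path(path)
    if not p.exists():
        return []
    with p.open(encoding="utf-8") as fh:
        return json.load(fh)


def save_records(path, records):
    """Write to a temporary file first, then swap it in, to avoid a half-written file."""
    tmp = Path(str(path) + ".tmp")
    with tmp.open("w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False)
    tmp.replace(path)


def scrape_missing(path, book_ids, fetch_record, checkpoint_every=100):
    """Fetch only the books not yet in the file; return how many were added."""
    records = load_records(path)
    done = {r["id"] for r in records}  # assumes each record has an "id" key
    added = 0
    for book_id in book_ids:
        if book_id in done:
            continue  # already scraped in an earlier run
        record = fetch_record(book_id)  # hypothetical, caller-supplied
        if record is None:
            continue  # book unavailable; skip it
        records.append(record)
        done.add(book_id)
        added += 1
        if added % checkpoint_every == 0:
            save_records(path, records)  # periodic checkpoint
    save_records(path, records)
    return added
```

Running `scrape_missing` again after an interruption skips every ID already present in the file, so only the remaining books are fetched.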

VigneshRamanathan101 commented 1 year ago

Any newcomers can take this up.

neokd commented 1 year ago

What is the issue, @vigneshRamanathan3105?

VigneshRamanathan101 commented 1 year ago

There is no issue, @neokd.

We only need to run the scraper until all the books have been scraped, which might take several hours given the huge number of books.

As of now, the latest book is John's Lily.

neokd commented 1 year ago

Thanks for your contribution, @VigneshRamanathan101. We'll try scraping more books.