moghya / allitebooks

Scrapes www.allitebooks.com and indexes all the books available.
http://moghya.me/allitebooks
GNU General Public License v3.0

write scrapers for some other websites #4

Open moghya opened 7 years ago

moghya commented 7 years ago

The following websites can be scraped:

  1. http://bookboon.com
cLupus commented 7 years ago

Such as?

moghya commented 7 years ago

@cLupus Thanks for showing interest in this project. I hope you visited http://moghya.me/allitebooks and saw what we're trying to do here.

You can go through http://bookboon.com and try to write a scraper for it.

I'll add many such websites soon. Let me know if you're going to do it, and I'll assign this issue to you :)

cLupus commented 7 years ago

I got to take a look at the site, as well as at your repo. Am I correct in understanding that this issue is concerned with creating a scraper that produces a file similar to data.py?

moghya commented 7 years ago

Yes, you're correct. It's just that we dump the dictionary to JSON and then process that JSON.
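A minimal sketch of that dump-and-process flow (the field names below are illustrative assumptions; the actual keys used in data.py may differ):

```python
import json

# Hypothetical book entry: the real keys in data.py may be different.
books = [
    {
        "title": "Example Book",
        "author": "Jane Doe",
        "url": "http://bookboon.com/en/example-book-ebook",
        "description": "A sample entry showing the expected shape.",
    },
]

# The scraper dumps the dictionary to JSON ...
with open("books.json", "w") as f:
    json.dump(books, f, indent=2)

# ... and the indexing side later loads and processes that JSON.
with open("books.json") as f:
    loaded = json.load(f)

print(loaded[0]["title"])  # -> Example Book
```

A new scraper for another site would only need to produce the same JSON shape for the indexer to pick it up.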

cLupus commented 7 years ago

That does sound interesting. I assume the descriptions should be in English. However, the site does offer some additional languages, although not all the descriptions have been translated into them. Is there any plan for localization (or, at the very least, to grab what's there in the different languages)?

moghya commented 7 years ago

Honestly, I didn't think of it. But as you have rightly raised it, we have to think about it. What do you propose?

cLupus commented 7 years ago

On closer inspection, it seems that only the site itself has been translated, not the titles or the descriptions, so it would not add much value (in the first run, anyway).

moghya commented 7 years ago

Let's make it work for English, and we'll come up with a solution in the near future.

cLupus commented 7 years ago

Another issue is that http://bookboon.com 'locks' its books behind a dropdown and does not offer direct links to them. There are some ways to alleviate this:

  1. Download the zip files and host them (somewhere) behind a direct link.
  2. Do some trickery with the cookies that are sent along with the request.
  3. Something else?
moghya commented 7 years ago

Downloading the zips is one option, but intercepting the request that downloads the book may solve our problem. Think of it this way: the scraper won't follow bookboon's flow; it'll work a step ahead. We can work out what exactly happens after the details are filled in, and instead of filling in the form ourselves, send the request to download the PDF directly.
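A rough sketch of that idea, assuming the form submits a POST to a `/download` path with an `email` field — both are guesses for illustration, not confirmed; the real endpoint and fields would have to be read from the browser's network tab:

```python
from urllib import parse, request


def build_download_request(book_url: str, email: str) -> request.Request:
    """Build the POST a browser would send after filling in the form.

    The '/download' suffix and the 'email' field are assumptions made
    for illustration; verify them against the live site before relying
    on this.
    """
    data = parse.urlencode({"email": email}).encode()
    return request.Request(
        book_url.rstrip("/") + "/download", data=data, method="POST"
    )


req = build_download_request(
    "http://bookboon.com/en/some-book-ebook", "user@example.com"
)
# The request is only built here, not sent; a scraper would send it with
# urllib.request.urlopen(req) and write the PDF bytes from the response.
print(req.method, req.full_url)
```

This skips the form entirely: once we know what the form submits, the scraper can replay that request on its own.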

EmilLuta commented 7 years ago

Hi there, ladies and gentlemen. What's the status on this issue? @moghya, mind if I hop in? Also, shouldn't the first page be a bit more descriptive? I.e., the vast majority of web pages state somewhere on the homepage what the site is and what it does, rather than leaving that down in the code.

Let me know what you think!

moghya commented 7 years ago

@EmilLuta maybe you can contribute by working on #3.