ba11b0y opened this issue 6 years ago
@invinciblycool I like the thought. I would suggest a detailed list of the components you find missing in the current scraper code; then we will assign you the work.
@invinciblycool XML format could be added.
@ashwini0529 I have added the XML response to web.py. Let me know if any corrections are needed.
@shubhodeep9 I will update the detailed list as soon as my exams are over :smile:
@ashwini0529 @shubhodeep9 Couldn't resist the excitement :smile: These are some features I have in mind which could be added:

1) Return a structured dictionary, e.g.:
```json
{
    "assets": {
        "images": [
            "link of image1 on the page",
            "link of image2 on the page"
        ],
        "videos": [
            "link to embedded video1",
            "link to embedded video2"
        ]
    },
    "content": {
        "text": "all raw text from the page",
        "html": "all html from the page"
    }
}
```
2) Or create dedicated directories for the above dictionary keys and actually save the content to the respective directory (inspired by HTTrack).
`web.scrape(url, scrape_content="images")`
returns all the links to images on the page, or saves the images locally.

Hey @invinciblycool Sounds like a great idea to start with. Go ahead. We can add more features. 🎉
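The dictionary-returning variant (option 1) could be sketched roughly as below. This is a minimal, hypothetical illustration using only the standard library's `html.parser`, operating on an already-fetched HTML string; the actual `web.py` implementation would fetch the page first, and the `scrape` name here is just borrowed from the proposed usage.

```python
# Hypothetical sketch of option 1: a scraper returning the proposed
# assets/content dictionary. Stdlib-only; fetching the page (e.g. with
# requests) is assumed to happen before this function is called.
from html.parser import HTMLParser


class _AssetParser(HTMLParser):
    """Collects image sources, embedded video sources, and raw text."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.videos = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])
        elif tag in ("video", "iframe", "embed") and "src" in attrs:
            # Embedded videos are usually iframes or <video>/<embed> tags.
            self.videos.append(attrs["src"])

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def scrape(html):
    """Return the proposed assets/content dictionary for an HTML string."""
    parser = _AssetParser()
    parser.feed(html)
    return {
        "assets": {"images": parser.images, "videos": parser.videos},
        "content": {"text": " ".join(parser.text_parts), "html": html},
    }
```

This keeps the return shape identical to the JSON proposal above, so switching the parsing backend later (e.g. to BeautifulSoup or Scrapy) would not change the public API.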
@invinciblycool Add a TO-DO with your PR, and we will keep this issue alive until we feel satisfied. So that whenever someone gets a new idea on web-scraping, they can add to that TO-DO
Also, please add a [WIP] tag in your PR message. 😄
@ashwini0529 Before I start working, could you clarify whether the function should return a response or create folders and save the content locally? Thanks. @shubhodeep9 Just confirming: should the TO-DO go with the PR or the issue?
Hey @invinciblycool you can take a look at the QR Code function. I think you can make something like that.
Probable usage, like it was for the QR code:
img = hackr.image.qrcode("https://github.com/pytorn/hackr", dest_path="/tmp/hackr_qrcode.png")
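Following that `dest_path` style, the directory-saving variant (option 2) might lay assets out on disk like the sketch below. This is purely illustrative: the `plan_local_paths` helper is a made-up name, and the actual downloading step is omitted, since the point is only the HTTrack-like directory layout per asset type.

```python
# Hypothetical sketch of option 2: save scraped assets into dedicated
# directories (dest_dir/images/, dest_dir/videos/, ...), mirroring the
# dest_path style of the QR code helper. Downloading itself is omitted.
import os
from urllib.parse import urlparse


def plan_local_paths(assets, dest_dir):
    """Map each asset URL to a local path under dest_dir/<asset_type>/."""
    plan = {}
    for asset_type, urls in assets.items():
        folder = os.path.join(dest_dir, asset_type)
        os.makedirs(folder, exist_ok=True)  # one directory per asset type
        for url in urls:
            # Fall back to a placeholder name when the URL path has no file.
            filename = os.path.basename(urlparse(url).path) or "index"
            plan[url] = os.path.join(folder, filename)
    return plan
```

A real implementation would then iterate over the plan and fetch each URL into its mapped path.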
I guess we agree on saving all the content locally, then. Will start working on it ASAP.
Hey @invinciblycool Updates?
Sorry for the delay, I will try opening a PR by this week. Happy Diwali BTW. :sparkles:
Perfect @invinciblycool Happy hacking and Happy Diwali! 😄 🎇
There hasn't been much work on the web scraping part. I am interested in working on this. Since this is going to be a generic one, here is what I have thought of so far: 1) A generic web scraper that scrapes all images, links and text. 2) Maybe use Scrapy for this.
Still a beginner, any tips or corrections?