ba11b0y opened this issue 6 years ago
@invinciblycool I like the thought. I would suggest a detailed list of the components you find missing in the current scraper code; then we will assign you the work.
@invinciblycool XML format could be added.
@ashwini0529 I have added the XML response to web.py. Let me know if any corrections are needed.
@shubhodeep9 I will update the detailed list as soon as my exams are over :smile:
@ashwini0529 @shubhodeep9 Couldn't resist the excitement :smile: These are some features I have in mind which could be added:

1) Return a structured dictionary, e.g.:
```json
{
    "assets": {
        "images": [
            "link of image1 on the page",
            "link of image2 on the page"
        ],
        "videos": [
            "link to embedded video1",
            "link to embedded video2"
        ]
    },
    "content": {
        "text": "all raw text from the page",
        "html": "all html from the page"
    }
}
```
2) Or create dedicated directories for the above dictionary keys and actually save the content to the respective directory (inspired by HTTrack).
`web.scrape(url, scrape_content="images")`
returns all the links to images on the page, or saves the images locally.

Hey @invinciblycool Sounds like a great idea to start with. Go ahead. We can add more features. 🎉
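The dictionary-returning variant (option 1) could be sketched roughly as below. This is a minimal, hypothetical illustration using only the standard library's `html.parser`, operating on an already-fetched HTML string; the actual `web.py` implementation would fetch the page first, and the `scrape` name here is just borrowed from the proposed usage.

```python
# Hypothetical sketch of option 1: a scraper returning the proposed
# assets/content dictionary. Stdlib-only; fetching the page (e.g. with
# requests) is assumed to happen before this function is called.
from html.parser import HTMLParser


class _AssetParser(HTMLParser):
    """Collects image sources, embedded video sources, and raw text."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.videos = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])
        elif tag in ("video", "iframe", "embed") and "src" in attrs:
            # Embedded videos are usually iframes or <video>/<embed> tags.
            self.videos.append(attrs["src"])

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def scrape(html):
    """Return the proposed assets/content dictionary for an HTML string."""
    parser = _AssetParser()
    parser.feed(html)
    return {
        "assets": {"images": parser.images, "videos": parser.videos},
        "content": {"text": " ".join(parser.text_parts), "html": html},
    }
```

This keeps the return shape identical to the JSON proposal above, so switching the parsing backend later (e.g. to BeautifulSoup or Scrapy) would not change the public API.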
@invinciblycool Add a TO-DO with your PR, and we will keep this issue alive until we feel satisfied. So that whenever someone gets a new idea on web-scraping, they can add to that TO-DO
Also, please add a [WIP] tag in your PR message. 😄
@ashwini0529 Before I start working, could you clarify whether the function should return a response or create folders and save the content locally? Thanks. @shubhodeep9 Just confirming: should the TO-DO go with the PR or the issue?
Hey @invinciblycool you can take a look at the QR Code function. I think you can make something like that.
Probable usage, like it was for the QR code:
img = hackr.image.qrcode("https://github.com/pytorn/hackr", dest_path="/tmp/hackr_qrcode.png")
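Following that `dest_path` style, the directory-saving variant (option 2) might lay assets out on disk like the sketch below. This is purely illustrative: the `plan_local_paths` helper is a made-up name, and the actual downloading step is omitted, since the point is only the HTTrack-like directory layout per asset type.

```python
# Hypothetical sketch of option 2: save scraped assets into dedicated
# directories (dest_dir/images/, dest_dir/videos/, ...), mirroring the
# dest_path style of the QR code helper. Downloading itself is omitted.
import os
from urllib.parse import urlparse


def plan_local_paths(assets, dest_dir):
    """Map each asset URL to a local path under dest_dir/<asset_type>/."""
    plan = {}
    for asset_type, urls in assets.items():
        folder = os.path.join(dest_dir, asset_type)
        os.makedirs(folder, exist_ok=True)  # one directory per asset type
        for url in urls:
            # Fall back to a placeholder name when the URL path has no file.
            filename = os.path.basename(urlparse(url).path) or "index"
            plan[url] = os.path.join(folder, filename)
    return plan
```

A real implementation would then iterate over the plan and fetch each URL into its mapped path.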
I guess we agree on saving all the content locally, then. Will start working on it ASAP.
Hey @invinciblycool Updates?
Sorry for the delay, I will try opening a PR by this week. Happy Diwali BTW. :sparkles:
Perfect @invinciblycool Happy hacking and Happy Diwali! 😄 🎇
There hasn't been much work on the web scraping part. I am interested in working on this. Since this is going to be a generic one, here is what I have thought of so far: 1) A generic web scraper that scrapes all images, links and text. 2) Maybe use Scrapy for this.
Still a beginner, any tips or corrections?