raghavsi opened this issue 6 years ago
On trying out the scrapy shell on the homepage 'https://www.pepperfry.com/', i.e. "scrapy shell https://www.pepperfry.com/ --nolog", I get a 403 Access Denied response.
As suggested both in the scrapy docs and on SO, this is probably due to anti-scraping measures taken by the site. Can anyone please confirm whether they are facing the same issue? I assume getting around such measures wasn't the intention of the assignment.
P.S. To ensure there was nothing wrong with my scrapy setup, I tried the same on other websites too, which gave no such problems.
@VarunSrivastavaIITD Add the corresponding user agent and it will work when scraping from the shell: https://stackoverflow.com/questions/48033398/unable-to-scrape-snapdeal-data-using-scrapy
Also, can someone please elaborate on the dataset format and the corresponding file structure?
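A minimal sketch of the user-agent fix. The UA string below is just an example browser string, not a required value; the Scrapy lines are shown as comments since they belong in a project's settings or on the command line.

```python
# Sketch: send a browser-like User-Agent so the site stops returning 403.
# The UA string below is an example, not a requirement.
UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0 Safari/537.36")

# In a Scrapy project, set it in settings.py:
#     USER_AGENT = UA
# or pass it to the shell directly:
#     scrapy shell -s USER_AGENT="Mozilla/5.0 ..." https://www.pepperfry.com/

# The same idea with the standard library, as a quick sanity check that the
# header is attached to the request:
import urllib.request

req = urllib.request.Request("https://www.pepperfry.com/",
                             headers={"User-Agent": UA})
print(req.get_header("User-agent"))
```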
@ankursharma-iitd Thanks a lot, it works now.
Please reach out to the TAs as well if you have issues.
PepperFry_data/
    Bench/
        Item1/
            Image1.jpg
            Image2.jpg
            ...
            metadata.txt
metadata can be just text, or it can be
Do we have to download just two images per object, or as many as there are?
@Maxaravind @VinayKyatham could you please tell us the deadline (time and date) for this assignment?
Hi all,
The deadline for submitting the assignment is 11:59 pm on 12th September. Please make your submission as a tarball/zip file named Assignment_2ELL881
Thanks
Is it mandatory to use scrapy? Can we not use bs4 or maybe selenium?
@utkarsh1097 You are free to choose a tool of your choice. The only restriction is that you should be able to create a Python script that does the task automatically; you are not supposed to use GUI-based tools for scraping. You have to submit the Python script for collecting the data as part of the assignment. So any Python library is OK, but we highly recommend that you go with scrapy.
Please give the directory structure for submission, as I am a bit confused. What do I have to submit? Python code, metadata, or both? Also, please specify the ID where we have to submit the data.
@anshumitts
This is the directory structure:-
PepperFry_data/          -- top-level directory
    Bench/               -- category
        Item1/           -- item
            Image1.jpg
            Image2.jpg
            ...
            metadata.txt -- one and only one should be present in each item folder
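The layout above can be created with a small helper; "Bench" and "Item1" below are the placeholder names from the structure, and the metadata line is an invented example.

```python
# Sketch: create the required folder layout for one scraped item.
from pathlib import Path

def make_item_dir(root, category, item):
    """Create <root>/PepperFry_data/<category>/<item>/ and return its Path."""
    item_dir = Path(root) / "PepperFry_data" / category / item
    item_dir.mkdir(parents=True, exist_ok=True)
    return item_dir

item_dir = make_item_dir(".", "Bench", "Item1")
# one and only one metadata.txt per item folder
(item_dir / "metadata.txt").write_text("name: example bench\n")
```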
And regarding the submission, we will update you soon.
@Maxaravind @VinayKyatham there is an issue in the scrapy image pipeline - it cannot download many of the images in pepperfry.com, giving the following error:
OSError: cannot identify image file <_io.BytesIO object at 0x000001E3E7A94678>
When I request the image URL and write the body of the response to an image file, the image is not recognized by Windows Photo Viewer, but it can be viewed in VS Code.
However, when I use urllib to download the image and write the response to a file, it is readable by Windows Photo Viewer. The problem with this approach is that it is much slower than downloading via scrapy's request mechanism.
One of the images for which this issue was coming: https://ii1.pepperfry.com/media/catalog/product/l/o/494x544/lounge-chair-in-high-quality-wicker-by-ventura-lounge-chair-in-high-quality-wicker-by-ventura-ertieh.jpg
Interestingly, even when I download this image from the browser, it is not recognized by Windows Photo Viewer (or even PIL in Python) but does open in VS Code.
(I am thinking it could also be an issue with some images on pepperfry.com, as I have not seen this problem yet on other websites.)
Turns out the images are WebP images, and the Pillow installed on my system isn't able to read them. An online tool (and Google Chrome) read the file as WebP, so I am assuming the file itself is not the problem.
Scrape pepperfry to create a dataset of these categories: 2-seater-sofa, bench, book-cases, coffee-table, dining-set, queen-beds, arm-chairs, chest-drawers, garden-seating, bean-bags, king-beds
For each category I want students to extract up to 20 items (and not fewer than 10). For each item, download more than one image (each item has multiple images in different poses) and whatever metadata is available. (Metadata may also be available in the URL.)
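As the note above says, some metadata is recoverable from the URL itself. A sketch, using the example image URL posted earlier in this thread (the `slug_from_url` helper is an invented name, not part of any library):

```python
# Sketch: recover the item-title slug embedded in a product/image URL.
from urllib.parse import urlparse

def slug_from_url(url: str) -> str:
    name = urlparse(url).path.rsplit("/", 1)[-1]   # last path segment
    return name.rsplit(".", 1)[0]                  # strip the extension

url = ("https://ii1.pepperfry.com/media/catalog/product/l/o/494x544/"
       "lounge-chair-in-high-quality-wicker-by-ventura-lounge-chair-in-"
       "high-quality-wicker-by-ventura-ertieh.jpg")
print(slug_from_url(url).replace("-", " "))
```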
To do this we usually use scrapy: https://doc.scrapy.org/en/latest/intro/tutorial.html
The idea is to find for each category a link, from each link a set of links for items and recursively parse them.
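The two-level crawl described above (category page → item links → item pages) can be sketched with only the standard library so the idea is visible outside a Scrapy project; the HTML below is a made-up stand-in for a real category page, not pepperfry's actual markup.

```python
# Sketch: extract <a href> links from a category page; each link would then
# be fetched and parsed the same way, recursively.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect every href found on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

category_html = ('<a href="/item-1.html">Bench 1</a>'
                 '<a href="/item-2.html">Bench 2</a>')
parser = LinkExtractor()
parser.feed(category_html)
print(parser.links)
```

In Scrapy itself the same pattern is usually written as `response.follow(link, callback=self.parse_item)` inside the category callback, as shown in the tutorial linked above.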
The result should be dumped in a file structure:
category_name_dir/
    item_name_dir/
        item_image_1
        item_image_2
        metadata.txt
Deadline for this is next Wednesday: 12 Sept 2018. Please reach out to the TAs for help and submission guidelines. Post here for clarification.