vineetm / ell-881-2018-deep-learning

Course Materials for ELL 881 2018: Fundamentals of Deep Learning

Assignment 2 #6

Open raghavsi opened 6 years ago

raghavsi commented 6 years ago

Scrape Pepperfry to create a dataset of these categories: 2-seater-sofa, bench, book-cases, coffee-table, dining-set, queen-beds, arm-chairs, chest-drawers, garden-seating, bean-bags, king-beds.

For each category I want students to extract up to 20 items (and not fewer than 10). For each item, download more than one image (each item has multiple images in different poses) and whatever metadata is available. (Metadata may also be available in the URL.)

To do this we usually use scrapy: https://doc.scrapy.org/en/latest/intro/tutorial.html

The idea is to find a link for each category, collect the set of item links from each category page, and then parse each item page in turn.
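A minimal sketch of that category-to-item crawl, using only the Python standard library rather than scrapy. The URL markers (`/category/`, `/item/`) and the `fetch` callable are assumptions for illustration; the real site's link patterns have to be read off its pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url, must_contain):
    """Return de-duplicated links containing `must_contain`, in page order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    seen, result = set(), []
    for link in parser.links:
        if must_contain in link and link not in seen:
            seen.add(link)
            result.append(link)
    return result


def crawl(fetch, start_url, category_marker, item_marker, max_items=20):
    """Map each category URL to at most `max_items` item URLs.

    `fetch` is any callable that takes a URL and returns HTML text.
    """
    catalogue = {}
    for category_url in extract_links(fetch(start_url), start_url, category_marker):
        items = extract_links(fetch(category_url), category_url, item_marker)
        catalogue[category_url] = items[:max_items]
    return catalogue
```

Passing `fetch` in as a parameter keeps the link logic testable against saved HTML, without hitting the site on every run.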

The result should be dumped in a file structure:

category_name_dir
    item_name_dir
        item_image_1
        item_image_2
        metadata.txt
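One possible way to write an item into that layout is sketched below. The helper names `safe_name` and `save_item` are hypothetical, and the name-sanitising rule is an assumption, not something the assignment prescribes.

```python
import re
from pathlib import Path


def safe_name(name):
    """Turn an arbitrary category or item title into a directory-safe name."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", name.strip())
    return cleaned.strip("_") or "unnamed"


def save_item(root, category, item_name, images, metadata_text):
    """Write one item's images and metadata.txt under root/category/item_name.

    `images` maps a file name to the raw image bytes.
    """
    item_dir = Path(root) / safe_name(category) / safe_name(item_name)
    item_dir.mkdir(parents=True, exist_ok=True)
    for file_name, data in images.items():
        (item_dir / safe_name(file_name)).write_bytes(data)
    (item_dir / "metadata.txt").write_text(metadata_text, encoding="utf-8")
    return item_dir
```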

The deadline for this is next Wednesday, 12 Sept 2018. Please reach out to the TAs for help and submission guidelines. Post here for clarification.

VarunSrivastavaIITD commented 6 years ago

On trying out the scrapy shell on the homepage 'https://www.pepperfry.com/', i.e. "scrapy shell https://www.pepperfry.com/ --nolog" I get a 403 access denied response.

As suggested both in the scrapy docs and on SO, this is probably due to anti-scraping measures taken by the site. Can anyone please confirm whether they are facing the same issue? I assume getting around such measures wasn't the intention of the assignment.

P.S. To ensure there was nothing wrong with my scrapy setup, I tried the same on other websites too, which gave no such problems.

ankursharma-iitd commented 6 years ago

@VarunSrivastavaIITD Add the corresponding user agent, and it will work while scraping from the shell. https://stackoverflow.com/questions/48033398/unable-to-scrape-snapdeal-data-using-scrapy

Can someone please elaborate on the dataset format and the corresponding file structure?
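For anyone not using scrapy, here is a minimal standard-library sketch of sending a browser-like User-Agent header. The User-Agent string is a placeholder assumption, so substitute whatever your own browser sends. In scrapy itself the corresponding knob is the `USER_AGENT` setting, which can be passed to the shell with `-s USER_AGENT="..."`.

```python
import urllib.request

# Hypothetical browser-like identifier; any realistic value can be substituted.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/68.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Create a request carrying an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=30):
    """Download a page body as bytes (this performs a real network request)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```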

VarunSrivastavaIITD commented 6 years ago

@ankursharma-iitd Thanks a lot, it works now.

raghavsi commented 6 years ago

Please reach out to TAs also if you have issues.

PepperFry_data/
    Bench/
        Item1/
            Image1.jpg
            Image2.jpg
            ...
            metadata.txt

Metadata can be just text, or it can be structured if tags are available. If the metadata is available or stored in JSON format, that is also acceptable. BTW, there may be some metadata available for each image, e.g. image1 is "front", image2 is "back" -- that could also become useful.
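As an illustration of the JSON option, a sketch of how the per-item fields and per-image view labels could be serialised into the text written to metadata.txt. The field names (`name`, `url`, `attributes`, `images`, `view`) are assumptions, not a required schema.

```python
import json


def build_metadata(name, url, attributes, image_views):
    """Serialise item metadata as JSON text suitable for metadata.txt.

    `attributes` holds whatever key/value tags the item page exposes;
    `image_views` maps an image file name to its view label.
    """
    record = {
        "name": name,
        "url": url,
        "attributes": attributes,
        "images": [
            {"file": file_name, "view": view}
            for file_name, view in image_views.items()
        ],
    }
    return json.dumps(record, indent=2, ensure_ascii=False)
```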

abhudev commented 6 years ago

Do we have to download just two images per object, or as many as are there?

Adi-iitd commented 6 years ago

@Maxaravind @VinayKyatham could you please tell us the deadline (time and date) for this assignment?

Maxaravind commented 6 years ago

Hi all,

The deadline for submitting the assignment is 11:59 pm, 12th September. Please make your submission as a tarball/zip file named Assignment_2ELL881, and use Assignment_2_ELL881 as the subject of your email.

Thanks

utkarsh1097 commented 6 years ago

Is it mandatory to use scrapy? Can we not use bs4 or maybe selenium?

Maxaravind commented 6 years ago

@utkarsh1097 You are free to use a tool of your choice. The only restriction is that you should be able to create a Python script that does the task automatically. You are not supposed to use GUI-based tools for scraping. You have to submit the Python script for collecting the data as part of the assignment, so any Python library is OK. But we highly recommend that you go with scrapy.

anshumitts commented 6 years ago

Please give the directory structure for submission, as I am a bit confused. What do I have to submit: the Python code, the metadata, or both? Also, please specify the ID to which we have to submit the data.

Maxaravind commented 6 years ago

@anshumitts

This is the directory structure:-

PepperFry_data/              -- Top level directory
    Bench/                   -- Category
        Item1/               -- Items
            Image1.jpg
            Image2.jpg
            ...
            metadata.txt     -- one and only one should be present in each item folder
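Before submitting, a quick self-check along these lines may help. It is only a sketch: the function name `check_layout` is hypothetical, the 10 to 20 item bounds and the two-image minimum are taken from the assignment text above, and every file other than metadata.txt is assumed to be an image.

```python
from pathlib import Path


def check_layout(root, min_items=10, max_items=20, min_images=2):
    """Return a list of human-readable problems found under `root`."""
    problems = []
    categories = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for category in categories:
        items = sorted(p for p in category.iterdir() if p.is_dir())
        if not (min_items <= len(items) <= max_items):
            problems.append(
                f"{category.name}: expected {min_items}-{max_items} items, "
                f"found {len(items)}"
            )
        for item in items:
            files = [p.name for p in item.iterdir() if p.is_file()]
            if "metadata.txt" not in files:
                problems.append(f"{category.name}/{item.name}: metadata.txt is missing")
            # Assumption: everything except metadata.txt is an image file.
            image_count = sum(1 for name in files if name != "metadata.txt")
            if image_count < min_images:
                problems.append(
                    f"{category.name}/{item.name}: only {image_count} image(s)"
                )
    return problems
```

An empty list means the tree matches the structure; otherwise each entry names the offending category or item.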

And regarding the submission, we will update you soon.

anshumitts commented 6 years ago

@Maxaravind Here it is mentioned that we have to submit just one script. What is expected in metadata.txt? That is not clear here. Also, isn't the metadata for one class rather than for one item?

abhudev commented 6 years ago

@Maxaravind @VinayKyatham there is an issue in the scrapy image pipeline - it cannot download many of the images in pepperfry.com, giving the following error:

OSError: cannot identify image file <_io.BytesIO object at 0x000001E3E7A94678>

When I request the image URL and write the body of the response to an image file, the image is not recognized by Windows image viewer, but it can be viewed in VS Code.

However, when I use urllib to download the image and write the response to a file, it is readable by Windows image viewer. The problem with this approach is that it is much, much slower than downloading using the request mechanism of scrapy.

One of the images for which this issue was coming: https://ii1.pepperfry.com/media/catalog/product/l/o/494x544/lounge-chair-in-high-quality-wicker-by-ventura-lounge-chair-in-high-quality-wicker-by-ventura-ertieh.jpg

Interestingly, even when I download this image from the browser, it is not recognized by Windows image viewer (or even PIL in Python), but it opens in VS Code.

(I am thinking it could also be an issue with some images in pepperfry.com, as I have not seen this problem yet on other websites)

abhudev commented 6 years ago

Turns out the images are webp images, and the Pillow installed on my system isn't able to read them. An online tool (+Google Chrome) was able to read it as webp, so I am assuming this is not a problem.
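Since the URL ends in .jpg while the content is WebP, the file extension cannot be trusted. A small sketch for checking the real format from the leading bytes of a downloaded file follows; the function name `sniff_image_format` is hypothetical, and only three common formats are covered.

```python
def sniff_image_format(data):
    """Guess the image format from the leading bytes of a file."""
    # WebP is a RIFF container whose form type at bytes 8..11 is "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    # JPEG files start with the SOI marker FF D8 followed by another marker.
    if data[:3] == b"\xff\xd8\xff":
        return "jpeg"
    # PNG files start with a fixed 8-byte signature.
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        return "png"
    return "unknown"
```

If Pillow is needed to open such files, `PIL.features.check("webp")` should report whether the installed build includes WebP support.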