hupili / python-for-data-and-media-communication-gitbook

An open source book on Python tailored for communication students with zero background

Assignment 1-Data of different kinds pop up all at once in rows #70

Closed Wong-Ming closed 6 years ago

Wong-Ming commented 6 years ago

Troubleshooting

Different types of data (names, types, numbers of reviews, etc.; the website crawled is TripAdvisor) came out as one piece of text, one item per row. This stops further attempts at ranking, which needs rows whose columns hold the different attributes in the CSV file. Irrelevant text from other parts of the site was also included. This is probably caused by wrong coding/filtering in the previous steps, but I couldn't work out which parts to include.

Describe your environment

Describe your question

Where to look for the exact tags needed? How to ensure that data from different sections is written in columns instead of rows?

The minimum code (snippet) to reproduce the issue

Example:

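from bs4 import BeautifulSoup
import requests

url = 'https://en.tripadvisor.com.hk/Restaurants-g294217-Hong_Kong.html'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly
# Prints the text of every link on the page, so unrelated text shows up too
for text in soup.find_all('a'):
    print(text.get_text())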

Describe the efforts you have spent on this issue

Googled similar attempts to exclude the irrelevant URLs and hrefs when getting the text, but could not exclude the other text that is not needed.

ChicoXYC commented 6 years ago

@Wong-Ming Please find all elements/restaurants first; every element includes all the info of one restaurant. Then you can use a for loop to get the attributes you want. The following is for your consideration.

from bs4 import BeautifulSoup
import requests

url = 'https://en.tripadvisor.com.hk/Restaurants-g294217-Hong_Kong.html'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly
restaurants = []
# Find all elements first; every element includes all the info of one restaurant
for element in soup.find_all('div', attrs={'class': 'ui_columns is-mobile'}):
    title = element.find('div', attrs={'class': 'title'})
    if title is None:  # skip layout blocks that are not restaurant listings
        continue
    restaurant = {}
    restaurant['title'] = title.text.strip()
    # Match on 'ui_bubble_rating' alone; adding 'bubble_45' would match only 4.5 ratings
    bubble = element.find('span', attrs={'class': 'ui_bubble_rating'})
    restaurant['rating'] = bubble.get('alt') if bubble is not None else None
#     restaurant['review'] = ...
#     restaurant['popIndex'] =
    restaurants.append(restaurant)
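
To get each attribute into its own column of the CSV file, a list of dictionaries like `restaurants` can be written with `csv.DictWriter` from the standard library. This is only a minimal sketch: it assumes every dictionary has the same keys, and the sample rows, the `write_restaurants` helper and the `restaurants.csv` file name are made up for illustration.

```python
import csv

# Hypothetical sample rows standing in for the scraped `restaurants` list
restaurants = [
    {'title': 'Restaurant A', 'rating': '4.5 of 5 bubbles'},
    {'title': 'Restaurant B', 'rating': '4.0 of 5 bubbles'},
]


def write_restaurants(rows, path):
    """Write one restaurant per row and one attribute per column."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'rating'])
        writer.writeheader()    # header line with the column names
        writer.writerows(rows)  # one line per restaurant


write_restaurants(restaurants, 'restaurants.csv')
```

Once the data is stored as one dictionary per restaurant, sorting or ranking by a column becomes a matter of sorting the list before writing it.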
hupili commented 6 years ago

Also refer to the item-first approach: https://github.com/hupili/python-for-data-and-media-communication-gitbook/blob/master/notes-week-05.md#item-first-vs-attribute-first
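
The difference between the item-first and attribute-first approaches can be seen on a small example. The sketch below runs on a made-up HTML fragment in which one restaurant has no rating; the `listing`, `title` and `rating` class names are hypothetical, not TripAdvisor's real markup. Collecting each attribute separately gives lists of different lengths that no longer line up, while looping over items keeps each restaurant's fields together.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second restaurant has no rating
html = """
<div class="listing"><div class="title">Restaurant A</div><span class="rating" alt="4.5"></span></div>
<div class="listing"><div class="title">Restaurant B</div></div>
<div class="listing"><div class="title">Restaurant C</div><span class="rating" alt="3.0"></span></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Attribute-first: one list per attribute; the lists can fall out of step
titles = [t.text for t in soup.find_all('div', class_='title')]
ratings = [s['alt'] for s in soup.find_all('span', class_='rating')]

# Item-first: one dictionary per restaurant; a missing field becomes None
items = []
for listing in soup.find_all('div', class_='listing'):
    rating = listing.find('span', class_='rating')
    items.append({
        'title': listing.find('div', class_='title').text,
        'rating': rating['alt'] if rating is not None else None,
    })

print(len(titles), len(ratings))  # the two lists differ in length
print(items[1])                   # Restaurant B keeps its own row
```

With the attribute-first lists, pairing `titles` with `ratings` by position would give Restaurant B the rating that belongs to Restaurant C, which is the kind of misalignment the item-first loop avoids.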