Closed Wong-Ming closed 6 years ago
@Wong-Ming Please find all the restaurant elements first; each element contains all the info for one restaurant. Then you can use a for loop to get the attributes you want. The following is for your consideration.

    from bs4 import BeautifulSoup
    import requests

    url = 'https://en.tripadvisor.com.hk/Restaurants-g294217-Hong_Kong.html'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly to avoid a bs4 warning

    restaurants = []
    # Find all elements first; every element includes all the info of one restaurant.
    for element in soup.find_all('div', attrs={'class': 'ui_columns is-mobile'}):
        restaurant = {}
        restaurant['title'] = element.find('div', attrs={'class': 'title'}).text.strip()
        # Match any bubble rating, not only 'bubble_45' (4.5), and tolerate a missing one.
        bubble = element.find('span', class_='ui_bubble_rating')
        restaurant['rating'] = bubble.get('alt') if bubble else None
        # restaurant['review'] = ...
        # restaurant['popIndex'] = ...
        restaurants.append(restaurant)

Also refer to the item-first approach: https://github.com/hupili/python-for-data-and-media-communication-gitbook/blob/master/notes-week-05.md#item-first-vs-attribute-first
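Once each restaurant is one dict, getting the attributes into columns is mostly a matter of how the CSV is written. The following is a minimal sketch, assuming the records look like the dicts built in the loop above; the sample values, the `write_rows` helper, and the `restaurants.csv` file name are made up for illustration. It uses `csv.DictWriter` from the standard library, so each dict becomes one row and each key becomes one column.

```python
import csv
import io

# Hypothetical sample records, shaped like the dicts built in the loop above.
restaurants = [
    {'title': 'Restaurant A', 'rating': '4.5 of 5 bubbles'},
    {'title': 'Restaurant B', 'rating': '4.0 of 5 bubbles', 'review': '120 reviews'},
]

def write_rows(records, fieldnames, out):
    """Write one row per record and one column per field name."""
    # restval fills in the cell when a record lacks one of the fields.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(records)

fieldnames = ['title', 'rating', 'review']

# Write to an in-memory buffer here so the result can be inspected directly.
buffer = io.StringIO()
write_rows(restaurants, fieldnames, buffer)
print(buffer.getvalue())

# To write a real file instead (file name is an assumption):
# with open('restaurants.csv', 'w', newline='', encoding='utf-8') as f:
#     write_rows(restaurants, fieldnames, f)
```

Because every row is built from one element, a missing attribute only leaves one cell empty instead of shifting the other values into the wrong column.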
Troubleshooting
Different types of data (names, types, numbers of reviews, etc.; the crawled website is TripAdvisor) came out as one piece of text, in rows. This stopped further attempts at ranking, which needs rows whose columns show the different specs in the CSV file. Also, irrelevant text that belongs to other parts of the site was included. This is probably caused by wrong coding/filtering in the previous steps, but I couldn't determine which parts to include.
Describe your environment
Describe your question
Where should I look for the exact tags needed? How can I ensure that data from different sections is written in columns instead of rows?
The minimum code (snippet) to reproduce the issue
Describe the efforts you have spent on this issue
Googled similar attempts to exclude the irrelevant URLs and hrefs when getting the text, but I still cannot exclude the other text that is not needed.
Example:
Have you searched Google / Stack Overflow for anything?
Do they solve or partially solve your question?
What is the closest answer you can find?