hupili / python-for-data-and-media-communication-gitbook

An open source book on Python tailored for communication students with zero background

Feedback & Questions in Week5 #31

Closed ConnorLi96 closed 6 years ago

ConnorLi96 commented 6 years ago

I have to admit that this chapter is more difficult, but it is also useful: it helps me build logical thinking about scrapers and how to implement one in practice, step by step. My feedback and questions are listed below in chapter order.

1. In Get data:

(1) You seem to have lost the underscore in my_title = myh1.text. Without the underscore '_', this code will raise a NameError.

(2) In the same code, I cannot understand what Type(myh1) means. Maybe we could change the h1 lookup to h2, such as my_h1 = data.find('h2'), which outputs '話癆特朗普'. This might help others understand the goal: using tags and attributes to extract the data we want directly.

2. In Get author try 2

(1) How do we determine the tag_name, such as 'tr'? I'm wondering what the rules are, because you use 'a' as the tag_name in the later function _scrape_articles_urls_of_onepage.

(2) This code looks like a dict: attrs={'class':"post__authors"}. Why use this format? Could you explain it in more detail, or is it just a syntax rule?

Thanks for all your work and help; it's meaningful!

ChicoXYC commented 6 years ago

@ConnorLi96 Thanks so much. Here are explanations for the questions you asked.

  1. Added an explanation of type; see 8177122 for details.

(1) All the data on a page is stored inside HTML tags. Tag names are chosen by the website's creators, and most tags come in opening and closing pairs. So all we need to do is find the tags that contain the data we want. For example, article titles are usually in h1, and body text is usually in p. You can find those tags with Chrome DevTools, which we talked about at the beginning of the chapter.
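As a minimal sketch of finding data by tag name, the snippet below parses a small made-up HTML string (an assumption for illustration, not the real article page) and pulls out the h1 and p contents. It also prints type(my_h1), which shows that find returns a bs4 Tag object rather than a plain string.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for illustration (not the page used in the chapter).
html = """
<html><body>
  <h1>Sample title</h1>
  <h2>Sample subtitle</h2>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""

data = BeautifulSoup(html, "html.parser")

my_h1 = data.find("h1")      # first <h1> tag in the document
my_title = my_h1.text        # the text inside that tag
# find_all returns every matching tag, so we can collect all paragraphs
paragraphs = [p.text for p in data.find_all("p")]

print(type(my_h1))   # a bs4 Tag object, which is why .text is available
print(my_title)
print(paragraphs)
```

Swapping "h1" for "h2" in the find call would return the subtitle tag instead, which is the same idea as the h2 suggestion in the question.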

(2) For example:

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

In this a tag, there are attributes such as class and id. These attributes distinguish this tag from other similar tags, which matters most when an HTML page contains many tags of the same kind. So, if you want to locate something precisely, you can match on those attributes by writing xxx.find('tag_name',attrs={'attributes':'values'}). For a detailed example, please see ae07a51.
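Here is a small sketch of the attrs pattern. The three a tags follow the "sisters" example from the bs4 documentation, and the div with class post__authors and its author name are hypothetical, added only to mirror the class name from the question.

```python
from bs4 import BeautifulSoup

# The <a> tags follow the bs4 documentation example; the <div> is hypothetical.
html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
<div class="post__authors"><a href="/author/hypothetical">Hypothetical Author</a></div>
"""

data = BeautifulSoup(html, "html.parser")

# attrs is an ordinary Python dict: {attribute name: attribute value}
link3 = data.find("a", attrs={"id": "link3"})            # exactly one tag has this id
sisters = data.find_all("a", attrs={"class": "sister"})  # all tags sharing this class

# Narrow down to the author block first, then look for the link inside it
author_box = data.find("div", attrs={"class": "post__authors"})
author = author_box.find("a").text

print(link3.text, link3["href"])
print(len(sisters))
print(author)
```

So the dict format is not special syntax: attrs is a parameter of find that takes a regular dict, where each key is an attribute name and each value is what that attribute should equal.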

For more on the syntax of find and similar functions, please see the bs4 documentation.