Another thought: what if I switch to a different website? Would that be OK…
Please try your best to stick to the current topic. You will get something out of it in the end.
Based on the current dataset from assignment 1, you can at least do the following:
The purpose of assignment 2 is to exercise the basic workflow from raw data to statistics/charts, and then to make a reproducible report. Given the short time frame, I suggest sticking with the topic and trying to meet the technical requirements first. Once you have a basic notebook online, we can discuss tomorrow how to enrich it further.
In the meantime, you can also kick off new scrapers, or extend the previous one to include more rows/columns. We can help with the detailed issues.
@Rita0719 https://github.com/ChicoXYC/exercise/blob/master/student-cases/archiveofourown_movie.ipynb I've crawled all the single-film URLs here; you can use a for loop to scrape the other information, since I don't quite understand exactly what you plan to scrape from the detail pages. If you have further questions, please let me know.
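As a rough illustration of the for-loop approach (the `film_urls` list and the fields pulled from each page are placeholders, not taken from the notebook):

```python
import time
import requests
from bs4 import BeautifulSoup

# Hypothetical: in the notebook, film_urls would be the list of single-film
# URLs crawled earlier; one placeholder entry is shown here.
film_urls = ["https://archiveofourown.org/media/Movies/fandoms"]

rows = []
for url in film_urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # Pull whatever fields you need from each detail page; the page <title>
    # is used here only as a stand-in.
    rows.append({"url": url, "title": soup.title.get_text(strip=True)})
    time.sleep(1)  # pause between requests to be polite to the server
```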
My current progress:
I can't find all five columns.
Problem solved! Thanks!!
Troubleshooting
I want to enrich my data. It seems that I have to click into every single webpage linked from the first column, "work name", to find more information to support my story.
https://archiveofourown.org/media/Movies/fandoms
Besides, using the method I mentioned, I have trouble finding kudos.
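One likely cause of the missing kudos: works with zero kudos may omit that stat from the listing entirely, so a scraper that indexes the stats blindly will crash or misalign. A minimal sketch, assuming AO3's works listings mark each blurb as `li.work` with stats in `<dd>` tags classed by name (the fandom URL below is a placeholder):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL: substitute a real works-listing link taken from the
# hyperlink column of the first scraper.
url = "https://archiveofourown.org/tags/SOME%20FANDOM/works"

soup = BeautifulSoup(requests.get(url).text, "html.parser")
for blurb in soup.select("li.work"):
    kudos_tag = blurb.select_one("dd.kudos")
    # Works with zero kudos have no <dd class="kudos"> at all, so guard
    # against None instead of letting the scraper fail mid-loop.
    kudos = kudos_tag.get_text(strip=True) if kudos_tag else "0"
    print(kudos)
```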
================ Current plan: write two scrapers, wrapping each one in a function (a sketch follows at the end of this plan).
The first scraper collects the fandom name, its hyperlink, and its number of works from the page https://archiveofourown.org/media/Movies/fandoms.
Sort by number of works from high to low, take the top 100 rows, and save them as the first CSV.
The second scraper: extract the hyperlink column and, with try/except, scrape the words and hits of the first page (20 works) of each of those 100 pages.
(The second scraper's CSV has a mapping problem, because one fandom name has to correspond to many work titles.)
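A minimal sketch of the two-scraper plan above. The CSS selectors, the parenthesized work-count format, and the output file names are my assumptions about AO3's markup, not anything confirmed in this thread:

```python
import re
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

BASE = "https://archiveofourown.org"

def scrape_fandom_index():
    """Scraper 1: fandom name, hyperlink, and work count from the Movies index.

    Assumption: each fandom is an <a class="tag"> whose trailing text holds
    the work count in parentheses, e.g. "Some Fandom (1234)".
    """
    html = requests.get(BASE + "/media/Movies/fandoms").text
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for a in soup.select("a.tag"):
        match = re.search(r"\(([\d,]+)\)", str(a.next_sibling or ""))
        if match:
            rows.append({
                "fandom": a.get_text(strip=True),
                "link": BASE + a["href"],
                "works": int(match.group(1).replace(",", "")),
            })
    top100 = (pd.DataFrame(rows)
                .sort_values("works", ascending=False)
                .head(100))
    top100.to_csv("fandom_top100.csv", index=False)
    return top100

def scrape_first_pages(top100):
    """Scraper 2: words and hits of the first page (20 works) per fandom."""
    records = []
    for _, row in top100.iterrows():
        try:
            html = requests.get(row["link"]).text
        except requests.RequestException:
            continue  # skip fandoms whose page fails to load
        soup = BeautifulSoup(html, "html.parser")
        for blurb in soup.select("li.work"):
            words = blurb.select_one("dd.words")
            hits = blurb.select_one("dd.hits")
            records.append({
                "fandom": row["fandom"],
                "words": words.get_text(strip=True) if words else None,
                "hits": hits.get_text(strip=True) if hits else None,
            })
        time.sleep(1)  # pause between fandom pages
    pd.DataFrame(records).to_csv("works_first_page.csv", index=False)

scrape_first_pages(scrape_fandom_index())
```

Repeating the fandom name on every row of the second CSV is one straightforward answer to the mapping problem noted above: the result is a long-format table where each of the 20 works keeps a column identifying which fandom it came from.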