hupili / python-for-data-and-media-communication-gitbook

An open source book on Python tailored for communication students with zero background

I want to enrich my data. #101

Closed Rita0719 closed 5 years ago

Rita0719 commented 5 years ago

Troubleshooting

I want to enrich my data. It seems that I have to click into every single webpage in the first column, "work name", to find more information to support my story.

https://archiveofourown.org/media/Movies/fandoms

Besides, using the method I mentioned, I have trouble finding the kudos count.

================ Current plan: write two scrapers and define each one as a function.

The first scraper collects the work names, hyperlinks, and number of works from the https://archiveofourown.org/media/Movies/fandoms page.

Sort by number of works from high to low, keep the top 100 rows, and write them to the first CSV.
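A minimal sketch of what this first scraper could look like (the `a.tag` selector, the parenthesised work count, and the relative `href` are assumptions about the page's markup and may need adjusting):

```python
# Sketch of scraper 1: fandom name, link and work count from the fandoms index page.
# The selectors below are assumptions about the page's HTML, not verified against it.
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

BASE = 'https://archiveofourown.org'

def scrape_fandom_list(url=BASE + '/media/Movies/fandoms'):
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    rows = []
    for a in soup.select('a.tag'):                        # assumed: one <a class="tag"> per fandom
        m = re.search(r'\((\d+)\)', a.parent.get_text())  # assumed: work count in parentheses after the link
        rows.append({
            'fandom': a.get_text(strip=True),
            'link': BASE + a['href'],                     # assumed: relative href
            'works': int(m.group(1)) if m else 0,
        })
    return pd.DataFrame(rows)

df = scrape_fandom_list()
top100 = df.sort_values('works', ascending=False).head(100)
top100.to_csv('top100_fandoms.csv', index=False)
```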

The second scraper: take the hyperlink column and, inside a try block, scrape the words and hits of the first page (20 works) of each of those 100 pages.

(The second scraper's CSV has a matching problem, because one original work's name corresponds to many article titles. See the sketch below.)
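A sketch of the second scraper under the same caveat: the `li.work.blurb` and `dd.words` / `dd.kudos` / `dd.hits` class names are assumptions about the works-listing markup (if they hold, this would also answer the kudos question above). Writing one row per work and repeating the fandom name on every row keeps the one-to-many mapping in a single flat CSV:

```python
# Sketch of scraper 2: words / kudos / hits for the ~20 works on each fandom's first page.
# Class names are assumptions about the works-listing HTML.
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_first_page(fandom, url):
    rows = []
    try:
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        for blurb in soup.select('li.work.blurb'):      # assumed: one <li class="work blurb"> per work
            def stat(name):
                dd = blurb.select_one('dd.' + name)
                return dd.get_text(strip=True) if dd else None
            title_tag = blurb.select_one('h4 a')        # assumed: title link inside the blurb heading
            rows.append({
                'fandom': fandom,                       # repeat the fandom name so rows stay matched
                'title': title_tag.get_text(strip=True) if title_tag else None,
                'words': stat('words'),
                'kudos': stat('kudos'),
                'hits': stat('hits'),
            })
    except Exception as e:
        print('failed:', url, e)
    return rows

top100 = pd.read_csv('top100_fandoms.csv')
records = []
for _, r in top100.iterrows():
    records.extend(scrape_first_page(r['fandom'], r['link']))
    time.sleep(1)                                       # be gentle with the server
pd.DataFrame(records).to_csv('top100_works_page1.csv', index=False)
```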

Rita0719 commented 5 years ago

Another idea: could I switch to a different website? Would that be OK...

hupili commented 5 years ago

Please try your best to stick to the current topic. You will get something in the end.

Based on the current dataset from assignment 1, you can at least do the following:

The purpose of assignment 2 is to exercise the basic workflow from raw data to statistics/charts, and then to make a reproducible report. Given the short period of time, I suggest sticking with the topic and trying to meet the technical requirements first. Once you have a basic notebook online, we can further discuss how to enrich it tomorrow.

hupili commented 5 years ago

In the meantime, you can also kick off new scrapers, or extend the previous one to include more rows/columns. We are here to help with the detailed issues.

ChicoXYC commented 5 years ago

@Rita0719 https://github.com/ChicoXYC/exercise/blob/master/student-cases/archiveofourown_movie.ipynb I've crawled all the single-film URLs here; you can use a for loop to scrape the other information, since I don't quite understand what you are going to scrape from the detail pages. If you have further questions, please let me know.
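If the notebook saves the single-film URLs to a file, the for loop could look roughly like this; `film_urls.csv`, its column names, and the `scrape_first_page()` helper from the sketch above are assumptions, not the notebook's actual names:

```python
# Hypothetical loop over the crawled film URLs from the notebook's output.
import pandas as pd

film_urls = pd.read_csv('film_urls.csv')        # assumed columns: fandom, link
records = []
for _, row in film_urls.iterrows():
    records.extend(scrape_first_page(row['fandom'], row['link']))
pd.DataFrame(records).to_csv('films_detail.csv', index=False)
```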

Rita0719 commented 5 years ago

My current progress:

[screenshot of progress, 2018-11-24 16:22:35]

I can't get all five columns.

Rita0719 commented 5 years ago

problem solved! thanks!!