philbot9 / youtube-comment-scraper

A web client that scrapes YouTube comments
http://ytcomments.klostermann.ca
ISC License
244 stars, 65 forks

Investigate the feasibility of using the YouTube API #5

Open jchrom opened 6 years ago

jchrom commented 6 years ago

I assumed that the timestamp is based on the Unix epoch (number of seconds since 1970-01-01), only here I guessed it is in milliseconds, judging by the number of digits.

The weird thing is that when I look at the histogram of comment times, the Ghostbusters trailer appears to have most of them around 29-30 February 2017, which does not make a lot of sense to me. I would expect most of the comments to appear shortly after the video was published, or after the premiere of the movie, both of which happened much, much earlier. The trailer was apparently published in March 2016 (at least that's what the description says).

Could this be a timestamp issue?

philbot9 commented 6 years ago

Hi there,

Thanks for your report. You're right, the timestamp is in milliseconds.

Unfortunately, YouTube does not show an exact time for each comment. We only get relative times ("1 hour ago", "4 weeks ago", "1 year ago", etc.). So the timestamp is derived from that relative time.

Since this is an older video there will be a lot of comments listed as "1 year ago", which would put them at roughly the end of February. I think that explains what you're seeing in the histogram.

As far as I know there is no way to retrieve the absolute time for a comment from the YouTube website, so the current timestamp is the best solution under these circumstances.
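To illustrate why the timestamps cluster, here is a minimal Python sketch of how a relative time could be turned into a millisecond timestamp. It is not the scraper's actual code: the function name `relative_to_timestamp_ms`, the unit lengths, and the scrape time are all assumptions made for this example.

```python
import re
from datetime import datetime, timedelta, timezone

# Approximate unit lengths in seconds. Months and years are rough
# averages chosen for this sketch (assumption, not the scraper's values).
UNIT_SECONDS = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "week": 7 * 86400,
    "month": 30 * 86400,
    "year": 365 * 86400,
}

RELATIVE_TIME = re.compile(
    r"(\d+)\s+(second|minute|hour|day|week|month|year)s?\s+ago"
)


def relative_to_timestamp_ms(text, now):
    """Convert a string like '4 weeks ago' to an approximate epoch time in ms."""
    match = RELATIVE_TIME.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised relative time: {text!r}")
    amount = int(match.group(1))
    unit = match.group(2)
    approx = now - timedelta(seconds=amount * UNIT_SECONDS[unit])
    return int(approx.timestamp() * 1000)


# Hypothetical scrape time, picked only for illustration.
scrape_time = datetime(2018, 2, 28, 12, 0, tzinfo=timezone.utc)

first = relative_to_timestamp_ms("1 year ago", scrape_time)
second = relative_to_timestamp_ms("1 year ago", scrape_time)

# Every comment labelled "1 year ago" collapses onto the same instant.
print(first == second)  # True
print(datetime.fromtimestamp(first / 1000, tz=timezone.utc).date())  # 2017-02-28
```

Under these assumptions, any comment that is really between one and two years old gets the same derived timestamp, which would produce exactly the kind of spike seen in the histogram.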

jchrom commented 6 years ago

Thanks for the answer.

I wonder, could this be relevant? Especially this part:

"publishedAt": datetime,
"updatedAt": datetime
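
If these fields arrive as ISO 8601 strings (an assumption here, based on the `datetime` type in the snippet), they could be converted to the same millisecond format the scraper already outputs. A rough Python sketch, with a hypothetical helper name and made-up sample values:

```python
from datetime import datetime, timezone


def published_at_to_ms(published_at):
    """Convert an ISO 8601 string such as '2016-03-03T14:00:00Z' to epoch ms."""
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so replace it with an explicit UTC offset first.
    normalised = published_at.replace("Z", "+00:00")
    moment = datetime.fromisoformat(normalised)
    return int(moment.timestamp() * 1000)


# Made-up values for illustration, not real API output.
sample = {
    "publishedAt": "2016-03-03T14:00:00Z",
    "updatedAt": "2016-03-03T14:05:00Z",
}

published_ms = published_at_to_ms(sample["publishedAt"])
updated_ms = published_at_to_ms(sample["updatedAt"])

print(updated_ms - published_ms)  # 300000 (five minutes, in milliseconds)
```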

jchrom commented 6 years ago

@philbot9 not sure if you'll notice this with the issue closed :)

philbot9 commented 6 years ago

Thanks for this. It would be relevant if this was using the API. However, this project scrapes the comments directly from the website (hence the name youtube-comment-scraper 😜). As I said, the timestamp information is not available on the website.

The reason it's not using the YouTube API is that the quota limits were too strict for this project at the time. It seems they have since been relaxed a little, so it might be worth investigating whether the API can be a feasible alternative.

https://developers.google.com/youtube/v3/getting-started#quota

The problem really is scalability. With over 1,500 users a month, I'm not convinced it's going to be a good long-term solution. And once the quota is exceeded this project is dead in the water until the quota resets after a month. I doubt YouTube will grant an increased quota to a comment scraper.

I'll open this back up to investigate whether there's a way to "game the system" a little bit.
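
The scalability concern above can be put into rough numbers. The Python sketch below is only a back-of-the-envelope estimate: the page size, the cost per request, and the quota size are hypothetical parameters, not confirmed API limits, and the comment count is the 270k figure mentioned elsewhere in this thread.

```python
import math


def requests_needed(total_comments, page_size):
    """Number of paged requests required to fetch every comment."""
    return math.ceil(total_comments / page_size)


def videos_per_quota_period(quota_units, units_per_request, total_comments, page_size):
    """How many videos of this size fit into one quota period."""
    cost_per_video = requests_needed(total_comments, page_size) * units_per_request
    return quota_units // cost_per_video


# All values below are assumptions for illustration only.
comments_in_video = 270_000   # roughly the Ghostbusters trailer in this thread
page_size = 100               # hypothetical comments returned per request
units_per_request = 1         # hypothetical quota cost of one request
quota_units = 10_000          # hypothetical quota per reset period

print(requests_needed(comments_in_video, page_size))  # 2700
print(videos_per_quota_period(quota_units, units_per_request,
                              comments_in_video, page_size))  # 3
```

With numbers in this range, a handful of large videos would exhaust a shared quota, which is why a single API key serving every user looks fragile.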

jchrom commented 6 years ago

I concede that the project is sensibly named 😜

I can see the advantage of scraping over the API. Perhaps it would be best not to duplicate efforts, as there is already a nice API-based project called YouTube Data Tools. As far as I can tell, the data it fetches is equivalent to yours, only with a correct timestamp, and it takes a bit longer. The Ghostbusters trailer (270k+ comments including nested ones) took me around 4 hours with your tool and about 1 hour more with YTDT (these timings are not very precise).

That being said, a useful change (I think) would be to rename or drop the "timestamp" column from your output. It made me think I had data I really didn't have, and it does not seem very useful to me.

jchrom commented 6 years ago

Another issue with YouTube Data Tools is that it only downloaded ~75% of the comments, in comparison to your tool's ~99%.

d0tN3t commented 6 years ago

I really like your scraper. I managed to get both batch and multiprocess comment scraping working for a user's entire channel, though it's in Python. I did notice that when a user uploads a video that was recently live, I'm not able to scrape its comments, and I'm not sure why. I thought maybe you could help shed some light.