gojiplus / tuber

:sweet_potato: Access YouTube from R
http://gojiplus.github.io/tuber

get_all_comments does not return max results #52

Open Roechiiii opened 6 years ago

Roechiiii commented 6 years ago

Hi, first of all, thank you for your awesome R package for scraping YouTube comments. I am using it to analyse some comments, but I ran into the problem that not all comments are collected. I think this has been mentioned in other issues (get_comment_threads) as well, but this report concerns the function "get_all_comments". The original video has 3040 comments, but the function returns only 2335 records, so roughly 23% are lost. The bigger problem, in my opinion, is how replies are returned. For the user in the "top comment" position, the video page shows 34 replies, but the function returns only 5, so the conversation between users is lost.

comments <- get_all_comments(video_id = "zz-RpiUFY-I")

soodoku commented 6 years ago

Thanks for diagnosing this. Need to investigate why I am not getting more than 5 replies. Super weird. But confirmed that it is true. Aargh!

When youtube counts total comments, it also counts replies. So the discrepancy is very likely just driven by not fetching more than 5.
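
As a rough check of this explanation, assuming each thread yields at most 5 replies, the expected shortfall can be computed from the per-thread reply counts. This is only a sketch; `missing_replies` is a hypothetical helper and the counts are made up for illustration (the 34 echoes the thread mentioned in the report).

```r
# Sketch only: how many replies go missing if each thread is capped at `cap`.
# `reply_counts` holds the totalReplyCount of each top-level comment.
missing_replies <- function(reply_counts, cap = 5) {
  # threads at or below the cap lose nothing, hence the floor at 0
  sum(pmax(reply_counts - cap, 0))
}

# Hypothetical counts: three threads with 0, 3 and 34 replies.
missing_replies(c(0, 3, 34))
```

If the sum over all threads of a video roughly matches the gap between the count on the video page and the number of rows returned, the 5-reply cap would account for the discrepancy.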

Roechiiii commented 6 years ago

You are welcome! Yes, absolutely. I was wondering why the number of replies is always between 0 and 5. The following example on https://stackoverflow.com/questions/29692972/youtube-comment-scraper-returns-limited-results/29871427#29871427 describes a similar problem. In that example, the nextPageToken returned by the previous request is passed as the pageToken of the next request. I am not sure whether you have already implemented this in your package, but maybe it will help you.

soodoku commented 6 years ago

The function iterates over pages of results. So that isn't a problem. There may be an issue with basically getting replies of replies. Will investigate this.

rangaro commented 6 years ago

I stumbled upon the same issue. Interestingly, other tools like Webometric Analyst and YouTube Data Tools also do not return all comments, but the discrepancy is much smaller (e.g., 2,276 of 2,287 comments).

So I am curious if there will be any fix soon?

rainbowfan commented 6 years ago

Hi, I would like to follow up on the issue above and see whether there is a solution. I encountered three different scenarios:
1. As discussed above, replies are not extracted completely by the get_all_comments function.
2. I found a YouTube video (id: 49Ilvc8WiG8) that displays no comments but shows a total of 1 comment. This is not a bug in the package, but if someone can give me a hint, that would be greatly appreciated.
3. Some hidden comments are being extracted. For example, for video bK6DVXty0gQ I extracted two more comments than were visible on the video page. Does this mean the channel's author deleted those comments, or that someone reported those comments?

Thank you!

chspoerlein commented 6 years ago

Hi soodoku, thanks for your amazing work and service to the community! I found the same issue as Roechiiii, where a maximum of 5 replies per comment gets extracted. I remember a wrapper function for extracting reddit comments where one had to explicitly code that the function presses "show more". Could this be the issue here?

leftyveggie commented 6 years ago

Hi Soodoku - first of all - thank you to you and your co-contributors for making this excellent package. I think I have a possible workaround for the replies issue (up to 100 replies). If you do not unlist totalReplyCount (lines 63-66), then we can more easily identify the comments that have replies (e.g. df[!(df$totalReplyCount=="0"),]) and use your get_comments(filter = c(parent_id = x)) to get the (in most cases) complete threads.

Would that possibly be something you could do? Or is there a more important reason why the total reply count is dropped?
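
A minimal sketch of the workaround described above, assuming the thread data frame keeps `id` and `totalReplyCount` columns (the latter is the column being requested here, not something the current output is confirmed to have). `collect_replies` and `fetch_replies` are hypothetical names; `fetch_replies` stands in for a call such as get_comments(filter = c(parent_id = x)) and is passed in so the logic can be tried without API credentials.

```r
# Sketch only: fetch replies for every top-level comment that reports any.
# `threads` is assumed to be a data frame with columns `id` and
# `totalReplyCount`; `fetch_replies` is a hypothetical function that takes
# one parent comment id and returns a data frame of its replies.
collect_replies <- function(threads, fetch_replies) {
  # totalReplyCount may arrive as character, so coerce before comparing
  n <- as.integer(as.character(threads$totalReplyCount))
  has_replies <- !is.na(n) & n > 0
  parent_ids <- as.character(threads$id[has_replies])

  replies <- lapply(parent_ids, function(pid) {
    out <- fetch_replies(pid)
    if (is.null(out) || nrow(out) == 0) return(NULL)
    out$parentId <- pid  # remember which thread each reply belongs to
    out
  })

  # drop empty results; return an empty data frame if nothing was found
  replies <- Filter(Negate(is.null), replies)
  if (length(replies) == 0) return(data.frame())
  do.call(rbind, replies)
}
```

With tuber, `fetch_replies` could be something like function(pid) get_comments(filter = c(parent_id = pid)). Threads with more replies than a single request returns would still need paging.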

soodoku commented 6 years ago

Thanks for the hint @leftyveggie! Will try it over the weekend if that works for you.

leftyveggie commented 6 years ago

@soodoku I think maybe I have slightly misunderstood - perhaps it is not at this part of the code - I meant if you could change the output of the data frame to include totalReplyCount as a column! Thank you for the quick reply!

voskresenskiy commented 5 years ago

Hi! Thanks for the great package! Has the reason for not downloading all comments been identified? Is it that some replies are not retrieved? Sorry to bother you!

orientune commented 4 years ago

I tried this on different videos; here are my results. If an author has only replies (no top-level comment on the video), the behaviour depends on the number of replies. If the author has no more than 5 replies, none of them are scraped. But if the author has more than 5 replies, some of them are scraped. And if an author has both top-level comments and replies, more of their comments are scraped than in the previous case.

rangaro commented 4 years ago

On the 22nd of February 2019, we did a test run with the Dayum video (DcJFdCmN98s). The video page told us to expect 47,163 comments. YouTube Data Tools from Bernhard Rieder extracted 47,153 comments (10 missing). However, tuber extracted 44,810 comments, and Webometric Analyst from Mike Thelwall extracted 44,828.

Webometric Analyst only retrieves five follow-up comments because it does not take the comment pagination into account. The tuber results are pretty close. I think the iteration through the replies pages is not working correctly in tuber. Maybe Bernhard Rieder can be asked how he solved the problem in his tool. https://twitter.com/riederb

rangaro commented 3 years ago

Is there any update on this? The bug report is now nearly 3 years old.

balthasars commented 3 years ago

Correct me if I'm wrong but to me it appears that get_all_comments() only implements the query to the commentThreads resource: https://developers.google.com/youtube/v3/docs/commentThreads

The resource states

A commentThread resource contains information about a YouTube comment thread, which comprises a top-level comment and replies, if any exist, to that comment.

So far so good, but it then goes on to say

The commentThread resource does not necessarily (!) contain all replies to a comment, and you need to use the comments.list method if you want to retrieve all replies for a particular comment. Also note that some comments do not have replies.

So I believe additional queries to this resource need to be implemented to retrieve the replies to all comments. I don't see any other GET queries in the source code apart from those to the commentThreads resource.
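
To make the suggestion concrete, here is a rough sketch of paging through the replies of one comment. `fetch_all_replies` and `fetch_page` are hypothetical names. `fetch_page` stands in for a request to comments.list with a parentId and an optional pageToken, and is assumed to return a list with `items` and, while more pages remain, `nextPageToken`. This is not tuber's actual code.

```r
# Sketch only: follow nextPageToken until the API stops returning one.
# `fetch_page(parent_id, page_token)` is hypothetical; it is assumed to return
# list(items = <list of replies>, nextPageToken = <string or NULL>).
fetch_all_replies <- function(parent_id, fetch_page, max_pages = 1000) {
  items <- list()
  token <- NULL  # the first request carries no page token
  for (i in seq_len(max_pages)) {
    page <- fetch_page(parent_id, token)
    items <- c(items, page$items)
    token <- page$nextPageToken
    # a missing token means this was the last page
    if (is.null(token)) break
  }
  items
}
```

The `max_pages` guard is only there so a misbehaving fetcher cannot loop forever. Running this for every top-level comment with a non-zero reply count would cost one or more extra requests per thread.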

So this is just a wild guess (to be completely honest, I don't fully understand the code of process_page() yet), but could this be the issue here?

hamaer0214 commented 2 years ago

I ran into the same problem today. I use the YouTube API to get all comments, but I only get 0-5 replies per comment at most. And I didn't find any clue on the YouTube API page.