gojiplus / tuber

:sweet_potato: Access YouTube from R
http://gojiplus.github.io/tuber

get_captions throws error #23

Open cschwem2er opened 7 years ago

cschwem2er commented 7 years ago

Hi,

first of all, thank you so much for creating this package. Until now I relied on the excellent YouTube Data Tools, and it's awesome to finally have a proper way of working with YouTube data in R. While your latest change, the possibility to get all comments for a video, works very well for me, I have trouble fetching captions:

captions <- list_caption_tracks(video_id = "C2p42GASnUo",  lang = "en")
captions

   videoId              lastUpdated trackKind language name audioTrackType  isCC isLarge isEasyReader isDraft
1 C2p42GASnUo 2017-05-15T19:57:13.795Z       ASR       es             unknown FALSE   FALSE        FALSE   FALSE
2 C2p42GASnUo 2017-05-10T20:31:26.261Z  standard       en             unknown FALSE   FALSE        FALSE   FALSE
3 C2p42GASnUo 2017-05-10T20:31:57.981Z  standard       es             unknown FALSE   FALSE        FALSE   FALSE
  isAutoSynced  status                                           id
1        FALSE serving gJ7PGz_Tj5zCuwb-GNgrENWFI7er-fGhLfdFQHboNkQ=
2        FALSE serving             rOkYSaba8sA5GlCHIbS8lpwv-XUSpRrX
3        FALSE serving             rOkYSaba8sCImiwNI2S1VG4VDJ2wJOHx

First, the language parameter does not seem to work as intended, as non-English results are returned as well. Second, passing one of the ids returned by list_caption_tracks to get_captions does not work:

 caps <- get_captions(id="rOkYSaba8sA5GlCHIbS8lpwv-XUSpRrX")
Error: HTTP failure: 403
> caps

Do you have any idea what's going on here?

Edit: Is it possible that I need more permissions in addition to all the YouTube and Freebase API scopes?
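In the meantime, the language filtering can be done client-side: list_caption_tracks() returns a regular data frame, so subsetting on the language column shown in the output above keeps only the English tracks. A minimal sketch, assuming the columns come back as printed:

```r
captions <- list_caption_tracks(video_id = "C2p42GASnUo")

# Keep only the English tracks, regardless of what the lang argument returned
en_captions <- captions[captions$language == "en", ]
en_captions$id  # caption-track ids to pass on to get_captions()
```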

soodoku commented 7 years ago

First, you are welcome! I want tuber to be as good as it can be, so your feedback is valuable. Thank you!

A 403 error is an 'access forbidden' error. Generally, YouTube restricts caption access to videos the requester owns, which is a bit of a shame, but it likely follows from copyright law. More on YouTube API errors here: https://developers.google.com/youtube/v3/docs/errors

Let me investigate a bit more and get back but that is my quick answer.

Changing the 'scope' of the authorization may also help.
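Until the underlying restriction changes, callers can at least fail gracefully on videos the API refuses to serve. A sketch that wraps get_captions() in tryCatch() and returns NA on the 403 case above (the helper name is mine, not part of tuber):

```r
safe_get_captions <- function(caption_id) {
  tryCatch(
    get_captions(id = caption_id),
    error = function(e) {
      # get_captions() signals HTTP failures (e.g. the 403 above) as R errors
      message("Could not fetch captions for ", caption_id, ": ", conditionMessage(e))
      NA
    }
  )
}
```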

danielcarter commented 7 years ago

EDIT 2: I tried to download the captions for about 600 videos and got captions for about 60 of them. It seems that there's an option for the video uploaders to allow third-party contributions to captions (https://stackoverflow.com/questions/30653865/downloading-captions-always-returns-a-403).

So, seems like there's no quick fix. If anybody comes up with another solution, I'd be very interested.

--

EDIT: The below worked for a while, but now I'm getting 403 errors as well. I can't tell if I'm hitting a rate limit or something else.

--

I've been working on this as well. I was able to request the captions without an auth error but got hex-encoded content back. Here's what I'm doing that seems to work reasonably well. There's still some mess in the output text, but that can be cleaned up:

captions <- get_captions(id = "____")  # the caption id returned by list_caption_tracks, not the video id

# Collapse the hex-encoded response into a single string
captions <- as.character(captions)
captions <- paste(captions, collapse = "")

# Split into two-character hex pairs and decode to text
captions <- sapply(seq(1, nchar(captions), by = 2), function(x) substr(captions, x, x + 1))
captions <- rawToChar(as.raw(strtoi(captions, 16L)))

jobreu commented 3 years ago

It seems that you can only get captions for your own videos with the get_captions function. This appears to be due to a change made by YouTube. Previously you could use the API to get captions for videos that had third-party contributions enabled (see the StackOverflow discussion linked by @danielcarter above). However, the option of enabling those has been removed by YouTube. Also see #78. As an alternative, you can use the get_caption function from the youtubecaption package (https://github.com/jooyoungseo/youtubecaption).
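For reference, a minimal youtubecaption call might look like the following; get_caption() takes a full video URL rather than a bare video id. Treat the argument names as a sketch based on the package's documentation:

```r
# install.packages("youtubecaption")
library(youtubecaption)

# get_caption() expects the full watch URL, not the bare video id
caption <- get_caption(url = "https://www.youtube.com/watch?v=C2p42GASnUo")
head(caption)  # a tibble with the caption text and timing columns
```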

soodoku commented 3 years ago

@jobreu thanks for this!

if the R package you mention, which uses a Python package under the hood, works, don't you think we can get it to work here?

jobreu commented 3 years ago

Possibly so. The youtubecaption package uses the youtube-transcript-api Python library under the hood. It does not require setting up or providing YouTube API credentials. I haven't had a chance to look at the code for the Python library, so I don't know whether it uses the YouTube API (and if so, with what credentials) or whether it employs a screen-scraping approach.
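For anyone who wants to try the Python library directly from R, it can be called via reticulate. This is a hedged sketch, assuming youtube-transcript-api is installed in the active Python environment and that its get_transcript() entry point is unchanged:

```r
library(reticulate)

# Assumes: pip install youtube-transcript-api
yta <- import("youtube_transcript_api")
transcript <- yta$YouTubeTranscriptApi$get_transcript("C2p42GASnUo")

# Each element is a list with text, start, and duration fields
transcript[[1]]$text
```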

soodoku commented 3 years ago

thanks for this @jobreu! Looks like the Python package is scraping the captions? That's the kind of thing that Google plausibly doesn't endorse.

jobreu commented 3 years ago

Yeah, it appears to scrape the captions. It's working for now, but that may change if the package gets more (or too) popular or if the YouTube website changes.

soodoku commented 3 years ago

Thanks!

A general-purpose scraping solution is Apify. I recommend using it for scraping jobs.

@jobreu can you do a PR on the README explaining the current functionality and the alternatives, since lots of people want captions? Thank you, sir!

jobreu commented 3 years ago

Done :-) I have also taken the liberty of removing the reference to the Freebase API in the README as that one has been deprecated.