Closed metrast closed 1 year ago
Possibly related to https://github.com/TeamNewPipe/NewPipe/issues/9223
Relevant invidious issue as well as pull request. The problem is due to a new continuation token format as well as a new JSON response format. They claim arbitrary paging is not possible with the new ctoken format, but I think it might be.
Not being able to access old channel videos is getting annoying. So I played with the continuation tokens a bit and this is where I'm at:
def channel_ctoken_v4(channel_id, page, sort, tab, view=1):
    pointless_nest = proto.string(80226972,
      proto.string(2, channel_id) +
      proto.string(3,
        proto.percent_b64encode(
          proto.string(110,
            proto.string(3,
              proto.string(15,
                proto.string(1,
                  proto.string(1,
                    proto.unpadded_b64encode(
                      proto.string(1,
                        proto.unpadded_b64encode(
                          proto.string(2,
                            b"ST:" +
                            proto.unpadded_b64encode(
                              proto.string(2, "150")  # window start
                            )
                          )
                        )
                      ) +
                      proto.string(2,
                        # some checksum which has to match
                        # so that page offset is accepted?
                        proto.uint(1, 13421533910578046448)) +
                      proto.uint(5, 50) +  # window size?
                      proto.uint(6, 6) +  # page of 30
                      proto.uint(7, 180) +  # offset in steps of 30 inside window
                      proto.string(8,
                        # seconds since epoch initial
                        # request, static across successive requests
                        proto.uint(1, 1676803892) +
                        # unknown but static across successive tokens
                        proto.uint(2, 197625613)) +
                      proto.uint(9, 6) +  # page of 30
                      proto.uint(10, 150)  # window start
                    )
                  ) +
                  proto.string(2, "63faaff0-0000-23fe-80f0-582429d11c38")  # targetId
                ) +
                proto.uint(3, 1)  # 1 - newest, 2 - popular
              )
            )
          )
        )
      )
    )
    return base64.urlsafe_b64encode(pointless_nest).decode('ascii')
It seems a sliding window was added on top of the page concept. Merits and behaviour unclear. The page offset has to lie inside the sliding window and the checksum in field 2 has to match for it to be accepted. targetId can be harvested from the initial request response. tab and view info seem gone.
Almost all of that is optional and I am able to arbitrarily retrieve "pages" of 30 videos by just moving the sliding window start offset in that funky ST: string like so:
    pointless_nest = proto.string(80226972,
      proto.string(2, channel_id) +
      proto.string(3,
        proto.percent_b64encode(
          proto.string(110,
            proto.string(3,
              proto.string(15,
                proto.string(1,
                  proto.string(1,
                    proto.unpadded_b64encode(
                      proto.string(1,
                        proto.unpadded_b64encode(
                          proto.string(2,
                            b"ST:" +
                            proto.unpadded_b64encode(
                              # get 30 videos starting with number 4 (counting starts at 0)
                              proto.string(2, "3")
                            )
                          )
                        )
                      )
                    )
                  ) +
                  proto.string(2, "63faaff0-0000-23fe-80f0-582429d11c38")  # targetId
                ) +
                proto.uint(3, 1)  # 1 - newest, 2 - popular
              )
            )
          )
        )
      )
    )
    return base64.urlsafe_b64encode(pointless_nest).decode('ascii')
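For readers without youtube-local's proto module at hand, the nesting above can be reproduced with plain protobuf wire encoding. The following is a minimal sketch: the helper names mirror the calls used above, but their bodies are an assumption about what those helpers do (varint and length-delimited fields, URL-safe base64 with the padding stripped or percent-escaped), not a copy of the project's code. `window_start_token` and `minimal_ctoken` are hypothetical names, and whether YouTube accepts the resulting token is not something this sketch can guarantee.

```python
import base64


def varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def uint(field: int, value: int) -> bytes:
    """Varint field (wire type 0)."""
    return varint(field << 3) + varint(value)


def string(field: int, data) -> bytes:
    """Length-delimited field (wire type 2); str input is encoded as UTF-8."""
    if isinstance(data, str):
        data = data.encode("utf-8")
    return varint((field << 3) | 2) + varint(len(data)) + data


def unpadded_b64encode(data: bytes) -> bytes:
    # Assumed behaviour: URL-safe base64 with trailing '=' removed.
    return base64.urlsafe_b64encode(data).rstrip(b"=")


def percent_b64encode(data: bytes) -> bytes:
    # Assumed behaviour: URL-safe base64 with '=' percent-escaped.
    return base64.urlsafe_b64encode(data).replace(b"=", b"%3D")


def window_start_token(start: int) -> bytes:
    """The innermost "ST:" string carrying the window start offset."""
    return b"ST:" + unpadded_b64encode(string(2, str(start)))


def minimal_ctoken(channel_id: str, start: int, target_id: str, sort: int = 1) -> str:
    """Rebuild the reduced token from the snippet above, level by level."""
    window = string(2, window_start_token(start))
    wrapped = string(1, unpadded_b64encode(string(1, unpadded_b64encode(window))))
    with_target = string(1, wrapped + string(2, target_id))
    params = string(110, string(3, string(15, with_target + uint(3, sort))))
    outer = string(80226972,
                   string(2, channel_id) + string(3, percent_b64encode(params)))
    return base64.urlsafe_b64encode(outer).decode("ascii")
```

Building the token level by level rather than in one nested expression makes it easier to see which layer is base64-wrapped and which is a plain protobuf message.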
Am I on the right track here?
We could go ahead and just work with the window start offset, or we could find out what field 2 actually is and calculate and provide it the expected way.
@michaelweiser Thanks for this, this is very helpful. Are you able to skip pages (such as requesting the last page)? Or does it restrict you to requesting pages sequentially?
It appears I can arbitrarily index into the video list without any previous knowledge or replayed request data just by synthesizing the continuation token locally (apart from the targetId). I've created a small test program which works like so (parameter is the window start):
$ ../bin/python3 t.py 2 | grep "^ \"videoId" | head -3
"videoId": "_iZaJKOcwPo",
"videoId": "IquHfs7H8Bk",
"videoId": "QrVL3g4iS1c",
$ ../bin/python3 t.py 0 | grep "^ \"videoId" | head -3
"videoId": "m3_aGlMk9a8",
"videoId": "OBF17olqeOU",
"videoId": "_iZaJKOcwPo",
$ ../bin/python3 t.py 1 | grep "^ \"videoId" | head -3
"videoId": "OBF17olqeOU",
"videoId": "_iZaJKOcwPo",
"videoId": "IquHfs7H8Bk",
$ ../bin/python3 t.py 2 | grep "^ \"videoId" | head -3
"videoId": "_iZaJKOcwPo",
"videoId": "IquHfs7H8Bk",
"videoId": "QrVL3g4iS1c",
$ ../bin/python3 t.py 3 | grep "^ \"videoId" | head -3
"videoId": "IquHfs7H8Bk",
"videoId": "QrVL3g4iS1c",
"videoId": "tEBCNTqJSuc",
$ ../bin/python3 t.py 2 | grep "^ \"videoId" | head -3
"videoId": "_iZaJKOcwPo",
"videoId": "IquHfs7H8Bk",
"videoId": "QrVL3g4iS1c",
$ ../bin/python3 t.py 1 | grep "^ \"videoId" | head -3
"videoId": "OBF17olqeOU",
"videoId": "_iZaJKOcwPo",
"videoId": "IquHfs7H8Bk",
Knowing the number of videos in the channel, I can work my way backwards. In this case, youtube-local displays 327 as the number of videos and this is what I get:
# ../bin/python3 t.py 327 | grep "^ \"videoId" | head -3
# ../bin/python3 t.py 326 | grep "^ \"videoId" | head -3
# ../bin/python3 t.py 325 | grep "^ \"videoId" | head -3
# ../bin/python3 t.py 324 | grep "^ \"videoId" | head -3
# ../bin/python3 t.py 320 | grep "^ \"videoId" | head -3
# ../bin/python3 t.py 310 | grep "^ \"videoId" | head -3
"videoId": "rhzmNRtIp8k",
"videoId": "fTaOlBWcl48",
"videoId": "4eNBM17tkjI",
# ../bin/python3 t.py 312 | grep "^ \"videoId" | head -3
"videoId": "4eNBM17tkjI",
"videoId": "eZIjxGY3Kok",
# ../bin/python3 t.py 313 | grep "^ \"videoId" | head -3
"videoId": "eZIjxGY3Kok",
I don't know where that offset of 13 (assuming zero-based indexing, i.e. index 326 should be video 327) is coming from.
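One way to locate the last usable window start without probing by hand would be a binary search. A rough sketch, assuming a hypothetical `has_videos(start)` callable that requests the token for `start` and reports whether any videoId came back, and assuming results are non-empty for every start up to some limit and empty after it (the thread only shows a few sample points, so that monotonicity is an assumption):

```python
from typing import Callable


def last_nonempty_start(has_videos: Callable[[int], bool], upper: int) -> int:
    """Return the largest start in [0, upper] for which has_videos(start) is true.

    Returns -1 if even start 0 yields nothing. Assumes has_videos is true
    up to some limit and false beyond it.
    """
    lo, hi = 0, upper
    best = -1
    while lo <= hi:
        mid = (lo + hi) // 2
        if has_videos(mid):
            best = mid      # mid works, look for a larger start
            lo = mid + 1
        else:
            hi = mid - 1    # mid is past the end, look lower
    return best


# Stand-in for a real request: pretends that starts 0..313 return videos.
print(last_nonempty_start(lambda start: start <= 313, 327))
```

With the reported count of 327 as the upper bound, this needs around nine requests instead of a manual walk backwards.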
This is the program:
import json
import sys

# util, headers_desktop and channel_ctoken_v4 come from the youtube-local code base

channel_id = "UCi2KNss4Yx73NG0JARSFe0A"
tab = "videos"
ctoken = channel_ctoken_v4(channel_id, sys.argv[1], 1, tab, 1)
ctoken = ctoken.replace('=', '%3D')
# Not sure what the purpose of the key is or whether it will change
# For now it seems to be constant for the API endpoint, not dependent
# on the browsing session or channel
key = 'AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8'
url = 'https://www.youtube.com/youtubei/v1/browse?key=' + key
data = {
    'context': {
        'client': {
            'hl': 'en',
            'gl': 'US',
            'clientName': 'WEB',
            'clientVersion': '2.20180830',
        },
    },
    'continuation': ctoken,
}
content_type_header = (('Content-Type', 'application/json'),)
content = util.fetch_url(
    url, headers_desktop + content_type_header,
    data=json.dumps(data), debug_name='channel_tab')
info = json.loads(content)
print(json.dumps(info, indent=True))
page is used verbatim (as the string retrieved from sys.argv) as the window start:
def channel_ctoken_v4(channel_id, page, sort, tab, view=1):
    pointless_nest = proto.string(80226972,
        proto.string(2, channel_id) +
        proto.string(3, proto.percent_b64encode( proto.string(110, proto.string(3, proto.string(15,
            proto.string(1, proto.string(1, proto.unpadded_b64encode( proto.string(1, proto.unpadded_b64encode(
                proto.string(2, b"ST:" + proto.unpadded_b64encode( proto.string(2, page))))))) +
                proto.string(2, "640adaee-0000-251d-8026-582429a8ab28")  # targetId
            ) +
            # 1 - newest, 2 - popular
            proto.uint(3, 1))))))
    return base64.urlsafe_b64encode(pointless_nest).decode('ascii')
Any idea what this checksum-like item 2 could be? It seems to be constant-width 64 bit. I'd say that's too short for a signature. It could be CRC-64. Does YT have a history of using that or any other 64-bit-wide algorithm?
I've tried looking hard at the hex representation of that value of different continuation tokens but couldn't spot a pattern. It does not seem to be a simple binary encoding of parameters.
What's interesting is that two continuation tokens for the same page of videos of the same channel but with different timestamps in item 8 will have the same (!) checksum (or whatever it is) in item 2. So at least these two do not seem to be part of the calculation.
> Any idea what this checksum-like item 2 could be? It seems to be constant-width 64 bit. I'd say that's too short for a signature. It could be CRC-64. Does YT have a history of using that or any other 64-bit-wide algorithm?
I just remembered something, I believe it's actually a video id: https://github.com/iv-org/invidious/issues/1319#issuecomment-671732646
> I have also discovered that the field_number=2 "junk" after the offset is actually a protobuf structure containing an integer which corresponds to a video id. Convert the integer to big endian bytes, encode those bytes as base64, drop the equals signs, and you get the video id which it aligns with. Now, the difficulty comes with sorting by oldest. The number is 17254859483345278706 this time. But simply leaving it at that and using the typical offsets refuses to work (I believe this big number might specify which protobuf schema to use). When sorting by oldest, instead of an offset for each multiple of 60, that same base64 encoded slot has a string with the video id to align with plus the unix timestamp for that video's upload date. It uses the same field_number=2 video_id alignment method for an offset of 30 from there. Unfortunately, it looks like there's no way to generate ctokens for arbitrary offsets when sorting by oldest, unless there's a hidden slot for an offset that I don't know about. So this has been a waste of time: changing to use the mobile api endpoint looks like the only solution... At least I've found a potential method to get the exact upload date for videos without an API key.
This was for the obsolete v2 ctoken format
So for your example, enc((13421533910578046448).to_bytes(8,byteorder='big')) -> 'ukLd-zjdJfA=' which corresponds to this Foo Fighters video: http://youtu.be/ukLd-zjdJfA
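The conversion can be checked with the standard library alone. A small sketch, assuming `enc` above stands for URL-safe base64; the function names are made up for illustration:

```python
import base64


def int_to_video_id(value: int) -> str:
    """Interpret a 64-bit integer as big-endian bytes and base64url-encode it."""
    raw = value.to_bytes(8, byteorder="big")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def video_id_to_int(video_id: str) -> int:
    """Inverse direction: restore the padding, decode, read as big-endian."""
    padded = video_id + "=" * (-len(video_id) % 4)
    return int.from_bytes(base64.urlsafe_b64decode(padded), byteorder="big")


print(int_to_video_id(13421533910578046448))  # ukLd-zjdJfA
print(video_id_to_int("ukLd-zjdJfA"))         # 13421533910578046448
```

The inverse function would be what is needed to build field 2 from a known video ID, should that alignment method ever be required.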
But given that your code still works without those parameters, I'll start working on using it in a fix
> I don't know where that offset of 13 (assuming zero-based indexing, i.e. index 326 should be video 327) is coming from.
Oh and this is because the video count is retrieved from a special playlist YouTube generates for each channel that contains its uploads. Take a channel ID such as UCi2KNss4Yx73NG0JARSFe0A, replace the leading UC with UU to get UUi2KNss4Yx73NG0JARSFe0A, and use that as a playlist ID: it gives a playlist of the channel's uploads, which sometimes contains different videos from those displayed by default on the channel (I think due to removed videos showing up in the playlist, but I'm not sure).
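As a tiny sketch of that substitution (the function name is hypothetical, and the check for the UC prefix is an added safeguard, not something the thread specifies):

```python
def uploads_playlist_id(channel_id: str) -> str:
    """Derive the uploads playlist ID by swapping the leading 'UC' for 'UU'."""
    if not channel_id.startswith("UC"):
        raise ValueError("expected a channel ID starting with 'UC'")
    return "UU" + channel_id[2:]


print(uploads_playlist_id("UCi2KNss4Yx73NG0JARSFe0A"))  # UUi2KNss4Yx73NG0JARSFe0A
```

Only the prefix is replaced, so a "UC" occurring later in the ID is left alone.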
Latest commit fully fixes arbitrary paging when sorting by newest. Unfortunately, metadata such as the channel name, channel description, and channel avatar is no longer returned by the continuation requests. Playlist "next page" button also works when sorting by newest.
However, sorting videos by popular still isn't working when I changed that sort key in the ctoken. @michaelweiser Have you figured out if your technique still works when sorting by popular?
> So for your example, enc((13421533910578046448).to_bytes(8,byteorder='big')) -> 'ukLd-zjdJfA=' which corresponds to this Foo Fighters video: http://youtu.be/ukLd-zjdJfA
Yes! And this appears to be the last video of the previous page, i.e. the page this continuation token was sent out with. If we tried to use this, it would likely break arbitrary paging because we'd need to know the last video of the previous page. I'll have a tinker with that tonight to see what the constraints are.
> Latest commit fully fixes arbitrary paging when sorting by newest. Unfortunately, metadata such as the channel name, channel description, and channel avatar is no longer returned by the continuation requests. Playlist "next page" button also works when sorting by newest.
Thanks for this!
> However, sorting videos by popular still isn't working when I changed that sort key in the ctoken. @michaelweiser Have you figured out if your technique still works when sorting by popular?
Yes it does. I think it's not working in ytl because sort is a string:
diff --git a/youtube/channel.py b/youtube/channel.py
index 2a3420b..f09b222 100644
--- a/youtube/channel.py
+++ b/youtube/channel.py
@@ -33,7 +33,7 @@ generic_cookie = (('Cookie', 'VISITOR_INFO1_LIVE=ST1Ti53r4fU'),)
 # https://github.com/user234683/youtube-local/issues/151
 def channel_ctoken_v4(channel_id, page, sort, tab, view=1):
-    new_sort = (2 if sort == 1 else 1)
+    new_sort = (2 if sort == "1" else 1)
     offset = str(30*(int(page) - 1))
     pointless_nest = proto.string(80226972,
         proto.string(2, channel_id)
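The underlying issue is that Python never considers `"1"` and `1` equal, so the comparison silently falls through to the else branch. An alternative to comparing against the string literal, sketched here with hypothetical function names, would be to normalize the value with `int()` so that both strings and integers are accepted:

```python
def new_sort_original(sort):
    # Compares against the integer 1; a string "1" never matches.
    return 2 if sort == 1 else 1


def new_sort_normalized(sort):
    # Convert first, so "1" (from a query string) and 1 behave the same.
    return 2 if int(sort) == 1 else 1


print(new_sort_original("1"))    # 1 (the string falls through to the else branch)
print(new_sort_normalized("1"))  # 2
print(new_sort_normalized(1))    # 2
```

Note that `int()` raises ValueError on non-numeric input, so this variant assumes sort has already been validated.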
BTW: By accident I found out that you can also specify 3 for sorting in the continuation token. It changes sorting but I have yet to spot what rule it follows. It's not "sort by oldest". :( Values 4 and higher are rejected with 400 Bad Request.
Also, the page links send me to actual YT currently:
@@ -335,7 +335,6 @@ def get_channel_page_general_url(base_url, tab, request, channel_id=None):
     if info['error'] is not None:
         return flask.render_template('error.html', error_message = info['error'])
-    post_process_channel_info(info)
     if tab == 'videos':
         info['number_of_videos'] = number_of_videos
         info['number_of_pages'] = math.ceil(number_of_videos/30)
@@ -347,6 +346,7 @@ def get_channel_page_general_url(base_url, tab, request, channel_id=None):
     elif tab == 'search':
         info['search_box_value'] = query
         info['header_playlist_names'] = local_playlist.get_playlist_names()
+    post_process_channel_info(info)
     if tab in ('search', 'playlists'):
         info['page_number'] = page_number
     info['subscribed'] = subscriptions.is_subscribed(info['channel_id'])
> Yes! And this appears to be the last video of the previous page, i.e. the page this continuation token was sent out with. If we tried to use this, it would likely break arbitrary paging because we'd need to know the last video of the previous page. I'll have a tinker with that tonight to see what the constraints are.
So playing with this it appears that as soon as this video ID is present all other values are ignored and you just get the next 30 videos after this one.
So I think the approach you took is the best we're going to get out of whatever all this continuation stuff is supposed to be.
> By accident I found out that you can also specify 3 for sorting in the continuation token. It changes sorting but I have yet to spot what rule it follows.
I still can't figure out what this is supposed to be. It returns 100 videos instead of 30 and a different set at each reload. I can't tell if the offset changes anything. Random playback, maybe?
Great catches, just pushed out a fix for those.
> I still can't figure out what this is supposed to be. It returns 100 videos instead of 30 and a different set at each reload. I can't tell if the offset changes anything. Random playback, maybe?
My only guess is it's used internally for some feature like music channel shuffling. Maybe there's a YouTube Music app and it's a shuffle button on the channels.
Also I just discovered that the channel playlist page and searches are still using the v3 ctoken format, so I added a conditional for that (this means the paging still works there as well)
The only thing that would be nice is if we could use query parameters instead of continuations to sort the videos by popular for the first page. That way, the first page when sorting by popular wouldn't break when YouTube changes their ctokens again. Unfortunately, I was unable to get any query parameters to work.
Putting this here to remind myself: the last thing that needs to be done before closing this is caching the channel name, so that adding videos to playlists from pages > 1 will preserve the channel name and URL.
As I understand it, sort by oldest is still broken?
> As I understand it, sort by oldest is still broken?
Yes, because the feature was removed by YouTube. There's a button to skip to the last page of the channel to work around it, then you can just go to previous pages to explore the oldest videos. Another way to work around it would be to use the channel videos playlist and reverse the order, but that would require a lot of extra work to implement
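The page arithmetic behind the "last page" workaround follows the two expressions that appear in the diffs earlier in this thread, `math.ceil(number_of_videos/30)` and `30*(int(page) - 1)`. A minimal sketch with hypothetical function names:

```python
import math

PAGE_SIZE = 30  # videos per page, as used in the thread


def last_page(number_of_videos: int) -> int:
    """Number of the last page; at least 1 even for an empty channel."""
    return max(1, math.ceil(number_of_videos / PAGE_SIZE))


def page_offset(page: int) -> int:
    """Window start offset for a 1-based page number."""
    return PAGE_SIZE * (page - 1)


print(last_page(327))                # 11
print(page_offset(last_page(327)))   # 300
```

Since the video count comes from the uploads playlist and may not match what the channel tab returns, the computed last page can come back shorter than expected or empty.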
When viewing the videos tab of a channel, clicking on any other page at the bottom shows the contents of the first page.
Also sorting by oldest/views is not working.
If sorting by oldest is selected, clicking "next page" shows:
"Failure getting metadata"
When viewing the playlist tab of a channel sorting by oldest doesn't seem to work.
When viewing the playlist tab of a channel and sorting by newest/oldest/last video added is selected, clicking "next page" shows:
500 Uncaught exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.9/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/Applications/youtube-local-2.7.2/youtube/channel.py", line 315, in get_channel_page
    return get_channel_page_general_url('https://www.youtube.com/channel/' + channel_id, tab, request, channel_id)
  File "/Applications/youtube-local-2.7.2/youtube/channel.py", line 278, in get_channel_page_general_url
    polymer_json = get_channel_tab(channel_id, page_number, sort,
  File "/Applications/youtube-local-2.7.2/youtube/channel.py", line 139, in get_channel_tab
    content = util.fetch_url(
  File "/Applications/youtube-local-2.7.2/youtube/util.py", line 359, in fetch_url
    raise FetchError(str(response.status), reason=response.reason,
youtube.util.FetchError: HTTP error during request: 400 Bad Request