user234683 / youtube-local

browser-based client for watching Youtube anonymously and with greater page performance
GNU Affero General Public License v3.0

Switching between pages in channel view always displays content of first page #151

Closed metrast closed 1 year ago

metrast commented 1 year ago

"Failure getting metadata"

500 Uncaught exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.9/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/Applications/youtube-local-2.7.2/youtube/channel.py", line 315, in get_channel_page
    return get_channel_page_general_url('https://www.youtube.com/channel/' + channel_id, tab, request, channel_id)
  File "/Applications/youtube-local-2.7.2/youtube/channel.py", line 278, in get_channel_page_general_url
    polymer_json = get_channel_tab(channel_id, page_number, sort,
  File "/Applications/youtube-local-2.7.2/youtube/channel.py", line 139, in get_channel_tab
    content = util.fetch_url(
  File "/Applications/youtube-local-2.7.2/youtube/util.py", line 359, in fetch_url
    raise FetchError(str(response.status), reason=response.reason,
youtube.util.FetchError: HTTP error during request: 400 Bad Request

user234683 commented 1 year ago

Possibly related to https://github.com/TeamNewPipe/NewPipe/issues/9223

user234683 commented 1 year ago

Relevant invidious issue as well as pull request. The problem is due to a new continuation token format as well as a new JSON response format. They claim arbitrary paging is not possible with the new ctoken format, but I think it might be.

michaelweiser commented 1 year ago

Not being able to access old channel videos is getting annoying. So I played with the continuation tokens a bit and this is where I'm at:

def channel_ctoken_v4(channel_id, page, sort, tab, view=1):
    pointless_nest = proto.string(80226972,
        proto.string(2, channel_id) +
        proto.string(3,
            proto.percent_b64encode(
                proto.string(110,
                    proto.string(3,
                        proto.string(15,
                            proto.string(1,
                                proto.string(1,
                                    proto.unpadded_b64encode(
                                        proto.string(1,
                                            proto.unpadded_b64encode(
                                                proto.string(2,
                                                    b"ST:" +
                                                    proto.unpadded_b64encode(
                                                        proto.string(2, "150") # window start
                                                    )
                                                )
                                            )
                                        ) +
                                        proto.string(2,
                                            # some checksum which has to match
                                            # so that page offset is accepted?
                                            proto.uint(1, 13421533910578046448)) +
                                        proto.uint(5, 50) + # window size?
                                        proto.uint(6, 6) + # page of 30
                                        proto.uint(7, 180) + # offset in steps of 30 inside window
                                        proto.string(8,
                                            # seconds since epoch initial
                                            # request, static across successive requests
                                            proto.uint(1, 1676803892) +
                                            # unknown but static across successive tokens
                                            proto.uint(2, 197625613)) +
                                        proto.uint(9, 6) + # page of 30
                                        proto.uint(10, 150) # window start
                                    )
                                ) +
                                proto.string(2, "63faaff0-0000-23fe-80f0-582429d11c38") #targetId
                            ) +
                            proto.uint(3, 1)    # 1 - newest, 2 - popular
                        )
                    )
                )
            )
        )
    )

    return base64.urlsafe_b64encode(pointless_nest).decode('ascii')

It seems a sliding window was added on top of the page concept. Merits and behaviour unclear. The page offset has to lie inside the sliding window and the checksum in field 2 has to match for it to be accepted. targetId can be harvested from the initial request response. tab and view info seem gone.

Almost all of that is optional and I am able to arbitrarily retrieve "pages" of 30 videos by just moving the sliding window start offset in that funky ST: string like so:

    pointless_nest = proto.string(80226972,
        proto.string(2, channel_id) +
        proto.string(3,
            proto.percent_b64encode(
                proto.string(110,
                    proto.string(3,
                        proto.string(15,
                            proto.string(1,
                                proto.string(1,
                                    proto.unpadded_b64encode(
                                        proto.string(1,
                                            proto.unpadded_b64encode(
                                                proto.string(2,
                                                    b"ST:" +
                                                    proto.unpadded_b64encode(
                                                        # get 30 videos starting with number 4 (counting starts at 0)
                                                        proto.string(2, "3")
                                                    )
                                                )
                                            )
                                        )
                                    )
                                ) +
                                proto.string(2, "63faaff0-0000-23fe-80f0-582429d11c38") #targetId
                            ) +
                            proto.uint(3, 1)    # 1 - newest, 2 - popular
                        )
                    )
                )
            )
        )
    )

    return base64.urlsafe_b64encode(pointless_nest).decode('ascii')

Am I on the right track here?

We could go ahead and just work with the window start offset, or we could find out what field 2 actually is and calculate and provide it the expected way.
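For readers without the youtube-local source at hand, here is a minimal sketch of the wire encoding that the nested proto.string and proto.uint calls above rely on. It follows the standard protobuf encoding (varint keys, wire type 0 for integers, wire type 2 for length-delimited data). The helper names below are hypothetical stand-ins, not youtube-local's proto module, and the assumption that proto.unpadded_b64encode is URL-safe base64 with the padding stripped is mine.

```python
import base64


def varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def uint_field(field_number: int, value: int) -> bytes:
    """Wire type 0 (varint) field: key followed by the value."""
    return varint(field_number << 3 | 0) + varint(value)


def string_field(field_number: int, data) -> bytes:
    """Wire type 2 (length-delimited) field; str input is UTF-8 encoded."""
    if isinstance(data, str):
        data = data.encode("utf-8")
    return varint(field_number << 3 | 2) + varint(len(data)) + data


def unpadded_b64(data: bytes) -> bytes:
    """URL-safe base64 without '=' padding (assumed behaviour, see above)."""
    return base64.urlsafe_b64encode(data).rstrip(b"=")


def window_start_marker(start: int) -> bytes:
    """Innermost piece of the token: b'ST:' + base64(field 2 = start as text)."""
    return b"ST:" + unpadded_b64(string_field(2, str(start)))


print(window_start_marker(3))  # prints b'ST:EgEz'
```

Only the innermost "ST:" piece is built here; the outer layers would wrap it in further string fields the same way the snippets above do.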

user234683 commented 1 year ago

@michaelweiser Thanks for this, this is very helpful. Are you able to skip pages (such as requesting the last page)? Or does it restrict you to requesting pages sequentially?

michaelweiser commented 1 year ago

It appears I can arbitrarily index into the video list without any previous knowledge or replayed request data just by synthesizing the continuation token locally (apart from the targetId). I've created a small test program which works like so (parameter is the window start):

$ ../bin/python3 t.py 2 | grep "^         \"videoId" | head -3
         "videoId": "_iZaJKOcwPo",
         "videoId": "IquHfs7H8Bk",
         "videoId": "QrVL3g4iS1c",
$ ../bin/python3 t.py 0 | grep "^         \"videoId" | head -3
         "videoId": "m3_aGlMk9a8",
         "videoId": "OBF17olqeOU",
         "videoId": "_iZaJKOcwPo",
$ ../bin/python3 t.py 1 | grep "^         \"videoId" | head -3
         "videoId": "OBF17olqeOU",
         "videoId": "_iZaJKOcwPo",
         "videoId": "IquHfs7H8Bk",
$ ../bin/python3 t.py 2 | grep "^         \"videoId" | head -3
         "videoId": "_iZaJKOcwPo",
         "videoId": "IquHfs7H8Bk",
         "videoId": "QrVL3g4iS1c",
$ ../bin/python3 t.py 3 | grep "^         \"videoId" | head -3
         "videoId": "IquHfs7H8Bk",
         "videoId": "QrVL3g4iS1c",
         "videoId": "tEBCNTqJSuc",
$ ../bin/python3 t.py 2 | grep "^         \"videoId" | head -3
         "videoId": "_iZaJKOcwPo",
         "videoId": "IquHfs7H8Bk",
         "videoId": "QrVL3g4iS1c",
$ ../bin/python3 t.py 1 | grep "^         \"videoId" | head -3
         "videoId": "OBF17olqeOU",
         "videoId": "_iZaJKOcwPo",
         "videoId": "IquHfs7H8Bk",

Knowing the number of videos in the channel, I can work my way backwards. In this case, youtube-local displays 327 as number and this is what I get:

# ../bin/python3 t.py 327 | grep "^         \"videoId" | head -3
# ../bin/python3 t.py 326 | grep "^         \"videoId" | head -3
# ../bin/python3 t.py 325 | grep "^         \"videoId" | head -3
# ../bin/python3 t.py 324 | grep "^         \"videoId" | head -3
# ../bin/python3 t.py 320 | grep "^         \"videoId" | head -3
# ../bin/python3 t.py 310 | grep "^         \"videoId" | head -3
         "videoId": "rhzmNRtIp8k",
         "videoId": "fTaOlBWcl48",
         "videoId": "4eNBM17tkjI",
# ../bin/python3 t.py 312 | grep "^         \"videoId" | head -3
         "videoId": "4eNBM17tkjI",
         "videoId": "eZIjxGY3Kok",
# ../bin/python3 t.py 313 | grep "^         \"videoId" | head -3
         "videoId": "eZIjxGY3Kok",

I don't know where that offset of 13 (assuming zero-based indexing, i.e. index 326 should be video 327) is coming from.

This is the program:

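import base64
import json
import sys

# util, proto and headers_desktop come from the youtube-local source tree
# (their imports are not shown here)

channel_id = "UCi2KNss4Yx73NG0JARSFe0A"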

tab = "videos"
ctoken = channel_ctoken_v4(channel_id, sys.argv[1], 1, tab, 1)
ctoken = ctoken.replace('=', '%3D')

# Not sure what the purpose of the key is or whether it will change
# For now it seems to be constant for the API endpoint, not dependent
# on the browsing session or channel
key = 'AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8'
url = 'https://www.youtube.com/youtubei/v1/browse?key=' + key

data = {
    'context': {
        'client': {
            'hl': 'en',
            'gl': 'US',
            'clientName': 'WEB',
            'clientVersion': '2.20180830',
        },
    },
    'continuation': ctoken,
}

content_type_header = (('Content-Type', 'application/json'),)
content = util.fetch_url(
    url, headers_desktop + content_type_header,
    data=json.dumps(data), debug_name='channel_tab')

info = json.loads(content)
print(json.dumps(info, indent=True))
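Rather than pretty-printing the whole response and grepping it at a fixed indentation depth, the video IDs could also be collected by walking the parsed JSON. This is only a sketch: it assumes the response is made of nested dicts and lists with "videoId" keys, as the grep output suggests, and it deduplicates because the same ID may appear at several nesting levels. The function names and the sample structure in the usage note are hypothetical.

```python
def iter_video_ids(node):
    """Yield every string stored under a "videoId" key, depth first."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "videoId" and isinstance(value, str):
                yield value
            else:
                yield from iter_video_ids(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_video_ids(item)


def first_unique(ids, limit=3):
    """Return the first `limit` distinct IDs, preserving their order."""
    seen = []
    for video_id in ids:
        if video_id not in seen:
            seen.append(video_id)
            if len(seen) == limit:
                break
    return seen
```

In the program above, this would be called as first_unique(iter_video_ids(info)) in place of the final print.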

page is used verbatim (as the string retrieved from sys.argv) as the window start:

def channel_ctoken_v4(channel_id, page, sort, tab, view=1):
    pointless_nest = proto.string(80226972,
        proto.string(2, channel_id) +
        proto.string(3, proto.percent_b64encode( proto.string(110, proto.string(3, proto.string(15,
            proto.string(1, proto.string(1, proto.unpadded_b64encode( proto.string(1, proto.unpadded_b64encode(
                    proto.string(2, b"ST:" + proto.unpadded_b64encode( proto.string(2, page))))))) +
                proto.string(2, "640adaee-0000-251d-8026-582429a8ab28") #targetId
            ) +
            proto.uint(3, 1)    # 1 - newest, 2 - popular
        ))))))

    return base64.urlsafe_b64encode(pointless_nest).decode('ascii')

Any idea what this checksum-like item 2 could be? It seems to be constant-width 64 bit. I'd say that's too short for a signature. It could be CRC-64. Does YT have a history of using that or any other 64-bit-wide algorithm?

I've tried looking hard at the hex representation of that value of different continuation tokens but couldn't spot a pattern. It does not seem to be a simple binary encoding of parameters.

What's interesting is that two continuation tokens for the same page of videos of the same channel but with different timestamps in item 8 will have the same (!) checksum (or whatever it is) in item 2. So at least these two do not seem to be part of the calculation.

user234683 commented 1 year ago

> Any idea what this checksum-like item 2 could be? It seems to be constant-width 64 bit. I'd say that's too short for a signature. It could be CRC-64. Does YT have a history of using that or any other 64-bit-wide algorithm?

I just remembered something, I believe it's actually a video id: https://github.com/iv-org/invidious/issues/1319#issuecomment-671732646

> I have also discovered that the field_number=2 "junk" after the offset is actually a protobuf structure containing an integer which corresponds to a video id. Convert the integer to big endian bytes, encode those bytes as base64, drop the equals signs, and you get the video id which it aligns with. Now, the difficulty comes with sorting by oldest. The number is 17254859483345278706 this time. But simply leaving it at that and using the typical offsets refuses to work (I believe this big number might specify which protobuf schema to use). When sorting by oldest, instead of an offset for each multiple of 60, that same base64 encoded slot has a string with the video id to align with plus the unix timestamp for that video's upload date. It uses the same field_number=2 video_id alignment method for an offset of 30 from there. Unfortunately, it looks like there's no way to generate ctokens for arbitrary offsets when sorting by oldest, unless there's a hidden slot for an offset that I don't know about. So this has been a waste of time: changing to use the mobile api endpoint looks like the only solution... At least I've found a potential method to get the exact upload date for videos without an API key.

This was for the obsolete v2 ctoken format

So for your example, enc((13421533910578046448).to_bytes(8,byteorder='big')) -> 'ukLd-zjdJfA=' which corresponds to this Foo Fighters video: http://youtu.be/ukLd-zjdJfA
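The conversion described above (integer to big-endian bytes, URL-safe base64, padding dropped) can be sketched as follows. The function names are hypothetical, and the reverse direction is included only for convenience; it is not something the thread relies on.

```python
import base64


def int_to_video_id(value: int) -> str:
    """64-bit integer -> 8 big-endian bytes -> URL-safe base64 without '='."""
    raw = value.to_bytes(8, byteorder="big")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def video_id_to_int(video_id: str) -> int:
    """Inverse: restore the padding, decode, read as a big-endian integer."""
    padded = video_id + "=" * (-len(video_id) % 4)
    return int.from_bytes(base64.urlsafe_b64decode(padded), byteorder="big")


print(int_to_video_id(13421533910578046448))  # prints ukLd-zjdJfA
```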

But given that your code still works without those parameters, I'll start working on using it in a fix

user234683 commented 1 year ago

> I don't know where that offset of 13 (assuming zero-based indexing, i.e. index 326 should be video 327) is coming from.

Oh, and this is because the video count is retrieved from a special playlist YouTube generates for each channel that contains its uploads. Take a channel ID such as UCi2KNss4Yx73NG0JARSFe0A and replace UC with UU to get UUi2KNss4Yx73NG0JARSFe0A. Used as a playlist ID, this gives a playlist of the channel's uploads, which sometimes contains different videos from those displayed by default in the channel (I think due to removed videos showing up in the playlist, but I'm not sure).
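The UC to UU substitution is a plain prefix swap. A minimal sketch, with a hypothetical function name and a prefix check that is my own addition:

```python
def uploads_playlist_id(channel_id: str) -> str:
    """Derive the uploads playlist ID from a channel ID (UC... -> UU...)."""
    if not channel_id.startswith("UC"):
        raise ValueError("expected a channel ID starting with 'UC'")
    return "UU" + channel_id[2:]


print(uploads_playlist_id("UCi2KNss4Yx73NG0JARSFe0A"))  # prints UUi2KNss4Yx73NG0JARSFe0A
```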

user234683 commented 1 year ago

Latest commit fully fixes arbitrary paging when sorting by newest. Unfortunately, metadata such as the channel name, channel description, and channel avatar is no longer returned by the continuation requests. Playlist "next page" button also works when sorting by newest.

However, sorting videos by popular still isn't working when I changed that sort key in the ctoken. @michaelweiser Have you figured out if your technique still works when sorting by popular?

michaelweiser commented 1 year ago

> So for your example, enc((13421533910578046448).to_bytes(8,byteorder='big')) -> 'ukLd-zjdJfA=' which corresponds to this Foo Fighters video: http://youtu.be/ukLd-zjdJfA

Yes! And this appears to be the last video of the previous page, i.e. the page this continuation token was sent out with. If we tried to use this, it would likely break arbitrary paging because we'd need to know the last video of the previous page. I'll have a tinker with that tonight to see what the constraints are.

> Latest commit fully fixes arbitrary paging when sorting by newest. Unfortunately, metadata such as the channel name, channel description, and channel avatar is no longer returned by the continuation requests. Playlist "next page" button also works when sorting by newest.

Thanks for this!

> However, sorting videos by popular still isn't working when I changed that sort key in the ctoken. @michaelweiser Have you figured out if your technique still works when sorting by popular?

Yes it does. I think it not working in ytl is due to sort being a string:

diff --git a/youtube/channel.py b/youtube/channel.py
index 2a3420b..f09b222 100644
--- a/youtube/channel.py
+++ b/youtube/channel.py
@@ -33,7 +33,7 @@ generic_cookie = (('Cookie', 'VISITOR_INFO1_LIVE=ST1Ti53r4fU'),)

 # https://github.com/user234683/youtube-local/issues/151
 def channel_ctoken_v4(channel_id, page, sort, tab, view=1):
-    new_sort = (2 if sort == 1 else 1)
+    new_sort = (2 if sort == "1" else 1)
     offset = str(30*(int(page) - 1))
     pointless_nest = proto.string(80226972,
         proto.string(2, channel_id)
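The pitfall behind the hunk above can be reproduced in isolation: in Python a str never compares equal to an int, so sort == 1 is always False when sort arrives as the string "1". The sketch below normalizes the value with int() so both types work. That is a different choice from the diff, which compares against "1", and the function name is hypothetical.

```python
def new_sort_key(sort) -> int:
    """Map the incoming sort value to field 3 of the v4 token.

    Same mapping as the diff above: an incoming 1 becomes 2, anything else 1.
    """
    return 2 if int(sort) == 1 else 1


print("1" == 1)           # prints False: the original comparison never matched
print(new_sort_key("1"))  # prints 2
```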

BTW: By accident I found out that you can also specify 3 for sorting in the continuation token. It changes sorting but I have yet to spot what rule it follows. It's not "sort by oldest". :( Values 4 and higher are rejected with 400 Bad Request.

Also, the page links send me to actual YT currently:

@@ -335,7 +335,6 @@ def get_channel_page_general_url(base_url, tab, request, channel_id=None):
     if info['error'] is not None:
         return flask.render_template('error.html', error_message = info['error'])

-    post_process_channel_info(info)
     if tab == 'videos':
         info['number_of_videos'] = number_of_videos
         info['number_of_pages'] = math.ceil(number_of_videos/30)
@@ -347,6 +346,7 @@ def get_channel_page_general_url(base_url, tab, request, channel_id=None):
     elif tab == 'search':
         info['search_box_value'] = query
         info['header_playlist_names'] = local_playlist.get_playlist_names()
+    post_process_channel_info(info)
     if tab in ('search', 'playlists'):
         info['page_number'] = page_number
     info['subscribed'] = subscriptions.is_subscribed(info['channel_id'])

michaelweiser commented 1 year ago

> Yes! And this appears to be the last video of the previous page, i.e. the page this continuation token was sent out with. If we tried to use this, it would likely break arbitrary paging because we'd need to know the last video of the previous page. I'll have a tinker with that tonight to see what the constraints are.

So playing with this it appears that as soon as this video ID is present all other values are ignored and you just get the next 30 videos after this one.

So I think the approach you took is the best we're going to get out of whatever all this continuation stuff is supposed to be.

> By accident I found out that you can also specify 3 for sorting in the continuation token. It changes sorting but I have yet to spot what rule it follows.

I still can't figure out what this is supposed to be. It returns 100 videos instead of 30 and a different set at each reload. Can't tell if the offset changes anything. Random playback, maybe?

user234683 commented 1 year ago

Great catches, just pushed out a fix for those.

> I still can't figure out what this is supposed to be. It returns 100 videos instead of 30 and a different set at each reload. Can't tell if the offset changes anything. Random playback, maybe?

My only guess is it's used internally for some feature like music channel shuffling. Maybe there's a YouTube Music app and it's a shuffle button on the channels.

Also I just discovered that the channel playlist page and searches are still using the v3 ctoken format, so I added a conditional for that (this means the paging still works there as well)

The only thing that would be nice is if we could use query parameters instead of continuations to sort the videos by popular for the first page. That way, the first page when sorting by popular wouldn't break when YouTube changes their ctokens again. Unfortunately, I was unable to get any query parameters to work.

user234683 commented 1 year ago

Putting this here to remind myself. Last thing that needs to be done before closing this is caching the channel name so that adding videos to playlists from pages > 1 will preserve the channel name and url

bitingsock commented 1 year ago

As I understand it, sort by oldest is still broken?

user234683 commented 1 year ago

> As I understand it, sort by oldest is still broken?

Yes, because the feature was removed by YouTube. There's a button to skip to the last page of the channel to work around it, then you can just go to previous pages to explore the oldest videos. Another way to work around it would be to use the channel videos playlist and reverse the order, but that would require a lot of extra work to implement
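The arithmetic behind a "skip to last page" button can be sketched from the two expressions that appear in the diffs above: math.ceil(number_of_videos/30) for the page count and 30*(int(page) - 1) for the offset. The function names are hypothetical, and as noted earlier in the thread the video count comes from the uploads playlist, so it may overstate what the channel tab actually lists.

```python
import math

PAGE_SIZE = 30  # videos per page, as used in the diffs above


def number_of_pages(number_of_videos: int) -> int:
    """Page count for a channel with the given number of videos."""
    return math.ceil(number_of_videos / PAGE_SIZE)


def page_offset(page) -> int:
    """Zero-based index of the first video on a 1-based page number."""
    return PAGE_SIZE * (int(page) - 1)


last_page = number_of_pages(327)
print(last_page, page_offset(last_page))  # prints 11 300
```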