tombulled / innertube

Python Client for Google's Private InnerTube API. Works with YouTube, YouTube Music and more!
https://pypi.org/project/innertube/
MIT License

YouTube Comments #60

Closed EtorixDev closed 2 months ago

EtorixDev commented 12 months ago

Hello, I noticed that #17 states that getting comments is not part of the InnerTube API. I'm not sure if things have changed or if I'm misunderstanding what counts as part of the InnerTube API, but I have managed to get the comments by doing the following:

  1. Send a next request to https://www.youtube.com/youtubei/v1/next?key={key} with the specified video ID in the data.
  2. Extract the continuation token. There's a default, a "Top" sort, and a "New" sort. I've only tried the default.
  3. Send a second next request without the video ID, specifying the continuation token in the data block instead.
  4. This should return the first 20 or so comments in a very ugly nested way.
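The two request bodies in the steps above can be sketched as plain dictionaries. This is only an illustration: the `context.client` shape, the client name and version values, and the helper name `build_next_body` are assumptions, and no request is actually sent.

```python
import json

# Assumed client context; the name and version are illustrative values.
CLIENT_CONTEXT = {
    "client": {"clientName": "WEB", "clientVersion": "2.20240105.01.00"}
}


def build_next_body(video_id=None, continuation=None):
    """Build the JSON body for a `next` request (hypothetical helper).

    Step 1 passes only the video ID; step 3 passes only the continuation token.
    """
    if (video_id is None) == (continuation is None):
        raise ValueError("pass exactly one of video_id or continuation")

    body = {"context": CLIENT_CONTEXT}

    if video_id is not None:
        body["videoId"] = video_id
    else:
        body["continuation"] = continuation

    return body


# Step 1: the initial request, identified by video ID
first = build_next_body(video_id="BV1O7RR-VoA")

# Step 3: the follow-up request, identified only by the continuation token
second = build_next_body(continuation="<token from step 2>")

print(json.dumps(first, indent=2))
print(json.dumps(second, indent=2))
```

Either body would then be POSTed as JSON to the `next` URL from step 1.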

Something I've yet to figure out is how to get a highlighted comment to appear at the top of the json list. If you click on a YouTube comment's date, it will open a link with a "&lc=" param that has the comment's ID. And in the comments it will appear at the top as "Highlighted".
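For reference, pulling the comment ID out of such a link needs only the standard library. A minimal sketch, where the helper name `extract_comment_id` and the example ID are made up for illustration:

```python
from urllib.parse import parse_qs, urlparse


def extract_comment_id(url):
    """Return the value of the `lc` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("lc")

    return values[0] if values else None


# Placeholder comment ID, not a real one
url = "https://www.youtube.com/watch?v=BV1O7RR-VoA&lc=EXAMPLE_COMMENT_ID"

print(extract_comment_id(url))
```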

If I use the continuation token for the second request from the dev tools inspector when loading the highlighted comment link in the browser then the second next request properly returns the highlighted comment at the top of the json list.

However, if I use the continuation retrieved programmatically from the first next request, the response never has the highlighted comment at the top. This suggests the highlighted comment is tied to the continuation token, which seems to be generated outside the scope of the next endpoint, unless I've simply not found the correct way yet.

tombulled commented 10 months ago

Hi, apologies for the late reply. I'll look into this now.

tombulled commented 10 months ago

I've been able to reproduce the ability to list the first n comments (either "top" or "newest").

Here's the (admittedly lashed together) script I used:

from innertube import InnerTube

ENGAGEMENT_SECTION_COMMENTS = "engagement-panel-comments-section"
COMMENTS_TOP = "Top comments"
COMMENTS_NEWEST = "Newest first"

def parse_text(text):
    return "".join(run["text"] for run in text["runs"])

def extract_engagement_panels(next_data):
    engagement_panels = {}
    raw_engagement_panels = next_data.get("engagementPanels", [])

    for raw_engagement_panel in raw_engagement_panels:
        engagement_panel = raw_engagement_panel.get(
            "engagementPanelSectionListRenderer", {}
        )
        target_id = engagement_panel.get("targetId")

        engagement_panels[target_id] = engagement_panel

    return engagement_panels

def parse_sort_filter_sub_menu(menu):
    menu_items = menu["sortFilterSubMenuRenderer"]["subMenuItems"]

    return {menu_item["title"]: menu_item for menu_item in menu_items}

def extract_comments(next_continuation_data):
    return [
        continuation_item["commentThreadRenderer"]
        for continuation_item in next_continuation_data["onResponseReceivedEndpoints"][
            1
        ]["reloadContinuationItemsCommand"]["continuationItems"][:-1]
    ]

# YouTube Web Client
client = InnerTube("WEB", "2.20240105.01.00")

# ShortCircuit - Dell just DESTROYED the Surface Pro! - Dell XPS 13 2-in-1
video = client.next("BV1O7RR-VoA")

engagement_panels = extract_engagement_panels(video)
comments = engagement_panels[ENGAGEMENT_SECTION_COMMENTS]
comments_header = comments["header"]["engagementPanelTitleHeaderRenderer"]
comments_title = parse_text(comments_header["title"])
comments_context = parse_text(comments_header["contextualInfo"])
comments_menu_items = parse_sort_filter_sub_menu(comments_header["menu"])
comments_top = comments_menu_items[COMMENTS_TOP]
comments_top_continuation = comments_top["serviceEndpoint"]["continuationCommand"][
    "token"
]

print(f"{comments_title} ({comments_context})...")
print()

comments_continuation = client.next(continuation=comments_top_continuation)

comments = extract_comments(comments_continuation)

for comment in comments:
    comment_renderer = comment["comment"]["commentRenderer"]

    comment_author = comment_renderer["authorText"]["simpleText"]
    comment_content = parse_text(comment_renderer["contentText"])

    print(f"[{comment_author}]")
    print(comment_content)
    print()
$ python app.py
Comments (1.7K)...

[@ViXoZuDo]
I would 100% prefer the headphone jack over that camera...

[@ouilsen2]
As a Surface Pro user I have one observation...

...

(I'll add this to the examples/ directory in case it helps anyone else)

I'll have a fiddle with highlighting a comment now in case I can figure out what's going on there

tombulled commented 10 months ago

It looks like highlighting a comment sends off a request to the /next endpoint with some params and the videoId. I'll see if I can whip up a quick PoC for this now

tombulled commented 10 months ago

I think I've figured out why highlighting a comment wasn't working. The continuation tokens for "top" and "newest" that you can extract from engagementPanels aren't influenced by the params passed to the /next endpoint; the continuation token for the comment-item-section, however, does change.

The below example ignores the engagementPanels entirely and instead uses the continuation token for the comments item section:

from innertube import InnerTube

# YouTube Web Client
CLIENT = InnerTube("WEB", "2.20240105.01.00")

def parse_text(text):
    return "".join(run["text"] for run in text["runs"])

def flatten(items):
    flat_items = {}

    for item in items:
        key = next(iter(item))
        val = item[key]

        flat_items.setdefault(key, []).append(val)

    return flat_items

def flatten_item_sections(item_sections):
    return {
        item_section["sectionIdentifier"]: item_section
        for item_section in item_sections
    }

def extract_comments(next_continuation_data):
    return [
        continuation_item["commentThreadRenderer"]
        for continuation_item in next_continuation_data["onResponseReceivedEndpoints"][
            1
        ]["reloadContinuationItemsCommand"]["continuationItems"][:-1]
    ]

def extract_comments_continuation_token(next_data):
    contents = flatten(
        next_data["contents"]["twoColumnWatchNextResults"]["results"]["results"][
            "contents"
        ]
    )
    item_sections = flatten_item_sections(contents["itemSectionRenderer"])
    comment_item_section_content = item_sections["comment-item-section"]["contents"][0]
    comments_continuation_token = comment_item_section_content[
        "continuationItemRenderer"
    ]["continuationEndpoint"]["continuationCommand"]["token"]

    return comments_continuation_token

def get_comments(video_id, params=None):
    video = CLIENT.next(video_id, params=params)

    continuation_token = extract_comments_continuation_token(video)

    comments_continuation = CLIENT.next(continuation=continuation_token)

    return extract_comments(comments_continuation)

def print_comment(comment):
    comment_renderer = comment["comment"]["commentRenderer"]

    comment_author = comment_renderer["authorText"]["simpleText"]
    comment_content = parse_text(comment_renderer["contentText"])

    print(f"[{comment_author}]")
    print(comment_content)
    print()

video_id = "BV1O7RR-VoA"

# Get comments for a given video
comments = get_comments(video_id)

# Select a comment to highlight (in this case the 3rd one)
comment = comments[2]

# Print the comment we're going to highlight
print("### Highlighting Comment: ###")
print()
print_comment(comment)
print("---------------------")
print()

# Extract the 'params' to highlight this comment
params = comment["comment"]["commentRenderer"]["publishedTimeText"]["runs"][0][
    "navigationEndpoint"
]["watchEndpoint"]["params"]

# Get comments, but highlighting the selected comment
highlighted_comments = get_comments(video_id, params=params)

print("### Comments: ###")
print()

for comment in highlighted_comments:
    print_comment(comment)
$ python app.py
### Highlighting Comment: ###

[@alphacompton]
The built in mic on the 2-1 is exceptional and the camera is excellent from your video sample. Look like a better buy especially if it's cheaper than the Surface pro.

---------------------

### Comments: ###

[@alphacompton]
The built in mic on the 2-1 is exceptional and the camera is excellent from your video sample. Look like a better buy especially if it's cheaper than the Surface pro.

[@ouilsen2]
As a Surface Pro user I have one observation....

...

Hope that helps!

Please let me know if you have any further questions, or if this answers your query

Best, Tom

EtorixDev commented 10 months ago

Hi, thanks for the detailed reply.

The idea behind the highlighting was to store a reference (such as the comment ID) to it in a database and come back to it later. One such use case would be a system that checks for the existence of a membership badge on a user's message monthly. That's why it would have been ideal to have a way to programmatically jump straight to the comment in 1 request like in the browser (on the initial lookup, not just subsequent ones).

Unfortunately from your response it seems "highlighting" a comment internally is done with the comment's watchEndpoint params, so the initial request for the comment will require scraping them all until the target comment is found by checking for the comment ID, and then storing the params instead of the comment ID for future immediate lookup.

Would this work, or do you suspect the params of comments change often?
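The lookup described above could be sketched roughly as follows. It takes a list of commentThreadRenderer dicts (the shape returned by extract_comments in the script earlier in this thread) and returns the watchEndpoint params of the matching comment. The `commentId` key and the helper name `find_highlight_params` are assumptions, and the sketch searches only one page of comments:

```python
def find_highlight_params(comment_threads, comment_id):
    """Return the watchEndpoint params for the given comment, or None.

    Assumes each commentRenderer carries its ID under "commentId".
    """
    for thread in comment_threads:
        renderer = thread.get("comment", {}).get("commentRenderer", {})

        if renderer.get("commentId") != comment_id:
            continue

        # Same path the example script uses to read the highlight params
        for run in renderer.get("publishedTimeText", {}).get("runs", []):
            endpoint = run.get("navigationEndpoint", {}).get("watchEndpoint", {})
            params = endpoint.get("params")

            if params:
                return params

    return None


# Synthetic data for illustration only
threads = [
    {"comment": {"commentRenderer": {"commentId": "a"}}},
    {
        "comment": {
            "commentRenderer": {
                "commentId": "b",
                "publishedTimeText": {
                    "runs": [
                        {
                            "navigationEndpoint": {
                                "watchEndpoint": {"params": "PARAMS_FOR_B"}
                            }
                        }
                    ]
                },
            }
        }
    },
]

print(find_highlight_params(threads, "b"))
```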

Thanks again.

tombulled commented 6 months ago

Hi @EtorixDev, apologies for the slow turnaround on a reply to your last comment. I believe the params field contains base64-encoded protobuf data (potentially also URL-encoded). You should be able to decode the contents of params using a tool such as https://protobuf-decoder.netlify.app/. It is possible that the protobuf structure contains the comment ID, and that all other fields are static. If this is the case, you should be able to generate the correct params value knowing only the comment ID.
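If you'd rather inspect the value locally, here is a minimal sketch using only the standard library. It URL-decodes and base64-decodes the string, then walks the top-level protobuf wire format without a schema. The helper names are hypothetical, and whether a real params value actually decodes this way is an assumption I haven't verified:

```python
import base64
from urllib.parse import unquote


def decode_params(params):
    """URL-decode, then base64-decode (URL-safe alphabet) a params string."""
    raw = unquote(params)
    padded = raw + "=" * (-len(raw) % 4)  # restore any stripped padding

    return base64.urlsafe_b64decode(padded)


def read_varint(data, pos):
    """Read one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0

    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift

        if not byte & 0x80:
            return result, pos

        shift += 7


def parse_fields(data):
    """List top-level fields as (field_number, wire_type, value) tuples."""
    fields = []
    pos = 0

    while pos < len(data):
        key, pos = read_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07

        if wire_type == 0:  # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 1:  # fixed 64-bit
            value, pos = data[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited (string, bytes, nested message)
            length, pos = read_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            value, pos = data[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")

        fields.append((field_number, wire_type, value))

    return fields


# Hand-built message for illustration: field 1 = varint 150, field 2 = b"abc"
sample = base64.urlsafe_b64encode(b"\x08\x96\x01\x12\x03abc").decode().rstrip("=")

for field in parse_fields(decode_params(sample)):
    print(field)
```

Length-delimited values may themselves be nested messages, so calling parse_fields on them again is the way to look one level deeper.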

Unfortunately, when I went to test this using the examples/list-video-comments-highlighted.py example script I wrote a while back, it seemed that YouTube has changed their comments API around again. If I get some spare time I'll give the API another poke, but I hope this comment has at least given you a bit of a steer 🙂
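Given that the response layout evidently shifts, one way to make the extract_comments helper from the scripts above a little less brittle is to scan every endpoint rather than indexing position 1, and to keep items by key rather than dropping the last one. This is only a sketch against the response shape used earlier in this thread. It makes no claim to handle whatever the newer layout looks like:

```python
def extract_comments(next_continuation_data):
    """Collect every commentThreadRenderer found in a continuation response.

    Missing keys yield an empty list instead of raising KeyError or IndexError.
    """
    comments = []
    endpoints = next_continuation_data.get("onResponseReceivedEndpoints", [])

    for endpoint in endpoints:
        command = endpoint.get("reloadContinuationItemsCommand", {})

        for item in command.get("continuationItems", []):
            thread = item.get("commentThreadRenderer")

            # Skip non-comment items such as the trailing continuation item
            if thread is not None:
                comments.append(thread)

    return comments


# Synthetic response for illustration only
response = {
    "onResponseReceivedEndpoints": [
        {"reloadContinuationItemsCommand": {"continuationItems": [{"header": {}}]}},
        {
            "reloadContinuationItemsCommand": {
                "continuationItems": [
                    {"commentThreadRenderer": {"id": 1}},
                    {"commentThreadRenderer": {"id": 2}},
                    {"continuationItemRenderer": {}},
                ]
            }
        },
    ]
}

print(extract_comments(response))
```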