simonw / action-transcription-prototype

Open an issue with a YouTube URL, get back a transcript of that video

Action that triggers on issue creation #2

Closed simonw closed 2 years ago

simonw commented 2 years ago

Refs:

Started prototyping that here: https://github.com/simonw/try-out-issue-template-forms/issues/1 - with this issue template (only allowed in public repos at the moment): https://github.com/simonw/try-out-issue-template-forms/blob/afe6c8e3ed86ba03fb7318ce8131613ff4ecf16c/.github/ISSUE_TEMPLATE/url.yml

name: URL workflow
description: Process a URL
title: "[URL]: "
labels: ["url"]
body:
  - type: input
    id: url
    attributes:
      label: URL
      description: URL to an article or video
    validations:
      required: true
  - type: dropdown
    id: action
    attributes:
      label: What action would you like to take?
      options:
        - Extract transcript from video
        - Extract transcript from video and translate to English
    validations:
      required: true

The issue that this creates looks like this:

### URL

https://www.youtube.com/watch?v=OJIzTVyxIAw

### What action would you like to take?

Extract transcript from video and translate to English

simonw commented 2 years ago

Since you can have multiple issue template forms per repo, I don't think that select dropdown is even needed. For this first implementation all I need as input is the URL. The title can even be hard-coded to "Please automatically transcribe and translate this URL".

simonw commented 2 years ago

A neat detail might be if the script updated the issue title to the title of the video as part of running - if that title was set to the default.
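
A minimal sketch of that retitle decision, assuming the default title is the hard-coded string mentioned earlier in this thread. `DEFAULT_TITLE` and `nextIssueTitle` are hypothetical names, and fetching the video title itself is not shown.

```javascript
// Hypothetical constant: the hard-coded default title proposed for the issue form.
const DEFAULT_TITLE = "Please automatically transcribe and translate this URL";

// Return the title the issue should be renamed to, or null to leave it alone.
// Only an untouched default title is replaced, so custom titles are preserved.
function nextIssueTitle(currentTitle, videoTitle) {
  const isDefault = currentTitle.trim() === DEFAULT_TITLE;
  return isDefault && videoTitle ? videoTitle : null;
}
```

A non-null result could then be passed as `title` to `github.rest.issues.update()` from a github-script step.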

simonw commented 2 years ago

Here's what I need from https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#issues

on:
  issues:
    types:
    - opened

Then something like:

    steps:
      - run: |
          echo Issue $NUMBER
        env:
          NUMBER: ${{ github.event.issue.number }}

simonw commented 2 years ago

I'm going to use github-script here to retrieve the details of the issue. Then I'll run a Python script that extracts any URLs, checks if the user is allowed to execute the action (refs #5) and kicks off youtube-dl (#3).

I need a requirements.txt file to install and cache the dependencies - primarily youtube-dl. https://pypi.org/project/youtube_dl/

simonw commented 2 years ago

Relevant example from github-script README:

on:
  issues:
    types: [opened]

jobs:
  comment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/github-script@v6
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '👋 Thanks for reporting!'
            })

simonw commented 2 years ago

Maybe it posts a comment saying "Working on that" and then does the work.

simonw commented 2 years ago

To get the issue details: https://octokit.github.io/rest.js/v19#issues-get

octokit.rest.issues.get({
  owner,
  repo,
  issue_number,
});

I'm going to write them to a file that Python can read in the next step.
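
A rough sketch of that hand-off, assuming the issue object has the usual `number`, `title`, `body` and `user.login` fields. `writeIssueDetails` and the choice of fields are hypothetical; a later step could read the JSON file with any language.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: save the fields a later step needs as JSON.
function writeIssueDetails(issue, filePath) {
  const details = {
    number: issue.number,
    title: issue.title,
    body: issue.body,
    author: issue.user ? issue.user.login : null,
  };
  fs.writeFileSync(filePath, JSON.stringify(details, null, 2), "utf8");
  return filePath;
}
```

In a github-script step the first argument would be the `data` property of the `issues.get()` response.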

Or... I could even do the entire implementation in JavaScript. Might keep things a little bit simpler.

simonw commented 2 years ago

Code search: child_process path:.github/workflows/*.yml https://cs.github.com/?scopeName=All+repos&scope=&q=child_process+path%3A.github%2Fworkflows%2F*.yml

Gave me this example: https://github.com/jupyterlab/jupyterlab/blob/74405d37ad156d8bdc0e7c36f199abc1eb642c1f/.github/workflows/benchmark.yml#L31-L54

      - name: Get hashes for PR review event
        if: ${{ github.event_name == 'pull_request_review' }}
        uses: actions/github-script@v6
        with:
          script: |
            const child_process = require("child_process");
            const pull_request = context.payload.pull_request;
            child_process.exec(`git merge-base ${pull_request.head.sha} ${pull_request.base.sha}`, (error, stdout, stderr) => {
              if (error) {
                console.log(error);
                process.exit(1);
                return;
              }
              if (stderr) {
                console.log(stderr);
                process.exit(1);
                return;
              }
              core.exportVariable('OLD_REF_SHA', stdout.trim());
              core.exportVariable('NEW_REF_SHA', pull_request.head.sha);
              core.exportVariable('PULL_REQUEST_ID', pull_request.number);
            });

simonw commented 2 years ago

I could dump the log output from youtube-dl in a details/summary element in the issue comment too.
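
One possible shape for that comment body, as a sketch only. `buildCommentBody` and `escapeHtml` are hypothetical helpers, and the summary label is an arbitrary choice.

```javascript
// Escape characters that would otherwise be read as HTML inside <pre>.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: transcript first, then the log collapsed in <details>.
function buildCommentBody(transcript, logOutput) {
  return [
    transcript.trim(),
    "",
    "<details><summary>youtube-dl output</summary>",
    "",
    "<pre>" + escapeHtml(logOutput.trim()) + "</pre>",
    "",
    "</details>",
  ].join("\n");
}
```

The returned string could be used as the `body` argument to `github.rest.issues.createComment()`.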

simonw commented 2 years ago

I opened issue #6 and it triggered this run: https://github.com/simonw/transcribe-videos/actions/runs/3119145264/jobs/5058919300

https://github.com/simonw/transcribe-videos/blob/2d7ff88f5db689416be8b4585d3104b0494149f3/.github/workflows/issue_created.yml#L18-L37

Output of that script:

{}

2021.12.17

So the youtube-dl --version bit worked but the issue fetching did not.

simonw commented 2 years ago

I'll try logging the full context.

simonw commented 2 years ago

OK, the context already has all of the information I need - no need to try and fetch more:

{
  "payload": {
    "action": "opened",
    "issue": {
      "active_lock_reason": null,
      "assignee": null,
      "assignees": [],
      "author_association": "OWNER",
      "body": "Testing for #2",
      "closed_at": null,
      "comments": 0,
      "comments_url": "https://api.github.com/repos/simonw/transcribe-videos/issues/7/comments",
      "created_at": "2022-09-24T17:26:06Z",
      "events_url": "https://api.github.com/repos/simonw/transcribe-videos/issues/7/events",
      "name": "transcribe-videos",
      "node_id": "R_kgDOIDpoZw",
      "notifications_url": "https://api.github.com/repos/simonw/transcribe-videos/notifications{?since,all,participating}",
      "open_issues": 6,
      "open_issues_count": 6,
      "owner": {
        "avatar_url": "https://avatars.githubusercontent.com/u/9599?v=4",
        "events_url": "https://api.github.com/users/simonw/events{/privacy}",
        "followers_url": "https://api.github.com/users/simonw/followers",
        "following_url": "https://api.github.com/users/simonw/following{/other_user}",
        "gists_url": "https://api.github.com/users/simonw/gists{/gist_id}",
        "gravatar_id": "",
        "html_url": "https://github.com/simonw",
        "id": 9599,
        "login": "simonw",
        "node_id": "MDQ6VXNlcjk1OTk=",
        "organizations_url": "https://api.github.com/users/simonw/orgs",
        "received_events_url": "https://api.github.com/users/simonw/received_events",
        "repos_url": "https://api.github.com/users/simonw/repos",
        "site_admin": false,
        "starred_url": "https://api.github.com/users/simonw/starred{/owner}{/repo}",
        "subscriptions_url": "https://api.github.com/users/simonw/subscriptions",
        "type": "User",
        "url": "https://api.github.com/users/simonw"
      },
      "private": true,
      "pulls_url": "https://api.github.com/repos/simonw/transcribe-videos/pulls{/number}",
      "pushed_at": "2022-09-24T17:25:48Z",
      "releases_url": "https://api.github.com/repos/simonw/transcribe-videos/releases{/id}",
      "size": 2,
      "ssh_url": "git@github.com:simonw/transcribe-videos.git",
      "stargazers_count": 0,
      "stargazers_url": "https://api.github.com/repos/simonw/transcribe-videos/stargazers",
      "statuses_url": "https://api.github.com/repos/simonw/transcribe-videos/statuses/{sha}",
      "subscribers_url": "https://api.github.com/repos/simonw/transcribe-videos/subscribers",
      "subscription_url": "https://api.github.com/repos/simonw/transcribe-videos/subscription",
      "svn_url": "https://github.com/simonw/transcribe-videos",
      "tags_url": "https://api.github.com/repos/simonw/transcribe-videos/tags",
      "teams_url": "https://api.github.com/repos/simonw/transcribe-videos/teams",
      "topics": [],
      "trees_url": "https://api.github.com/repos/simonw/transcribe-videos/git/trees{/sha}",
      "updated_at": "2022-09-24T17:09:26Z",
      "url": "https://api.github.com/repos/simonw/transcribe-videos",
      "visibility": "private",
      "watchers": 0,
      "watchers_count": 0,
      "web_commit_signoff_required": false
    },
    "sender": {
      "avatar_url": "https://avatars.githubusercontent.com/u/9599?v=4",
      "events_url": "https://api.github.com/users/simonw/events{/privacy}",
      "followers_url": "https://api.github.com/users/simonw/followers",
      "following_url": "https://api.github.com/users/simonw/following{/other_user}",
      "gists_url": "https://api.github.com/users/simonw/gists{/gist_id}",
      "gravatar_id": "",
      "html_url": "https://github.com/simonw",
      "id": 9599,
      "login": "simonw",
      "node_id": "MDQ6VXNlcjk1OTk=",
      "organizations_url": "https://api.github.com/users/simonw/orgs",
      "received_events_url": "https://api.github.com/users/simonw/received_events",
      "repos_url": "https://api.github.com/users/simonw/repos",
      "site_admin": false,
      "starred_url": "https://api.github.com/users/simonw/starred{/owner}{/repo}",
      "subscriptions_url": "https://api.github.com/users/simonw/subscriptions",
      "type": "User",
      "url": "https://api.github.com/users/simonw"
    }
  },
  "eventName": "issues",
  "sha": "baccabae6edae65ae5b6279120a21e92cc648e1d",
  "ref": "refs/heads/main",
  "workflow": ".github/workflows/issue_created.yml",
  "action": "__actions_github-script",
  "actor": "simonw",
  "job": "comment",
  "runNumber": 2,
  "runId": 3119153836,
  "apiUrl": "https://api.github.com/",
  "serverUrl": "https://github.com/",
  "graphqlUrl": "https://api.github.com/graphql"
}

simonw commented 2 years ago

I can look at context.payload.sender.login to see if they are allowed to do this.

I can also check and see if context.payload.issue.author_association is "OWNER" so owners of the repo can always trigger actions.
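
A minimal sketch of that permission check against the payload shape logged above. The allow-list and the name `isAllowed` are hypothetical; this thread does not define who else should be permitted.

```javascript
// Hypothetical allow-list of GitHub logins; not defined anywhere in this thread.
const ALLOWED_LOGINS = new Set(["simonw"]);

// Owners of the repo always pass; anyone else must be on the allow-list.
function isAllowed(payload, allowedLogins = ALLOWED_LOGINS) {
  const association = payload.issue && payload.issue.author_association;
  if (association === "OWNER") {
    return true;
  }
  const login = payload.sender && payload.sender.login;
  return typeof login === "string" && allowedLogins.has(login);
}
```

Inside a github-script step this would be called as `isAllowed(context.payload)`.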

simonw commented 2 years ago

I'm going to parse the issue body the easiest way: split into lines and treat the first line that starts with http:// or https:// as being the URL to process.
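
That parsing rule could be sketched like this. `extractUrl` is a hypothetical name, and trimming whitespace around each line is an added assumption.

```javascript
// Return the first line of the issue body that looks like a URL, or null.
function extractUrl(body) {
  for (const line of (body || "").split(/\r?\n/)) {
    const trimmed = line.trim(); // assumption: ignore surrounding whitespace
    if (trimmed.startsWith("http://") || trimmed.startsWith("https://")) {
      return trimmed;
    }
  }
  return null;
}
```

With the issue form body shown earlier in this thread, the heading and blank lines are skipped and the YouTube URL line is returned.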

simonw commented 2 years ago

Fun aside: need to try to avoid command injection attacks here, since I'm passing user input to youtube-dl.

I should use child_process.spawn() instead of .exec(), since spawn() takes a list of arguments and does not run them through a shell.

https://nodejs.org/api/child_process.html#child_processspawnsynccommand-args-options - I can use child_process.spawnSync(command[, args][, options]).
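
A small demonstration of why the argument list helps, as a sketch. Node itself stands in for youtube-dl so that it runs anywhere, and the hostile URL is made up.

```javascript
const { spawnSync } = require("child_process");

// Made-up input containing shell metacharacters.
const hostileUrl = "https://example.com/watch?v=abc; echo pwned";

// The child process prints the single argument it received. No shell is
// involved, so "; echo pwned" stays part of that argument.
const result = spawnSync(
  process.execPath,
  ["-p", "process.argv[1]", hostileUrl],
  { encoding: "utf8" }
);

console.log(JSON.stringify(result.stdout.trim()));
```

The same URL interpolated into a child_process.exec() command string would be handed to a shell, which would treat the semicolon as a command separator.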

simonw commented 2 years ago

Useful way to test that code locally:

node -e "
const child_process = require('child_process');
console.log(child_process.spawnSync('youtube-dl', ['--version'], {
    encoding: 'utf8'
}));
"
{
  status: 0,
  signal: null,
  output: [ null, '2021.12.17\n', '' ],
  pid: 15152,
  stdout: '2021.12.17\n',
  stderr: ''
}

simonw commented 2 years ago

Tried this:

youtube-dl --all-subs --skip-download 'https://www.youtube.com/watch?v=m0mwlSZ0bQQ'

Got a single file:

It's a pile of mining waste. Want to go skiing on it-m0mwlSZ0bQQ.en.vtt

Then I tried getting the auto-generated subs too:

youtube-dl --write-auto-sub --all-subs --skip-download 'https://www.youtube.com/watch?v=m0mwlSZ0bQQ'

This got me a LOT of files. Truncated:

[youtube] m0mwlSZ0bQQ: Downloading webpage
[info] Writing video subtitles to: It's a pile of mining waste. Want to go skiing on it-m0mwlSZ0bQQ.af.vtt
[info] Writing video subtitles to: It's a pile of mining waste. Want to go skiing on it-m0mwlSZ0bQQ.ak.vtt
[info] Writing video subtitles to: It's a pile of mining waste. Want to go skiing on it-m0mwlSZ0bQQ.sq.vtt
...

It was 126 total!

The .es.* one ends like this:

00:03:20.879 --> 00:03:23.330 align:start position:0%
Voy a bajar el pequeño
teleférico <00:03:21.404><c>¿Cómo </c><00:03:21.929><c>puedo? </c><00:03:22.454><c>Estoy </c><00:03:22.979><c>realmente</c>

00:03:23.330 --> 00:03:23.340 align:start position:0%
teleférico ¿Cómo puedo? Estoy realmente

00:03:23.340 --> 00:03:24.830 align:start position:0%
teleférico ¿Cómo puedo? Estoy realmente
aterrorizado. <00:03:23.603><c>Me </c><00:03:23.866><c>voy </c><00:03:24.129><c>a </c><00:03:24.392><c>resbalar. </c><00:03:24.655><c>¿</c>

00:03:24.830 --> 00:03:24.840 align:start position:0%
aterrorizado. Me voy a resbalar. ¿

00:03:24.840 --> 00:03:27.710 align:start position:0%
aterrorizado. Me voy a resbalar. ¿
Cómo <00:03:25.740><c>salgo </c><00:03:26.640><c>de </c><00:03:27.540><c>esto?</c>

00:03:27.710 --> 00:03:30.319 align:start position:0%
Cómo salgo de esto?

simonw commented 2 years ago

Trying with https://www.youtube.com/watch?v=OJIzTVyxIAw - the Russian one I tried in #1. Without --write-auto-sub I get nothing, because that video does not have captions.

With --write-auto-sub --sub-lang en,ru I get ALL of those files - it looks like the --write-auto-sub option always gets everything, ignoring the --sub-lang option entirely.

More on that here: https://askubuntu.com/questions/1023339/youtube-dl-keep-both-auto-generated-subtitles-and-prewritten-ones

It sounds like if there are manual subtitles AND auto subtitles, the manual ones overwrite the auto ones, unless you run the command twice and save the files under different names.

simonw commented 2 years ago

These subtitles are a bit messy:

d % cat *.en.*
WEBVTT
Kind: captions
Language: en

00:00:00.350 --> 00:00:06.079 align:start position:0%

March <00:00:00.879><c>18, </c><00:00:01.408><c>2018 </c><00:00:01.937><c>the </c><00:00:02.466><c>shadow </c><00:00:02.995><c>of </c><00:00:03.524><c>the </c><00:00:04.053><c>main </c><00:00:04.582><c>choice </c><00:00:05.111><c>of </c><00:00:05.640><c>the</c>

00:00:06.079 --> 00:00:06.089 align:start position:0%
March 18, 2018 the shadow of the main choice of the

00:00:06.089 --> 00:00:10.700 align:start position:0%
March 18, 2018 the shadow of the main choice of the
country <00:00:06.672><c>March </c><00:00:07.255><c>18 </c><00:00:07.838><c>the </c><00:00:08.421><c>day </c><00:00:09.004><c>that </c><00:00:09.587><c>decides </c><00:00:10.170><c>the</c>

00:00:10.700 --> 00:00:10.710 align:start position:0%
country March 18 the day that decides the

00:00:10.710 --> 00:00:14.539 align:start position:0%
country March 18 the day that decides the
fate <00:00:11.235><c>of </c><00:00:11.760><c>Russia </c><00:00:12.285><c>the </c><00:00:12.810><c>day </c><00:00:13.335><c>that </c><00:00:13.860><c>determines</c>

00:00:14.539 --> 00:00:14.549 align:start position:0%
fate of Russia the day that determines

00:00:14.549 --> 00:00:16.599 align:start position:0%
fate of Russia the day that determines
our <00:00:15.059><c>future</c>

00:00:16.599 --> 00:00:16.609 align:start position:0%
our future

It looks like those are encoding actual animation, where the subtitles scroll to match the text as it is spoken.

I don't want that though!
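
One way to flatten that format into plain lines, as a rough sketch rather than a full WebVTT parser. `vttToText` is a hypothetical name. It drops adjacent repeats, so a phrase genuinely spoken twice in a row would be collapsed too.

```javascript
// Reduce an auto-generated .vtt file to its distinct text lines.
function vttToText(vtt) {
  const lines = [];
  for (const raw of vtt.split(/\r?\n/)) {
    // Skip the file header, metadata lines and cue timing lines.
    if (
      raw.startsWith("WEBVTT") ||
      raw.startsWith("Kind:") ||
      raw.startsWith("Language:") ||
      raw.includes("-->")
    ) {
      continue;
    }
    // Remove inline tags such as <00:00:00.879>, <c> and </c>.
    const text = raw.replace(/<[^>]+>/g, "").replace(/\s+/g, " ").trim();
    if (text === "") {
      continue;
    }
    // Each phrase reappears in the following cues; keep only the first copy.
    if (lines[lines.length - 1] !== text) {
      lines.push(text);
    }
  }
  return lines;
}
```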

simonw commented 2 years ago

d % youtube-dl --list-subs 'https://www.youtube.com/watch?v=OJIzTVyxIAw' | grep en
en       vtt, ttml, srv3, srv2, srv1

I'm getting vtt by default - maybe one of the other formats would avoid the animation issue and just give me the plain text?

simonw commented 2 years ago

Tried two other formats like this:

mkdir ttml
cd ttml
youtube-dl --sub-format ttml  --all-subs --skip-download --write-auto-sub 'https://www.youtube.com/watch?v=OJIzTVyxIAw'
cd ..
mkdir srv3
cd srv3
youtube-dl --sub-format srv3  --all-subs --skip-download --write-auto-sub 'https://www.youtube.com/watch?v=OJIzTVyxIAw'

simonw commented 2 years ago

Here's ttml:

d % cat ttml/*.en.*
<?xml version="1.0" encoding="utf-8" ?>
<tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml" xmlns:ttm="http://www.w3.org/ns/ttml#metadata" xmlns:tts="http://www.w3.org/ns/ttml#styling" xmlns:ttp="http://www.w3.org/ns/ttml#parameter" ttp:profile="http://www.w3.org/TR/profile/sdp-us" >
<head>
<styling>
<style xml:id="s1" tts:textAlign="center" tts:extent="90% 90%" tts:origin="5% 5%" tts:displayAlign="after"/>
<style xml:id="s2" tts:fontSize=".72c" tts:backgroundColor="black" tts:color="white"/>
</styling>
<layout>
<region xml:id="r1" style="s1"/>
</layout>
</head>
<body region="r1">
<div>
<p begin="00:00:00.350" end="00:00:10.710" style="s2">March 18, 2018 the shadow of the main choice of the</p>
<p begin="00:00:06.089" end="00:00:14.549" style="s2">country March 18 the day that decides the</p>
<p begin="00:00:10.710" end="00:00:16.609" style="s2">fate of Russia the day that determines</p>
<p begin="00:00:14.549" end="00:00:21.060" style="s2">our future</p>
<p begin="00:00:16.609" end="00:00:22.220" style="s2">March 18 presidential elections in the Russian</p>
<p begin="00:00:21.060" end="00:00:25.140" style="s2">Federation</p>
<p begin="00:00:22.220" end="00:00:28.230" style="s2">every citizen of the country who has reached the age of</p>
<p begin="00:00:25.140" end="00:00:31.579" style="s2">18 has the right to</p>
<p begin="00:00:28.230" end="00:00:37.840" style="s2">vote in the presidential elections in Russia</p>
<p begin="00:00:31.579" end="00:00:40.999" style="s2">March 18 is the day when every vote matters</p>
<p begin="00:00:37.840" end="00:00:40.999" style="s2">[music  ]</p>
</div>
</body>
</tt>

And here's srv3 (truncated):

d % cat srv3/*.en.*
<?xml version="1.0" encoding="utf-8" ?><timedtext format="3">
<head>
<ws id="0"/>
<ws id="1" mh="2" ju="0" sd="3"/>
<wp id="0"/>
<wp id="1" ap="6" ah="20" av="100" rc="2" cc="40"/>
</head>
<body>
<w t="0" id="1" wp="1" ws="1"/>
<p t="350" d="10360" w="1"><s ac="252">March </s><s t="529" ac="252">18, </s><s t="1058" ac="252">2018 </s><s t="1587" ac="252">the </s><s t="2116" ac="252">shadow </s><s t="2645" ac="252">of </s><s t="3174" ac="252">the </s><s t="3703" ac="252">main </s><s t="4232" ac="252">choice </s><s t="4761" ac="252">of </s><s t="5290" ac="252">the</s></p>
<p t="6079" d="4631" w="1" a="1">
</p>
<p t="6089" d="8460" w="1"><s ac="238">country </s><s t="583" ac="227">March </s><s t="1166" ac="227">18 </s><s t="1749" ac="227">the </s><s t="2332" ac="227">day </s><s t="2915" ac="227">that </s><s t="3498" ac="227">decides </s><s t="4081" ac="227">the</s></p>

I think I like ttml the best - looks easy to parse too.
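
A sketch of how the text could be pulled out of a ttml file like the one above. It uses a regular expression rather than a real XML parser, which is an assumption that holds for this simple output but may not for every file. `ttmlToText` and `decodeEntities` are hypothetical names.

```javascript
// Decode the handful of XML entities likely to appear in caption text.
function decodeEntities(text) {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&");
}

// Collect the text of every <p> element, one caption per line.
function ttmlToText(ttml) {
  const lines = [];
  const pattern = /<p\b[^>]*>([\s\S]*?)<\/p>/g;
  for (const match of ttml.matchAll(pattern)) {
    // Replace any nested tags with a space, then normalize whitespace.
    const text = decodeEntities(match[1].replace(/<[^>]+>/g, " "))
      .replace(/\s+/g, " ")
      .trim();
    if (text) {
      lines.push(text);
    }
  }
  return lines.join("\n");
}
```

The joined string is plain text with no timing data, so it can go straight into an issue comment.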

simonw commented 2 years ago

I'm going to store the full subtitle files in the repo. The issue comment reply will just contain the text, in an easy-to-paste format.

simonw commented 2 years ago

I'm going to finish this work in: