mitodl / ocw-studio

Open Source Courseware authoring tool
BSD 3-Clause "New" or "Revised" License

No captions, no transcript on 3.091 fall 2018 #1518

Closed pdpinch closed 1 year ago

pdpinch commented 1 year ago

Steps to Reproduce

See, for example, https://ocw.mit.edu/courses/3-091-introduction-to-solid-state-chemistry-fall-2018/resources/lecture-1/

or https://ocw.mit.edu/courses/3-091-introduction-to-solid-state-chemistry-fall-2018/resources/carbon-dioxide-concentration-lec28/

Expected Behavior

Actual Behavior

Screenshot or Screencast

image
fakhar-ud-din commented 1 year ago

Hey @pdpinch

I am working on the fix, but it seems I do not have access to the content drive yet.

The captions are unavailable since we do not have a path set for them (I have attached a screenshot for this). Could you please update me with the status regarding the access, so I could proceed with the fix?

Screenshot 2023-02-14 at 1 48 20 PM

Thank You

pdpinch commented 1 year ago

I will give you access, but I don't think you're going to find the missing files on gdrive. This is a legacy course, and the files should have been imported directly to S3.

fakhar-ud-din commented 1 year ago

Hey @pdpinch

I have tested this locally by using a random lecture's caption file and setting RESOURCE_BASE_URL to the S3 bucket link. This configuration works fine locally. I think RESOURCE_BASE_URL is not set on production, which causes the conditional rendering of the captions to fail.

I am still looking into this, and will keep you updated.

Thank you

fakhar-ud-din commented 1 year ago

I have successfully reproduced the issue locally, but I am not quite certain whether I am headed in the right direction.

Steps to reproduce locally

  1. Clone 3.091 fall 2018
  2. Comment out RESOURCE_BASE_URL in the env file
  3. Comment out RESOURCE_BASE_URL config in the env.ts file
  4. Start the course, yarn start course 3.091-fall-2018
  5. Navigate to http://localhost:3000/resources/lecture-1/
  6. The captions button and transcript box should be unavailable

NOTE:

  1. The video_captions_file in ocw-content-rc/3.091-fall-2018/content/resources/lecture-1.md should be null
  2. If either RESOURCE_BASE_URL or video_captions_file (or both) is set to something, the captions and transcripts become available

ChristopherChudzicki commented 1 year ago

I want to mention that this is not limited to 3.091-fall-2018.

Here are a few more examples of videos that DO have high-quality transcripts on YouTube, but do NOT have associated transcripts on OCW.

@pt2302 @gumaerc and @abeglova have discussed these a little bit, but to my knowledge no concrete plan has been formed. (@pt2302 may correct me when he's back from vacation.) Roughly, I believe we were thinking

I haven't used the 3play APIs myself, but that is my understanding of what @pt2302 @abeglova @gumaerc and I have discussed in the past.

Here is a list of OCW resources that do not have transcripts https://gist.github.com/ChristopherChudzicki/25a0310d2de4a568d6b19e8f009c86a8. Note:

fakhar-ud-din commented 1 year ago

Hey @ChristopherChudzicki

Thank you for the detailed response. Yes, we actually have missing caption files for the majority of the videos, and this issue is widespread across other courses as well.

My concern is related to the UI elements of captions and transcripts, since those should be rendering regardless of the availability of the captions and transcript files. Right?

The UI elements tend to disappear only when we are missing RESOURCE_BASE_URL or This is not set.

What are your thoughts on this?

ChristopherChudzicki commented 1 year ago

Yes, we actually have missing caption files for the majority of the videos,

I think it's more like ~15% of videos, but still more than just this course.

My concern is related to the UI elements of captions and transcripts, since those should be rendering regardless of the availability of the captions and transcript files. Right?

IMO, it's fine not to show the "CC" button if there are no captions. Similarly with the expandable transcript. If there's no transcript, no need to show the "Transcript" button. (If we do show it, and it's empty, then that's confusing. So if we do show it, we would want to show "No transcript available" or something. Might as well just not show it.)

This issue, as I understand it, is "Transcripts for 3.091 (and several other courses) used to exist in legacy ocw, but do not exist in current ocw. Let's figure out how to make those transcripts available in ocw-next."

pdpinch commented 1 year ago

Is it possible that this course (and the others) have a .srt resource but no .vtt resource?

fakhar-ud-din commented 1 year ago

IMO, it's fine not to show the "CC" button if there are no captions. Similarly with the expandable transcript. If there's no transcript, no need to show the "Transcript" button. (If we do show it, and it's empty, then that's confusing. So if we do show it, we would want to show "No transcript available" or something. Might as well just not show it.)

@ChristopherChudzicki It makes sense not to show those UI elements when the transcripts and captions are unavailable. Where can I find the legacy ocw config and files?

Is it possible that this course (and the others) have a .srt resource but no .vtt resource?

@pdpinch not sure about the .srt, but we are certainly missing .vtt for all the resources having this issue

ChristopherChudzicki commented 1 year ago

@pdpinch @fakhar-ud-din Regarding SRT caption files for 3.091-2018, they do exist. For example

lecture 1
  video:
    https://github.mit.edu/mitocwcontent/3.091-fall-2018/blob/main/content/resources/lecture-1.md
  srt caption:
    https://github.mit.edu/mitocwcontent/3.091-fall-2018/blob/main/content/resources/ao41frjfgvq.md
  transcript:
    https://github.mit.edu/mitocwcontent/3.091-fall-2018/blob/main/content/resources/ao41frjfgvq-1.md

The video is explicitly associated with the transcript but not the SRT. However, the SRT filenames appear to be a lowercased version of the youtube ID, which does exist in the video file metadata.
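A minimal sketch of the filename-based matching described above, assuming each video is a plain dict with a `name` and a `youtube_id` and that the SRT files are given as a list of paths. The dict shape, the function name `match_srt_by_youtube_id`, and the example ID are hypothetical; in ocw-studio the data would come from WebsiteContent metadata rather than plain dicts.

```python
from pathlib import PurePosixPath


def match_srt_by_youtube_id(videos, srt_paths):
    """Map each video's name to the SRT path whose filename stem
    equals the video's YouTube ID, compared in lowercase."""
    # Index SRT files by their lowercased stem (filename without extension).
    by_stem = {PurePosixPath(path).stem.lower(): path for path in srt_paths}

    matches = {}
    for video in videos:
        youtube_id = video.get("youtube_id")
        if not youtube_id:
            continue  # nothing to match on
        path = by_stem.get(youtube_id.lower())
        if path is not None:
            matches[video["name"]] = path
    return matches


# Hypothetical example data, not taken from the course.
videos = [
    {"name": "lecture-1", "youtube_id": "AbC123xYz_0"},
    {"name": "lecture-2", "youtube_id": "NoMatch00000"},
]
srt_paths = ["content/resources/abc123xyz_0.srt"]
matched = match_srt_by_youtube_id(videos, srt_paths)
```

Videos without a matching SRT file are simply left out of the result, which makes it easy to see which ones would still need another source for their captions.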

I believe all videos in 3.091 have SRT captions (there are 92 srt files and 82 video files in the course, so, probably...).

This is not necessarily true for other courses with missing captions, so maybe we have multiple root causes for missing captions. E.g., 5-80-small-molecule-spectroscopy-and-dynamics-fall-2008 appears to have no caption files, srt or vtt.

@fakhar-ud-din I found the above information via SQL queries on RC (which was recently synced with prod). Querying models in Django shell would also work.

ChristopherChudzicki commented 1 year ago

As for what to do: I think "convert SRT -> VTT" would work fine for 3.091 2018. But again, there are more courses whose captions DO appear to exist on YouTube but that do not have SRT files. A more general 3play approach (like the one sketched in https://github.com/mitodl/ocw-studio/issues/1518#issuecomment-1446392041) might be able to address the problem for more courses.
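For reference, a rough sketch of what "convert SRT -> VTT" could look like for simple caption files. It only adds the `WEBVTT` header and switches the millisecond separator in timing lines from a comma to a period; SRT cue numbers are kept, since WebVTT allows an identifier line before each cue. The function name `srt_to_vtt` is hypothetical, and this does not handle styling tags or malformed input.

```python
import re

# Matches an SRT timestamp such as 00:00:01,000 and captures both halves.
_TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert simple SRT caption text to WebVTT text."""
    # Drop a leading byte-order mark and normalize line endings.
    text = srt_text.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")

    converted = []
    for line in text.split("\n"):
        # Only timing lines contain "-->"; leave caption text untouched.
        if "-->" in line:
            line = _TIMESTAMP.sub(r"\1.\2", line)
        converted.append(line)

    body = "\n".join(converted).strip("\n")
    return "WEBVTT\n\n" + body + "\n"


sample_srt = (
    "1\n00:00:01,000 --> 00:00:04,500\nHello, world\n\n"
    "2\n00:00:05,000 --> 00:00:06,000\nSecond cue\n"
)
sample_vtt = srt_to_vtt(sample_srt)
```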

pt2302 commented 1 year ago

We can check whether a transcript with a given YouTube ID exists on 3Play, using this function: https://github.com/mitodl/ocw-studio/blob/210b230d65ecc1cee601054e64cb6adc1cf131e2/videos/threeplay_api.py#L118

If it exists, we can download the captions/transcript and associate them with a given video by something like this function: https://github.com/mitodl/ocw-studio/blob/210b230d65ecc1cee601054e64cb6adc1cf131e2/videos/threeplay_api.py#L153

Assuming most/all legacy captions are on 3Play, this seems like the most straightforward solution.

pdpinch commented 1 year ago

This is great, folks. So the captions exist, they are probably in our 3play account and we have code for looking them up and fetching them.

How about we start with a management command that tries to fetch the .vtt caption file from 3play and add it to the video as a new resource, using the functions Paul mentioned?

Feel free to suggest alternatives.

fakhar-ud-din commented 1 year ago

Thank you for the suggestions @pt2302 @ChristopherChudzicki. I do not believe I have the database dump locally, so I could not confirm using the Django shell.

I will be moving forward with what @pt2302 suggested and will work on a management command which could fetch and sync available captions and transcripts.

fakhar-ud-din commented 1 year ago

I have been investigating the issue, and it seems we already have a Celery task in place that runs these methods when captions and transcripts are missing. The following code runs a Celery task that attempts to fetch existing transcripts and captions and, if they are not found, uses the 3play API to get them. https://github.com/mitodl/ocw-studio/blob/c0f96b807ad78dbf4873bf46065337c80d47dd86/main/settings.py#L699

I think we might have to go with this

As for what to do: I think "convert SRT -> VTT" would work fine for 3.091 2018.

What do you think about this, @ChristopherChudzicki ?

pt2302 commented 1 year ago

@fakhar-ud-din Could you please explain your proposal in more detail? Why is the task not finding the transcripts on 3Play? According to https://github.com/mitodl/ocw-studio/blob/b636e91ef2d2fbbaae43d4dd8c0cd3a0fa61b427/main/settings.py#L617 it should be running every 12 hours, and there does not appear to be an override for this value in production.

fakhar-ud-din commented 1 year ago

It seems the Celery task is working fine.

https://github.com/mitodl/ocw-studio/blob/210b230d65ecc1cee601054e64cb6adc1cf131e2/videos/threeplay_api.py#L153

Assuming most/all legacy captions are on 3Play, this seems like the most straightforward solution.

I tried running the above code in the Heroku ocw-studio (production) shell and got empty results. Moreover, 3.091 fall 2018 seems to be missing when running the following queries:

Screenshot 2023-03-14 at 4 58 01 PM Screenshot 2023-03-14 at 5 32 25 PM

I am still looking into it, will keep you updated

fakhar-ud-din commented 1 year ago

I have found a potential solution for this.

I found the video-related content in WebsiteContent. All of the items missing captions and transcripts do not have a defined parent (Video object).

Based on that content, we will create Video objects, which will establish the missing link for all the videos transferred from legacy. I am starting to confirm my findings and will proceed with the implementation shortly.

pdpinch commented 1 year ago

@fakhar-ud-din can you include some more details, ideally referencing objects in the django database?

@pt2302 wrote a management command for linking captions resources to video resources (PR https://github.com/mitodl/ocw-studio/pull/1670). Can that help here?

pt2302 commented 1 year ago

@pdpinch Logic similar to that management command would be useful here, but we would need to modify it to include a 3Play lookup to match video resources to captions/transcripts, which may or may not all be in Studio (and so may need to be imported from 3Play as well). That management command addressed the specific case where there is another course that has the captions/transcripts properly linked to the same YouTube videos.

Also, I was able to find captions for this particular course on 3Play.

fakhar-ud-din commented 1 year ago

Hey @pdpinch / @pt2302

I ssh'ed into the Heroku shell for ocw-studio and filtered for the current website (3-091-introduction-to-solid-state-chemistry-fall-2018), which had no Video object.

Screenshot 2023-03-16 at 4 40 26 PM

If it exists, we can download the captions/transcript and associate them with a given video by something like this function:

https://github.com/mitodl/ocw-studio/blob/210b230d65ecc1cee601054e64cb6adc1cf131e2/videos/threeplay_api.py#L153

The above will not work, since there is no Video object present to associate the fetched transcripts with. Moreover, when I requested the 3play transcripts using this course's youtube_ids, the majority of them had an empty data object.

Screenshot 2023-03-16 at 4 40 50 PM

I also queried the database for video_captions_file in the video resources of the existing website's content. Only 122 video resources had video_captions_file set.

Screenshot 2023-03-16 at 4 59 33 PM

My proposed solution is to make a management command that creates Video objects for a website, if not already present, and then updates the Video object's details (webvtt_transcript_file, website, pdf_transcript_file) with the existing caption files. If we are unable to find a caption file for some video resource, we will incorporate a request to 3play and will update the video object details with received response.

Please let me know if I missed anything

Thank You

pt2302 commented 1 year ago

I think that creating the Video object as an intermediate step is not necessary; we could do the associations directly in the WebsiteContent objects for the videos, as in this PR: https://github.com/mitodl/ocw-studio/pull/1670.

fakhar-ud-din commented 1 year ago

Oh, then that simplifies it. Thank you. I will proceed with the solution and open a PR by EOD today.

pdpinch commented 1 year ago

If we are unable to find a caption file for some video resource, we will incorporate a request to 3play and will update the video object details with received response.

How will you try to find the caption file for the video resource? If it is very complicated, then it might be easier to just make the request to 3Play.

fakhar-ud-din commented 1 year ago

I have a PR which addresses this issue https://github.com/mitodl/ocw-studio/pull/1717

How will you try to find the caption file for the video resource? If it is very complicated, then it might be easier to just make the request to 3Play.

Hey @pdpinch, every WebsiteContent object has metadata and a resourcetype. For a video resource, we can identify whether caption or transcript files exist by the file paths set against "video_captions_file" and "video_transcript_file" respectively. We query all the objects missing either of those and request the content using the 3play API.
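A simplified sketch of that selection step, using plain dicts in place of WebsiteContent objects. It is not the code from the PR: the flat metadata layout, the `"Video"` resourcetype value, the `youtube_id` key, and the `fetch_transcript` callable (standing in for the 3play request) are all assumptions made for illustration.

```python
CAPTION_KEYS = ("video_captions_file", "video_transcript_file")


def videos_missing_captions(resources):
    """Return the video resources that lack a captions or transcript path."""
    missing = []
    for resource in resources:
        metadata = resource.get("metadata") or {}
        if metadata.get("resourcetype") != "Video":
            continue  # only video resources carry caption/transcript paths
        if any(not metadata.get(key) for key in CAPTION_KEYS):
            missing.append(resource)
    return missing


def sync_missing_captions(resources, fetch_transcript):
    """Fill in missing paths using fetch_transcript, a hypothetical callable
    that takes a YouTube ID and returns a dict of file paths, or None."""
    updated = []
    for resource in videos_missing_captions(resources):
        metadata = resource["metadata"]
        youtube_id = metadata.get("youtube_id")
        if not youtube_id:
            continue
        fetched = fetch_transcript(youtube_id)
        if not fetched:
            continue  # nothing available for this video
        for key in CAPTION_KEYS:
            # Never overwrite a path that is already set.
            if not metadata.get(key) and fetched.get(key):
                metadata[key] = fetched[key]
        updated.append(resource["name"])
    return updated
```

Passing the fetch step in as a callable keeps the selection logic testable without network access; in a management command it would be replaced by the real 3play lookup.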

pdpinch commented 1 year ago

I don't think the management command has been run yet on this course, 3.091 Fall 2018, so I'm reopening this issue until it has been fixed up.

fakhar-ud-din commented 1 year ago

@pdpinch, I tried running the command in production, and it ran successfully. It created new content objects for the caption files that were fetched. However, I think we are facing a similar issue (files not uploading to S3) to the one being discussed in the Slack channel.