Closed pdpinch closed 1 year ago
Hey @pdpinch
I am working on the fix, but it seems I do not have access to the content drive yet.
The captions are unavailable since we do not have a path set for them (I have attached a screenshot for this). Could you please update me on the status of the access so I can proceed with the fix?
Thank You
I will give you access, but I don't think you're going to find the missing files on gdrive. This is a legacy course, and the files should have been imported directly to S3.
Hey @pdpinch
I have tested this locally by using a random lecture's caption file and setting RESOURCE_BASE_URL to the S3 bucket link. This configuration is working fine locally.
I think we do not have RESOURCE_BASE_URL set on production, which causes the conditional rendering of the captions to fail.
I am still looking into this, and will keep you updated.
Thank you
I have successfully reproduced the issue locally, but I am not quite certain whether I am headed in the right direction.
RESOURCE_BASE_URL in the env file
RESOURCE_BASE_URL config in the env.ts file
yarn start course 3.091-fall-2018
NOTE:
video_captions_file in ocw-content-rc/3.091-fall-2018/content/resources/lecture-1.md should be null
When RESOURCE_BASE_URL and/or video_captions_file is set to something, the captions and transcripts become available
I want to mention that this is not limited to 3.091-fall-2018.
Here are a few more examples of videos that DO have high-quality transcripts on YouTube, but do NOT have associated transcripts on OCW.
@pt2302 @gumaerc and @abeglova have discussed these a little bit, but to my knowledge no concrete plan has been formed. (@pt2302 may correct me when he's back from vacation). Roughly, I believe we were thinking
I haven't used the 3play APIs myself, but that is my understanding of what @pt2302 @abeglova @gumaerc and I have discussed in the past.
Here is a list of OCW resources that do not have transcripts https://gist.github.com/ChristopherChudzicki/25a0310d2de4a568d6b19e8f009c86a8. Note:
Hey @ChristopherChudzicki
Thank you for the detailed response. Yes, we actually have missing caption files for the majority of the videos, and this issue is widespread across other courses as well.
My concern is related to the UI elements of captions and transcripts, since those should be rendering regardless of the availability of the captions and transcript files. Right?
The UI elements tend to disappear only when we are missing RESOURCE_BASE_URL or it is not set.
What are your thoughts on this?
> Yes, we actually have missing caption files for the majority of the videos,
I think it's more like ~15% of videos, but still more than just this course.
> My concern is related to the UI elements of captions and transcripts, since those should be rendering regardless of the availability of the captions and transcript files. Right?
IMO, it's fine not to show the "CC" button if there are no captions. Similarly with the expandable transcript. If there's no transcript, no need to show the "Transcript" button. (If we do show it, and it's empty, then that's confusing. So if we do show it, we would want to show "No transcript available" or something. Might as well just not show it.)
This issue, as I understand it, is "Transcripts for 3.091 (and several other courses) used to exist in legacy ocw, but do not exist in current ocw. Let's figure out how to make those transcripts available in ocw-next."
Is it possible that this course (and the others) have a .srt resource but no .vtt resource?
> IMO, it's fine not to show the "CC" button if there are no captions. Similarly with the expandable transcript. If there's no transcript, no need to show the "Transcript" button. (If we do show it, and it's empty, then that's confusing. So if we do show it, we would want to show "No transcript available" or something. Might as well just not show it.)
@ChristopherChudzicki It makes sense not to show those UI elements when the transcripts and captions are unavailable. I am not sure where to find the legacy OCW config and files, though.
> Is it possible that this course (and the others) have a .srt resource but no .vtt resource?
@pdpinch I am not sure about the .srt, but we are certainly missing the .vtt for all the resources that have this issue.
@pdpinch @fakhar-ud-din Regarding SRT caption files for 3.091-2018, they do exist. For example, for lecture 1:
video: https://github.mit.edu/mitocwcontent/3.091-fall-2018/blob/main/content/resources/lecture-1.md
srt caption: https://github.mit.edu/mitocwcontent/3.091-fall-2018/blob/main/content/resources/ao41frjfgvq.md
transcript: https://github.mit.edu/mitocwcontent/3.091-fall-2018/blob/main/content/resources/ao41frjfgvq-1.md
The video is explicitly associated with the transcript but not the SRT. However, the SRT filenames appear to be a lowercased version of the youtube ID, which does exist in the video file metadata.
I believe all videos in 3.091 have SRT captions (there are 92 srt files and 82 video files in the course, so, probably...).
This is not necessarily true for other courses with missing captions, so maybe we have multiple root causes for missing captions. E.g., 5-80-small-molecule-spectroscopy-and-dynamics-fall-2008 appears to have no caption files, srt or vtt.
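The filename observation above (SRT filenames look like the lowercased YouTube ID) suggests a simple matching pass. The following is only a rough sketch under that assumption: the function name, the shape of `videos` (a mapping of video name to YouTube ID), and the example paths are all hypothetical and not taken from ocw-studio.

```python
from pathlib import PurePosixPath


def match_srt_by_youtube_id(videos, caption_paths):
    """Pair each video with an SRT file whose stem equals the lowercased YouTube ID.

    `videos` is a hypothetical mapping of video name -> YouTube ID.
    `caption_paths` is any iterable of caption file paths.
    """
    # Index only the .srt files by their lowercased filename stem.
    srt_by_stem = {
        PurePosixPath(path).stem.lower(): path
        for path in caption_paths
        if PurePosixPath(path).suffix.lower() == ".srt"
    }
    matched, unmatched = {}, []
    for name, youtube_id in videos.items():
        path = srt_by_stem.get(youtube_id.lower())
        if path is None:
            unmatched.append(name)  # no SRT found; needs another source
        else:
            matched[name] = path
    return matched, unmatched
```

Videos that land in `unmatched` would be the ones that need a different source for captions, which fits the point that other courses may have no SRT files at all.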
@fakhar-ud-din I found the above information via SQL queries on RC (which was recently synced with prod). Querying models in Django shell would also work.
As for what to do: I think "convert SRT -> VTT" would work fine for 3.091 2018. But again, there are more courses whose captions DO appear to exist on youtube, but do not have SRT files. A more general 3play approach (like sketched in https://github.com/mitodl/ocw-studio/issues/1518#issuecomment-1446392041) might be able to address the problem for more courses.
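For the "convert SRT -> VTT" option, the two formats are close enough that a minimal conversion is short. This is a sketch, not code from ocw-studio: it assumes well-formed SRT input, adds the `WEBVTT` header, and switches the millisecond separator in timestamp lines from a comma to a period. The function name `srt_to_vtt` is made up for illustration.

```python
import re

# SRT writes 00:00:01,000 while WebVTT writes 00:00:01.000
_TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT caption text to a minimal WebVTT document."""
    # Drop a leading BOM and normalize line endings.
    text = srt_text.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")
    lines = []
    for line in text.split("\n"):
        if "-->" in line:
            # Only rewrite timing lines so commas in caption text are untouched.
            line = _TIMESTAMP.sub(r"\1.\2", line)
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines).strip() + "\n"
```

The numeric cue counters from the SRT are left in place, since WebVTT allows an identifier line before each cue.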
We can check whether a transcript with a given YouTube ID exists on 3Play, using this function: https://github.com/mitodl/ocw-studio/blob/210b230d65ecc1cee601054e64cb6adc1cf131e2/videos/threeplay_api.py#L118
If it exists, we can download the captions/transcript and associate them with a given video by something like this function: https://github.com/mitodl/ocw-studio/blob/210b230d65ecc1cee601054e64cb6adc1cf131e2/videos/threeplay_api.py#L153
Assuming most/all legacy captions are on 3Play, this seems like the most straightforward solution.
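The lookup-then-download flow described above could be sketched roughly as follows. The two callables are hypothetical stand-ins for the linked functions in `videos/threeplay_api.py`; their real names and signatures are not reproduced here, so they are passed in as parameters.

```python
def sync_captions_from_3play(youtube_ids, find_transcript, download_captions):
    """Fetch captions for each YouTube ID, reporting the IDs that have none.

    Both callables are hypothetical stand-ins, not the real ocw-studio API:
      find_transcript(youtube_id)      -> transcript id, or None if not on 3Play
      download_captions(transcript_id) -> caption bytes, or None/empty if unavailable
    """
    fetched, missing = {}, []
    for youtube_id in youtube_ids:
        transcript_id = find_transcript(youtube_id)
        content = download_captions(transcript_id) if transcript_id is not None else None
        if content:
            fetched[youtube_id] = content
        else:
            # Either no transcript on 3Play or an empty download.
            missing.append(youtube_id)
    return fetched, missing
```

Keeping the `missing` list makes it easy to see which videos still need a different fix, for example the courses that have no captions anywhere.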
This is great, folks. So the captions exist, they are probably in our 3play account, and we have code for looking them up and fetching them.
How about we start with a management command that tries to fetch the .vtt caption file from 3play and add it to the video as a new resource, using the functions Paul mentioned?
Feel free to suggest alternatives.
Thank you for the suggestions, @pt2302 @ChristopherChudzicki. I believe I do not have the database dump locally, so I could not confirm using the Django shell.
I will be moving forward with what @pt2302 suggested and will work on a management command that can fetch and sync the available captions and transcripts.
I have been investigating the issue, and it seems we already have a Celery task in place that runs these methods when captions and transcripts are missing. The following code schedules a Celery task that attempts to fetch existing transcripts and captions and, if they are not found, uses the 3play API to get them. https://github.com/mitodl/ocw-studio/blob/c0f96b807ad78dbf4873bf46065337c80d47dd86/main/settings.py#L699
I think we might have to go with this:
> As for what to do: I think "convert SRT -> VTT" would work fine for 3.091 2018.
What do you think about this, @ChristopherChudzicki?
@fakhar-ud-din Could you please explain your proposal in more detail? Why is the task not finding the transcripts on 3Play? According to https://github.com/mitodl/ocw-studio/blob/b636e91ef2d2fbbaae43d4dd8c0cd3a0fa61b427/main/settings.py#L617 it should be running every 12 hours, and there does not appear to be an override for this value in production.
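For readers unfamiliar with how such a periodic task is usually wired up, here is a generic sketch of a Celery beat schedule entry in a Django settings module. The setting name, entry name, and task path are hypothetical and are not copied from `main/settings.py`; only the general shape (a `schedule` given in seconds, overridable from the environment) is illustrated.

```python
import os

# Hypothetical setting name; the default of 43200 seconds is 12 hours.
TRANSCRIPT_SYNC_SECONDS = int(os.environ.get("TRANSCRIPT_SYNC_SECONDS", 12 * 60 * 60))

CELERY_BEAT_SCHEDULE = {
    "sync-missing-transcripts": {  # hypothetical entry name
        "task": "videos.tasks.sync_missing_transcripts",  # hypothetical task path
        "schedule": TRANSCRIPT_SYNC_SECONDS,  # interval in seconds
    },
}
```

With a setup like this, an environment override in production would change the interval, which is why checking for one is a reasonable first step.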
It seems the Celery task is working fine.
> Assuming most/all legacy captions are on 3Play, this seems like the most straightforward solution.
I tried running the above code in the Heroku ocw-studio (production) shell and got empty results. Moreover, 3.091 fall 2018 seems to be missing when running the following queries:
Video.objects.filter(source_key__icontains="3-091")
Video.objects.filter(website__name="3-091-introduction-to-solid-state-chemistry-fall-2018")
I am still looking into it and will keep you updated.
I have found a potential solution for this.
I found the video-related content in WebsiteContent, and all the items missing captions and transcripts do not actually have a defined parent (Video object).
Based on the content, we will be creating Video objects for them, which will establish the missing link for all the legacy transferred videos. I am starting to confirm my findings and will proceed with the implementation shortly.
@fakhar-ud-din can you include some more details, ideally referencing objects in the django database?
@pt2302 wrote a management command for linking captions resources to video resources (PR https://github.com/mitodl/ocw-studio/pull/1670). Can that help here?
@pdpinch Logic similar to that management command would be useful here, but we would need to modify it to include a 3Play lookup to match video resources to captions/transcripts, which may or may not all be in Studio (and so may need to be imported from 3Play as well). That management command addressed the specific case where there is another course that has the captions/transcripts properly linked to the same YouTube videos.
Also, I was able to find captions for this particular course on 3Play.
Hey @pdpinch / @pt2302
I ssh'ed into the Heroku shell for ocw-studio and filtered for the current website (3-091-introduction-to-solid-state-chemistry-fall-2018), which had no Video object.
> If it exists, we can download the captions/transcript and associate them with a given video by something like this function:
The above will not work, since there is no Video object present to associate the fetched transcripts with. Moreover, when I requested the 3play transcripts using this course's youtube_ids, the majority of them had an empty data object.
I also queried the database for video_captions_file from the video resources of existing website's content. Only 122 video resources had video_captions_file.
My proposed solution is to make a management command that creates Video objects for a website, if not already present, and then updates the Video object's details (_webvtt_transcript_file, website, pdf_transcriptfile) with existing caption files. If we are unable to find a caption file for some video resource, we will incorporate a request to 3play and will update the video object details with received response.
Please let me know if I missed anything
Thank You
I think that creating the Video object as an intermediate step is not necessary; we could do the associations directly in the WebsiteContent objects for the videos, as in this PR: https://github.com/mitodl/ocw-studio/pull/1670.
Oh, then that simplifies it. Thank you. I will proceed with the solution and make a PR by EOD today.
> If we are unable to find a caption file for some video resource, we will incorporate a request to 3play and will update the video object details with received response.
How will you try to find the caption file for the video resource? If it is very complicated, then it might be easier to just make the request to 3Play.
I have a PR which addresses this issue https://github.com/mitodl/ocw-studio/pull/1717
> How will you try to find the caption file for the video resource? If it is very complicated, then it might be easier to just make the request to 3Play.
Hey @pdpinch, every WebsiteContent object has metadata and a resourcetype. For a video resource, we can tell whether caption or transcript files exist from the file paths set against "video_captions_file" and "video_transcript_file" respectively. We query all the objects missing either of those and request the content using the 3play API.
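The selection rule described here can be sketched on plain dictionaries. This is only an illustration: the flat metadata layout and the `"Video"` resourcetype value are assumptions, and the real command in the PR works on WebsiteContent objects through the Django ORM rather than on dicts.

```python
def needs_3play_lookup(resource: dict) -> bool:
    """Return True for a video resource missing a captions or transcript path.

    Assumes (hypothetically) that metadata is a flat dict with a "resourcetype"
    key and the two file-path keys named in the comment above.
    """
    metadata = resource.get("metadata") or {}
    if metadata.get("resourcetype") != "Video":
        return False  # only video resources are candidates
    # An absent key, None, or an empty string all count as "missing".
    return not metadata.get("video_captions_file") or not metadata.get("video_transcript_file")


def resources_missing_captions(resources):
    """Filter a list of resource dicts down to those needing a 3Play request."""
    return [resource for resource in resources if needs_3play_lookup(resource)]
```

Each resource returned by `resources_missing_captions` would then get one request to 3Play keyed by its YouTube ID.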
I don't think the management command has been run yet on this course, 3.091 Fall 2018, so I'm reopening this issue until it has been fixed up.
@pdpinch, I tried running the command in production, and it ran successfully. It created new content objects for the caption files that were fetched. Although it ran successfully, I think we are facing a similar issue (files not uploading to S3) to the one being discussed in the Slack channel.
Steps to Reproduce
See, for example, https://ocw.mit.edu/courses/3-091-introduction-to-solid-state-chemistry-fall-2018/resources/lecture-1/
or https://ocw.mit.edu/courses/3-091-introduction-to-solid-state-chemistry-fall-2018/resources/carbon-dioxide-concentration-lec28/
Expected Behavior
Actual Behavior
Screenshot or Screencast