learningequality / ka-lite

KA Lite: lightweight web server for serving core Khan Academy content (videos and exercises) without needing internet connectivity
https://learningequality.org/ka-lite/
Other
458 stars 305 forks source link

New content pack causes issues w/ displayed figures in Video Manager #5561

Open benjaoming opened 6 years ago

benjaoming commented 6 years ago

Summary

It seems like the duplication of video nodes is somehow causing a glitch. I can hardly believe that 11,406 videos have been removed and only 394 added in the video set.

image

System information

Traceback or relevant snippet from server.log

n/a

benjaoming commented 6 years ago

A quick verification...

I have 9855 videos on my drive from the previous content pack.

Videos that are marked as available by videoscan:

SELECT COUNT(*) FROM (SELECT COUNT(youtube_id) from item WHERE available=1 GROUP by youtube_id)
> 8879

Videos that are marked as unavailable by videoscan:

SELECT COUNT(*) FROM (SELECT COUNT(youtube_id) from item WHERE available=1 GROUP by youtube_id)
> 450

This suggests that neither the function that summarizes missing videos, nor the one that summarizes how many have been downloaded take uniqueness of the youtube_id into account.

image

benjaoming commented 6 years ago

Hi @mrpau-eugene -- would you be able to look at this? @rtibbles will help out with any questions you might have :)

mrpau-eugene commented 6 years ago

@benjaoming sure! will take a look into this one and would find a fix for it 👍

mrpau-eugene commented 6 years ago

@benjaoming seems like it doesn't account the uniqueness of each youtube_id https://github.com/learningequality/ka-lite/blob/master/kalite/updates/static/js/updates/bundle_modules/update_videos.js#L160

Should this logic be changed given that we now have duplicated youtube_ids?

There is also a problem with staging.learningequality.org coz I can't download videos. I'm always getting a 404.

benjaoming commented 6 years ago

There is also a problem with staging.learningequality.org coz I can't download videos. I'm always getting a 404.

This server is used because of using kalite.project.settings.dev, you can reconfigure it to use the real server by changing the values locally, put this in ~/.kalite/settings.py:

from kalite.project.settings.dev import *

SECURESYNC_PROTOCOL = "http"
CENTRAL_SERVER_HOST = "kalite.%s" % CENTRAL_SERVER_DOMAIN
CENTRAL_SERVER_URL = "%s://%s" % (SECURESYNC_PROTOCOL, CENTRAL_SERVER_HOST)

...and then invoke kalite without setting --settings.

mrpau-eugene commented 6 years ago

@benjaoming I just found out that this doesn't seem to be the issue with the new content packs since I also experience a glitch using the old content packs.

What I did was:

  1. Download Class 5 (India)
  2. Check the Math checkbox to select all videos inside Math

screen shot 2018-01-22 at 3 27 11 pm

But if you uncheck at least one of the videos, it will have a different value

screen shot 2018-01-22 at 3 31 59 pm

It seems to be a problem on how the remote_size of the Math topic is computed in the database. I have a guess that it has something to do when building the content packs.

screen shot 2018-01-22 at 3 35 57 pm

In the screenshot above, the remote size is just around 2.5GB which is equivalent to the size when we just select the Math topic checkbox

benjaoming commented 6 years ago

@mrpau-eugene yes, the sizes are wrong, too :) And they have been so for a while, I'm afraid that it has gone under the radar -- I can't find an issue for it.

What worries me here is that the file count is wrong, too -- which AFAIK is a new issue. And very problematic, too, because a user can select to delete videos in Topic A and then they're deleted in Topic B as well.

mrpau-eugene commented 6 years ago

@benjaoming I think I seem to have found a fix regarding the sizes of the video. The only problem so far is the file count.

If we have this kind of scenario:

Math (12 total_files)
  Early Math
    ... 5 Videos (3 unique videos)
  Algebra I
    ... 3 Videos (1 unique video)
  Algebra II
    ... 4 Videos (2 unique videos)

Should the 12 total_files stay as is, or should it be 6 total files because we only have 6 unique videos?

Regarding video deletion, I have a suggestion..

  1. Check the videos by its path instead of its youtube_id
  2. If the video is deleted, mark that video with that path as available = False
  3. Check that youtube_id if other topics are still using it. If no topic uses it, we should completely delete it, else do nothing.

This might work, however when a user performs another videoscan I think videos using those youtube_ids will be marked as available = True again so I dont think this fixes the problem at all (?)

Do you have other suggestions?

benjaoming commented 6 years ago

Should the 12 total_files stay as is, or should it be 6 total files because we only have 6 unique videos?

It should be 6, otherwise the reported number of videos to download/delete will be wrong. This is because the figure is used to show information about video download counts -- not subjects or topic names.

Regarding video deletion, I have a suggestion..

I think your suggestion is a good one!

This might work, however when a user performs another videoscan I think videos using those youtube_ids will be marked as available = True again so I dont think this fixes the problem at all (?)

I can see that it might cause some unwanted scenario: User wants to delete content A (containing video A) such that it becomes hidden in the topic tree. I don't think users are likely to sort things manually like this because: 1) You cannot delete exercises anyways, so there's lots of "read-only" content in the topic-tree and 2) There are so many videos everywhere, users are unlikely to use the Delete function for a very careful manual video curation.

I think we can also see it like this:

User navigates to Video Management in order to Delete videos. The most likely reason for this, is that a specific video is unwanted, either because of disk space or because students aren't supposed to view it. In both cases, the video should be entirely removed, so all occurrences of youtube_id should have available=False and the file removed.

This would make things easier, right? :)

mrpau-eugene commented 6 years ago

It should be 6, otherwise the reported number of videos to download/delete will be completely wrong

I'm still wondering how can this be achieved because we are only using the data from the API which has the total_files for each topic tree. I cant think of any way to fix this as of now.

I think we can also see it like this:

User navigates to Video Management in order to Delete videos. The most likely reason for this, is that a specific video is unwanted, either because of disk space or because students aren't supposed to view it. In both cases, the video should be entirely removed, so all occurrences of youtube_id should have available=False and the file removed.

This would make things easier, right? :)

I think that makes things easier. So are we gonna leave the deletion the way it works right now?

benjaoming commented 6 years ago

So are we gonna leave the deletion the way it works right now?

I think right now it only marks the contents marked for deletion as available=False so it needs an additional lookup based on the youtube_id to mark other occurrences unavailable..?

mrpau-eugene commented 6 years ago

@benjaoming when I tested awhile ago, after deleting a topic and refreshed the page, other occurences of that youtube_id were also deleted. I think I saw it also in the codebas where deleting a video is based on its youtube_id

mrpau-eugene commented 6 years ago

@benjaoming

I don't think using a raw SQL fixes the wrong file count..

SELECT COUNT(*) FROM (SELECT COUNT(youtube_id) from item WHERE available=1 GROUP by youtube_id)

Let's say we have this kind of scenario:

We won't have any problems if a user just ticks the Math Total: 50GB (assumption)

This seems to be easy since we just have to do a query like the one you posted.

But when a user only excludes one topic like this:

I don't know what would be the total, but won't we also have a weird file count? I also think this might have a complicated query too.

Though we can, let's say, query the total_files first and then just subtract the unchecked afterwards. But this has some drawbacks since we won't know whether those unchecked topics contains those duplicate youtube videos.