learningequality / ka-lite-content-packs

BSD 2-Clause "Simplified" License

Removing duplicate youtube ids creates empty subtree structures #59

Open ralphiee22 opened 7 years ago

ralphiee22 commented 7 years ago

There are several places in the code where we remove duplicate YouTube IDs from node_data. This may not be the behavior we want, since the same video can appear under different topics. For example, a calculus video might sit under the math subtopics, and that same video could also sit under the physics subtopics. The same could apply to exercise and topic nodes, where we also remove duplicate IDs. I have noticed a few places where we do this; I am not sure whether there are more:
https://github.com/learningequality/ka-lite-content-packs/blob/master/contentpacks/utils.py#L710
https://github.com/learningequality/ka-lite-content-packs/blob/master/contentpacks/khanacademy.py#L579
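
To illustrate the concern, here is a minimal sketch of how tree-wide removal of duplicate YouTube IDs can leave a topic empty. The tree shape, the field names (`children`, `youtube_id`), and both helper functions are assumptions made for illustration; this is not the code at the linked lines.

```python
from copy import deepcopy

# Hypothetical miniature topic tree; the real node_data is larger and richer.
TREE = {
    "id": "root", "kind": "Topic", "children": [
        {"id": "math", "kind": "Topic", "children": [
            {"id": "v1", "kind": "Video", "youtube_id": "abc123"},
        ]},
        {"id": "physics", "kind": "Topic", "children": [
            {"id": "v1", "kind": "Video", "youtube_id": "abc123"},
        ]},
    ],
}


def dedupe_globally(node, seen=None):
    """Drop every video whose youtube_id was already seen anywhere in the tree."""
    seen = set() if seen is None else seen
    kept = []
    for child in node.get("children", []):
        if child["kind"] == "Video":
            if child["youtube_id"] in seen:
                continue  # second occurrence is discarded, whatever its parent
            seen.add(child["youtube_id"])
        else:
            dedupe_globally(child, seen)
        kept.append(child)
    node["children"] = kept
    return node


def empty_topics(node):
    """Return the ids of topics that have no children left."""
    result = []
    for child in node.get("children", []):
        if child["kind"] == "Topic":
            if not child["children"]:
                result.append(child["id"])
            result.extend(empty_topics(child))
    return result


pruned = dedupe_globally(deepcopy(TREE))
print(empty_topics(pruned))  # ['physics']
```

In this toy tree the physics topic ends up empty because its only video was already seen under math.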

mrpau-richard commented 7 years ago

Hi @ralphiee22, I have example data with a duplicated video ID, x52405acc, which is also referenced by a topic and an exercise. I think the duplicate video IDs arise because we have two sources: the Khan Academy API and the dubbed video mappings. If we don't remove the duplicate video IDs, there will be an issue when downloading the video using the KA Lite topic tree.

videos [
  {
      "readableId": "strong-acids-and-strong-bases",
      "kind": "Video",
      "translatedYoutubeLang": "en",
      "description": "Common strong acids and strong bases. Examples of calculating the pH of a nitric acid solution, sodium hydroxide solution, and a calcium hydroxide solution.",
      "title": "Strong acids and strong bases",
      "relativeUrl": "/video/strong-acids-and-strong-bases",
      "imageUrl": "https://cdn.kastatic.org/googleusercontent/O_wnP05v7400z9ctd3Kc3ZgfmoSmV4CbrcV3AGqxZ_D81jw9RKgYWsFhjO0ASCSHXiXcrKnCAuLOCuLI9R9XSURH",
      "downloadSize": 20319242,
      "licenseName": "CC BY-NC-SA (KA default)",
      "slug": "strong-acids-and-strong-bases",
      "duration": 752,
      "sha": "560ba2cb54c05f8bcfbd05a17bb0e27be9e8e58b",
      "youtubeId": "gsu4gjrFApA",
      "keywords": "",
      "id": "x52405acc"
  },
  {
      "readableId": "strong-acids-and-strong-bases",
      "kind": "Video",
      "translatedYoutubeLang": "en",
      "description": "Common strong acids and strong bases. Examples of calculating the pH of a nitric acid solution, sodium hydroxide solution, and a calcium hydroxide solution.",
      "title": "Strong acids and strong bases",
      "relativeUrl": "/video/strong-acids-and-strong-bases",
      "imageUrl": "https://cdn.kastatic.org/googleusercontent/O_wnP05v7400z9ctd3Kc3ZgfmoSmV4CbrcV3AGqxZ_D81jw9RKgYWsFhjO0ASCSHXiXcrKnCAuLOCuLI9R9XSURH",
      "downloadSize": 20319242,
      "licenseName": "CC BY-NC-SA (KA default)",
      "slug": "strong-acids-and-strong-bases",
      "duration": 752,
      "sha": "560ba2cb54c05f8bcfbd05a17bb0e27be9e8e58b",
      "youtubeId": "gsu4gjrFApA",
      "keywords": "",
      "id": "x52405acc"
  }
]

Exercise [
  {
        "kind": "Exercise",
        "displayName": "Identifying weak bases and strong bases",
        "name": "identifying-weak-bases-and-strong-bases",
        "title": "Identifying weak bases and strong bases",
        "prerequisites": [

        ],
        "allAssessmentItems": [
          {
            "sha": "b8ad0121de2fbcbb26753dc9d791282fd3a6504c",
            "live": true,
            "id": "x1e099679bca01b04"
          },
          {
            "sha": "0572bfccd4969b6435b374e9ae3ea6e4501d0f58",
            "live": true,
            "id": "x06c9774ebb4ecdc9"
          },
          {
            "sha": "745d6665526e1a60806a59422b9ecb17fbb9a53c",
            "live": true,
            "id": "x5f4b513a8bdf14fc"
          },
          {
            "sha": "ab16e563321e2c360d8cb9629b0b98fffcbb9551",
            "live": true,
            "id": "x4b6203a80dc44d77"
          },
          {
            "sha": "9070840d411d3e841683134ed81a07fb86cf87a4",
            "live": true,
            "id": "x4cf3d34751c8dc8b"
          }
        ],
        "curatedRelatedVideos": [
          "x52405acc"
        ],
        "id": "x2af1241b",
        "usesAssessmentItems": true,
        "fileName": null,
        "slug": "identifying-weak-bases-and-strong-bases",
        "description": "Practice identifying weak bases and strong bases"
  }
]

Topic [
  {
    "kind": "Topic",
    "doNotPublish": false,
    "hide": false,
    "description": "What makes a compound acidic or basic? We will learn about the different definitions for acids and bases, and how we measure how acidic or basic a substance is. We will be putting our chemical equilibrium knowledge to good use when we look at the reactivity of weak acids and bases. ",
    "title": "Acids, bases, and pH",
    "deleted": false,
    "slug": "acids-and-bases",
    "childData": [
      {
        "kind": "Article",
        "id": "xcc04d7f8"
      },
      {
        "kind": "Video",
        "id": "373050"
      },
      {
        "kind": "Video",
        "id": "x52405acc"
      },
      {
        "kind": "Video",
        "id": "x13d9aabd"
      },
      {
        "kind": "Exercise",
        "id": "xec371455"
      },
      {
        "kind": "Exercise",
        "id": "x2af1241b"
      }
    ],
    "id": "xb79ef290"
  }
]

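
One way to reconcile the two concerns might be to collapse identical video records by `id` while leaving the `childData` references in topics untouched. The sketch below assumes that approach; `dedupe_records` is a hypothetical helper, the records are trimmed-down stand-ins for the data above, and the source attributions in the comments are guesses.

```python
def dedupe_records(records):
    """Keep the first record seen for each "id", preserving order."""
    unique = {}
    for record in records:
        unique.setdefault(record["id"], record)
    return list(unique.values())


# Trimmed-down stand-ins for the data above (most fields omitted).
videos = [
    {"id": "x52405acc", "youtubeId": "gsu4gjrFApA"},  # e.g. from the Khan Academy API
    {"id": "x52405acc", "youtubeId": "gsu4gjrFApA"},  # e.g. from the dubbed video mappings
]
topic = {
    "id": "xb79ef290",
    "childData": [
        {"kind": "Video", "id": "x52405acc"},
        {"kind": "Exercise", "id": "x2af1241b"},
    ],
}

videos = dedupe_records(videos)
print(len(videos))         # 1
print(topic["childData"])  # unchanged: the topic still references the video
```

Only the list of video records shrinks here; which topics reference the video is a separate question that this step does not touch.
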
jamalex commented 7 years ago

I've looked into this a bit. It seems the primary problem is not in the duplicate ID removal code, but in the way we use the /v2/topictree endpoint, which only returns one instance of each video, when in fact many videos occur in multiple places in the topic tree. For example, the following videos are all identical (and have the same YouTube ID) but show up in 3 different topics:

Internally, Khan Academy just stores this as a single node. The problem is that the /v2/topictree data doesn't tell us all the different parent topics a content node has. We extract the parent topic based on "path", but that's probably just one of possibly multiple paths the node occurs under.

We'll definitely want to fix this, at least for Kolibri, but to do so we'll need to figure out some other way of querying the full set of topics a content node belongs to.
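
One possible direction, sketched under a loud assumption: if each topic's `childData` list (as in the Topic example earlier in the thread) turns out to be complete, the full set of parents could be recovered by inverting those lists rather than relying on "path". The thread does not confirm that `childData` is complete in the API response, and `parents_by_child` is a hypothetical helper.

```python
from collections import defaultdict


def parents_by_child(topics):
    """Map each child id to the ids of every topic that lists it in childData."""
    parents = defaultdict(list)
    for topic in topics:
        for child in topic.get("childData", []):
            parents[child["id"]].append(topic["id"])
    return dict(parents)


# Hypothetical topics; only "xb79ef290" and "x52405acc" come from the thread.
topics = [
    {"id": "xb79ef290", "childData": [{"kind": "Video", "id": "x52405acc"}]},
    {"id": "topic-b", "childData": [{"kind": "Video", "id": "x52405acc"}]},
]

index = parents_by_child(topics)
print(index["x52405acc"])  # ['xb79ef290', 'topic-b']
```

With such an index, a video could be attached under every topic that lists it instead of only under the one implied by "path".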