hummingbird-me / kitsu-tools

The tools we use to build Kitsu, the coolest platform for anime and manga
https://kitsu.app
Apache License 2.0

Mappings RFC #709

Open NuckChorris opened 8 years ago

NuckChorris commented 8 years ago

Okay, so I've long been thinking about how best to model mappings between Hummingbird's IDs and external websites' IDs. The following is pretty much a brain-dump of what I've reasoned out so far, because I want to get it in text and in front of other people.

Background

A lot of clients need or want to map data between MyAnimeList and Hummingbird. For the most part, our database is lined up with MAL's, but this has a number of downsides, not least that we can't really control our own database. What if we wanted to merge/split shows differently from MAL?

The goal of this RFC is to decouple our database from external influences, but still make it easy to remap data from their DB back onto ours (or vice versa). For most cases, an automatic direct mapping by matching title will suffice, but in some cases we need to be explicit because our titles differ, especially for sites without MAL heritage (older listing sites as well as streaming sites).

Mapping Schema

Where external_units or internal_units is nil, it should be treated as representing the entire media's length.

We would want to validate uniqueness of site+namespace+id+units, meaning that unit ranges for the same site+namespace+id never overlap. This can be achieved in Postgres with an exclusion constraint backed by a GiST index, using the btree_gist extension. We would also want to make certain that external_units.length == internal_units.length, otherwise the mappings would get really zany.
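As a rough illustration of those two checks at the application level, here is a minimal plain-Ruby sketch. `MappingRow`, `ranges_overlap?` and `mapping_errors` are hypothetical names invented for this example, the rows live in memory rather than in Postgres, and nil unit ranges are not handled.

```ruby
# Hypothetical in-memory stand-in for a row of the mappings table.
MappingRow = Struct.new(:site, :namespace, :external_id,
                        :external_units, :internal_units, keyword_init: true)

# True when two inclusive integer ranges share at least one unit.
def ranges_overlap?(a, b)
  a.first <= b.last && b.first <= a.last
end

# Returns a list of error strings for `candidate`, checked against `existing` rows.
def mapping_errors(candidate, existing)
  errors = []
  if candidate.external_units.size != candidate.internal_units.size
    errors << 'external_units and internal_units must have the same length'
  end
  clash = existing.any? do |row|
    row.site == candidate.site &&
      row.namespace == candidate.namespace &&
      row.external_id == candidate.external_id &&
      ranges_overlap?(row.external_units, candidate.external_units)
  end
  errors << 'external_units overlap an existing mapping' if clash
  errors
end

existing = [
  MappingRow.new(site: 'myanimelist', namespace: 'anime', external_id: '2205',
                 external_units: 1..1, internal_units: 1..1)
]
candidate = MappingRow.new(site: 'myanimelist', namespace: 'anime', external_id: '2205',
                           external_units: 1..3, internal_units: 1..2)
p mapping_errors(candidate, existing) # both checks fail for this candidate
```

A database-level constraint would still be needed to rule out races between concurrent writers; this only shows the logic being enforced.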

An easy way to understand this is with the following:

(internal_type, internal_id, internal_units) <=> (external_site, external_id, external_units)

That is, the Mappings table represents a bidirectional mapping of two ranges.

Model Interface

Mapping.lookup(site: 'myanimelist/manga', id: '12345', units: 1..27)
# => [{:media => #<Manga ...>, :units => 1..6}, {:media => #<Manga ...>, :units => 7..12}, ...]
Mapping.lookup(site: 'myanimelist/manga', id: '54321', units: 1..12)
# => [{:media => #<Manga ...>, :units => 13..24}]
Mapping.reverse_lookup(site: 'myanimelist/manga', media: #<Manga ...>, units: 13..24)
# => [{:site => 'myanimelist/manga', :id => '54321', :units => 1..12}]

Basically, you pass in the foreign keys and a range (which will often be 1..something), and it returns an array of (media, units) tuples representing what was watched and how much of it was watched.

Examples

To understand how this would work, an example is helpful. For starters, let's imagine that we decided to merge Durarara!! x2's multiple series into one multi-season series. The mappings would look something like this:

For brevity, we omit {external_site: 'myanimelist', external_namespace: 'anime'}

[
  {external_id: '3412', external_units: Infinity..Infinity, media: #<Anime 'Durarara!! x2'>, target_units: 1..12},
  {external_id: '4412', external_units: Infinity..Infinity, media: #<Anime 'Durarara!! x2'>, target_units: 13..24},
  {external_id: '5412', external_units: Infinity..Infinity, media: #<Anime 'Durarara!! x2'>, target_units: 25..36}
]

Now let's imagine that we decided to split something. I'm gonna go with 5cm/s for this example:

[
  {external_id: '2205', external_units: 1..1, media: #<Anime '5cm/s: Part 1'>, target_units: 1..1},
  {external_id: '2205', external_units: 2..2, media: #<Anime '5cm/s: Part 2'>, target_units: 1..1},
  {external_id: '2205', external_units: 3..3, media: #<Anime '5cm/s: Part 3'>, target_units: 1..1}
]

Mapping.lookup Algorithm

Mapping.lookup would take an external range and turn it into an array of N internal ranges. This sounds difficult, but is really quite simple. First, intersect each matching mapping's external range with the requested range. Then, map each intersection onto that mapping's internal range.

class Mapping
  # NOTE: Range#& (intersection) and Range#remap are not part of Ruby core;
  # they are assumed helper methods that would have to be defined elsewhere.
  def self.lookup(site:, id:, units:, namespace: nil)
    # Get all mappings which overlap with the external key on the listed units
    rows = where(external_site: site, external_namespace: namespace, external_id: id)
          .where('external_units && ?', [units]).order('lower(external_units) ASC').all
    # Intersect them and turn them into a more manageable hash format
    mappings = rows.map do |row|
      {media: row.media, units: row.external_units & units, mapping: row}
    end
    # Map each intersection onto the internal range of its mapping
    mappings.map do |m|
      {media: m[:media], units: m[:units].remap(m[:mapping].external_units => m[:mapping].internal_units)}
    end
  end
end
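
The intersect-then-remap steps can be tried out without a database. The following is only a sketch under stated assumptions: `Row`, `intersect`, `remap` and `lookup` are hypothetical stand-ins written for this example, rows are plain structs, media are plain strings, and all ranges are inclusive integer ranges.

```ruby
# Hypothetical in-memory row; the real thing would be a database record.
Row = Struct.new(:external_id, :external_units, :media, :internal_units,
                 keyword_init: true)

# Intersection of two inclusive integer ranges, or nil when they do not overlap.
def intersect(a, b)
  lo = [a.first, b.first].max
  hi = [a.last, b.last].min
  lo <= hi ? (lo..hi) : nil
end

# Shift `units` (a sub-range of `from`) by the offset between `from` and `to`.
def remap(units, from, to)
  offset = to.first - from.first
  (units.first + offset)..(units.last + offset)
end

# Turn an external range into an array of {media, units} hashes.
def lookup(rows, id:, units:)
  rows.select { |r| r.external_id == id }
      .sort_by { |r| r.external_units.first }
      .map do |r|
        hit = intersect(r.external_units, units)
        next unless hit
        { media: r.media, units: remap(hit, r.external_units, r.internal_units) }
      end.compact
end

# The 5cm/s split from the examples above.
rows = [
  Row.new(external_id: '2205', external_units: 1..1, media: '5cm/s: Part 1', internal_units: 1..1),
  Row.new(external_id: '2205', external_units: 2..2, media: '5cm/s: Part 2', internal_units: 1..1),
  Row.new(external_id: '2205', external_units: 3..3, media: '5cm/s: Part 3', internal_units: 1..1)
]
p lookup(rows, id: '2205', units: 2..3) # Part 2 and Part 3, each with units 1..1
```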

Mapping.reverse_lookup would do the same but in the other direction. The easiest way to reason about this is that it's a direct 1:1 mapping between sets of episodes/chapters on other sites and Hummingbird.
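
To illustrate the reverse direction, here is a small sketch using the merged Durarara!! x2 example. `Row`, `intersect` and `reverse_lookup` are hypothetical helpers for this example only, and the external ranges are assumed to be 1..12 per external entry so that the arithmetic is concrete.

```ruby
# Hypothetical in-memory row; the real thing would be a database record.
Row = Struct.new(:external_id, :external_units, :media, :internal_units,
                 keyword_init: true)

# Intersection of two inclusive integer ranges, or nil when they do not overlap.
def intersect(a, b)
  lo = [a.first, b.first].max
  hi = [a.last, b.last].min
  lo <= hi ? (lo..hi) : nil
end

# Turn an internal range into an array of {id, units} hashes on the external site.
def reverse_lookup(rows, media:, units:)
  rows.select { |r| r.media == media }
      .sort_by { |r| r.internal_units.first }
      .map do |r|
        hit = intersect(r.internal_units, units)
        next unless hit
        offset = r.external_units.first - r.internal_units.first
        { id: r.external_id, units: (hit.first + offset)..(hit.last + offset) }
      end.compact
end

# Assumed: each external entry covers 12 episodes.
rows = [
  Row.new(external_id: '3412', external_units: 1..12, media: 'Durarara!! x2', internal_units: 1..12),
  Row.new(external_id: '4412', external_units: 1..12, media: 'Durarara!! x2', internal_units: 13..24),
  Row.new(external_id: '5412', external_units: 1..12, media: 'Durarara!! x2', internal_units: 25..36)
]
# Internal episodes 10..14 straddle the first two external entries.
p reverse_lookup(rows, media: 'Durarara!! x2', units: 10..14)
```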

On conversion to contiguous libraries

We can't handle watching non-contiguous ranges of episodes in libraries! That's a problem for MAL imports, isn't it? Luckily we can just do ranges.map { |r| r.size }.reduce(:+) and magically we know how many episodes they've seen.
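
As a quick check of that arithmetic, a minimal sketch follows. `watched_count` is a hypothetical helper name and the ranges are made-up values, assumed to be non-overlapping inclusive integer ranges.

```ruby
# Sum the sizes of non-overlapping inclusive integer ranges.
def watched_count(ranges)
  ranges.sum(&:size)
end

# Made-up internal ranges, as a lookup might return them.
ranges = [1..6, 9..12, 20..20]
puts watched_count(ranges) # 6 + 4 + 1 units
```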

To convert back from a contiguous library entry to a noncontiguous set of episodes, we have to iterate through each mapping.
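
One way that iteration could look, as a sketch only: `Row` and `expand_progress` are hypothetical names, the library entry is reduced to a single `progress` count, and the external ranges are assumed to be 1..12 per entry.

```ruby
# Hypothetical in-memory row; the real thing would be a database record.
Row = Struct.new(:external_id, :external_units, :media, :internal_units,
                 keyword_init: true)

# Expand "watched the first `progress` units of `media`" into external ranges
# by walking the mappings in internal order.
def expand_progress(rows, media:, progress:)
  rows.select { |r| r.media == media }
      .sort_by { |r| r.internal_units.first }
      .each_with_object([]) do |r, out|
        # Skip mappings that start beyond what has been watched.
        next if r.internal_units.first > progress
        last_internal = [r.internal_units.last, progress].min
        watched = last_internal - r.internal_units.first + 1
        start = r.external_units.first
        out << { id: r.external_id, units: start..(start + watched - 1) }
      end
end

rows = [
  Row.new(external_id: '3412', external_units: 1..12, media: 'Durarara!! x2', internal_units: 1..12),
  Row.new(external_id: '4412', external_units: 1..12, media: 'Durarara!! x2', internal_units: 13..24),
  Row.new(external_id: '5412', external_units: 1..12, media: 'Durarara!! x2', internal_units: 25..36)
]
# 15 episodes watched internally: all of the first entry, 3 of the second.
p expand_progress(rows, media: 'Durarara!! x2', progress: 15)
```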


All of this is a longer-term thought, and in the short term I'll probably create a simpler Mappings table with only direct 1:1 mapping, just so we can remove myanimelist_id, ann_id, tvdb_series_id, and tvdb_season_id from our anime table.