Okay, so I've long been thinking of how to best model mappings between Hummingbird and external website's IDs. The following is pretty much a brain-dump of what I've reasoned out so far, because I want to get it in text and in front of other people.
Background
A lot of clients need or want to map data between MyAnimeList and Hummingbird. For the most part, our database is lined up with MAL's, but this has a number of downsides, not least of which being that we really can't control our own database. What if we wanted to merge/split shows differently from MAL?
The goal of this RFC is to decouple our database from external influences, but still make it easy to remap data from their DB back onto ours (or vice versa). For most cases, an automatic direct mapping by matching title will suffice, but in some cases we need to be explicit because our titles differ, especially for sites without MAL heritage (older listing sites as well as streaming sites).
Mapping Schema
external_site (String, required) "myanimelist/manga" The identifier for the external site. In sites which have namespaces for their IDs, this should contain the namespace as well.
external_id (String, required) "3412" The external identifier or key.
external_units (Range, optional) 1..6 The episodes/chapters on the external site which we wish to explicitly map.
media (Media, required) "anime", 2348 - a polymorphic association with Media to map to
internal_units (Range, optional) 7..12 The episodes/chapters on Hummingbird which we wish to explicitly map to
Where external_units or internal_units is nil, it should be treated as representing the entire media's length.
We would want to validate uniqueness of site+namespace+id+units, where units don't overlap. This can be achieved in Postgres with a GIST index using the btree_gist extension. We would also want to make certain that external_units.length == internal_units.length, otherwise the mappings would get really zany.
An easy way to understand this is with the following:
Basically, you pass in the foreign keys and a range (which will often be 1..something), and it returns an array of media, units tuples representing what was watched and how much of it was watched.
Examples
To understand how this would work, an example is helpful. For starters, let's imagine that we decided to merge Durarara!! x2's multiple series into one multi-season series. The mappings would look something like this:
For brevity, we omit {external_site: 'myanimelist', external_namespace: 'anime'}
Mappings.lookup would take an external range and turn it into an array of N internal ranges. This sounds difficult, but is really quite simple. First, intersect all the ranges. Then, map them onto the target range.
class Mappings
def self.lookup(site:, id:, units:, namespace: nil)
# Get all mappings which overlap with the external key on the listed units
rows = where(external_site: site, external_namespace: namespace, external_id: id)
.where('external_range && ?', [units]).order('lower(external_range) ASC').all
# Intersect them and turn them into a more manageable hash format
mappings = rows.map do |row|
{media: row.media, units: row.external_range & units, mapping: row}
end
# Map to the target range
mappings.map do |m|
{media: m[:media], units: m[:units].remap(m[:mapping].external_range => m[:mapping].internal_range)}
end
end
end
Mapping.reverse_lookup would do the same but in the other direction. The easiest way to reason about this is that it's a direct 1:1 mapping between sets of episodes/chapters on other sites and Hummingbird.
On conversion to contiguous libraries
We can't handle watching non-contiguous ranges of episodes in libraries! That's a problem for MAL imports, isn't it? Luckily we can just do ranges.map { |r| r.length }.reduce(:+) and magically we know how many they've seen.
To convert back from a contiguous library entry to a noncontiguous set of episodes, we have to iterate through each mapping.
All of this is a longer-term thought, and in the short term I'll probably create a simpler Mappings table with only direct 1:1 mapping, just so we can remove myanimelist_id, ann_id, tvdb_series_id, and tvdb_season_id from our anime table.
Okay, so I've long been thinking of how to best model mappings between Hummingbird and external website's IDs. The following is pretty much a brain-dump of what I've reasoned out so far, because I want to get it in text and in front of other people.
Background
A lot of clients need or want to map data between MyAnimeList and Hummingbird. For the most part, our database is lined up with MAL's, but this has a number of downsides, not least of which being that we really can't control our own database. What if we wanted to merge/split shows differently from MAL?
The goal of this RFC is to decouple our database from external influences, but still make it easy to remap data from their DB back onto ours (or vice versa). For most cases, an automatic direct mapping by matching title will suffice, but in some cases we need to be explicit because our titles differ, especially for sites without MAL heritage (older listing sites as well as streaming sites).
Mapping
Schema"myanimelist/manga"
The identifier for the external site. In sites which have namespaces for their IDs, this should contain the namespace as well."3412"
The external identifier or key.1..6
The episodes/chapters on the external site which we wish to explicitly map."anime", 2348
- a polymorphic association with Media to map to7..12
The episodes/chapters on Hummingbird which we wish to explicitly map toWhere external_units or internal_units is nil, it should be treated as representing the entire media's length.
We would want to validate uniqueness of site+namespace+id+units, where units don't overlap. This can be achieved in Postgres with a GIST index using the
btree_gist
extension. We would also want to make certain thatexternal_units.length == internal_units.length
, otherwise the mappings would get really zany.An easy way to understand this is with the following:
(internal_type, internal_id, internal_units) <=> (external_site, external_id, external_units)
That is, the Mappings table represents a bidirectional mapping of two ranges.
Model Interface
Basically, you pass in the foreign keys and a range (which will often be
1..something
), and it returns an array ofmedia, units
tuples representing what was watched and how much of it was watched.Examples
To understand how this would work, an example is helpful. For starters, let's imagine that we decided to merge Durarara!! x2's multiple series into one multi-season series. The mappings would look something like this:
For brevity, we omit
{external_site: 'myanimelist', external_namespace: 'anime'}
Now let's imagine that we decided to split something. I'm gonna go with 5cm/s for this example:
Mapping.lookup
AlgorithmMappings.lookup
would take an external range and turn it into an array of N internal ranges. This sounds difficult, but is really quite simple. First, intersect all the ranges. Then, map them onto the target range.Mapping.reverse_lookup
would do the same but in the other direction. The easiest way to reason about this is that it's a direct 1:1 mapping between sets of episodes/chapters on other sites and Hummingbird.On conversion to contiguous libraries
We can't handle watching non-contiguous ranges of episodes in libraries! That's a problem for MAL imports, isn't it? Luckily we can just do
ranges.map { |r| r.length }.reduce(:+)
and magically we know how many they've seen.To convert back from a contiguous library entry to a noncontiguous set of episodes, we have to iterate through each mapping.
All of this is a longer-term thought, and in the short term I'll probably create a simpler Mappings table with only direct 1:1 mapping, just so we can remove
myanimelist_id
,ann_id
,tvdb_series_id
, andtvdb_season_id
from our anime table.