pr0pz / scene-release-parser

A library for parsing scene release names into human readable data.

Ambiguous TV shows such as "Wilfred.US" wreck title parsing #4

Closed dezza closed 4 months ago

dezza commented 6 months ago

Hello.

Nice lib, but I found one issue that I think needs to be fixed. I'll gladly help, as long as we can agree on the issue.

For example, Wilfred exists as both an AU and a US show.

AU (first released, 2007)

https://www.themoviedb.org/tv/3297

US (2011)

https://www.themoviedb.org/tv/39525-wilfred

This means that now the title is parsed as Wilfred US.

It would be a safe assumption that a capitalized country-code tag (US|UK|AU|NZ|CA) marks an ambiguous title and narrows it down to the specific show from the respective country.

Of course, on rare occasions a title could be something like Toys.R.Us, but it is unlikely that it would be capitalized. If so, that's a real corner case not worth optimizing for!

https://scenerules.org/html/2020_WDX_unformatted.html

    19.8) Different shows with the same title produced in different countries must have the ISO 3166-1 alpha 2 country code in the show name.
        19.8.1) Except for UK shows, which must use UK, not GB.
        19.8.2) This rule does not apply to an original show, only shows that succeed the original.
                e.g. The.Office.S01E01 and The.Office.US.S01E01.
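
As a rough illustration of rule 19.8, a country code can be picked out of the raw release name when it sits directly before the season/episode tag. This is only a sketch under stated assumptions, not how scene-release-parser works: the helper name `splitCountryFromRelease`, the list of codes, and the `SxxExx` tag format are all chosen for the example.

```javascript
// Hypothetical helper, not part of scene-release-parser.
// Assumes a dotted release name that contains an SxxExx tag.
const COUNTRY_CODES = ['US', 'UK', 'AU', 'NZ', 'CA']

function splitCountryFromRelease(releaseName) {
  // Title (lazy), then an optional country code, then the SxxExx tag.
  const pattern = new RegExp(
    `^(?<title>.+?)(?:\\.(?<country>${COUNTRY_CODES.join('|')}))?\\.S\\d{2}E\\d{2}(?:\\.|$)`,
    'u'
  )
  const match = pattern.exec(releaseName)
  if (!match) return null
  return {
    title: match.groups.title.split('.').join(' '),
    country: match.groups.country ?? null,
  }
}

console.log(splitCountryFromRelease('The.Office.US.S01E01'))
// { title: 'The Office', country: 'US' }
console.log(splitCountryFromRelease('The.Office.S01E01'))
// { title: 'The Office', country: null }
```

The match is case-sensitive on purpose, so a lowercase or mixed-case "Us" is left in the title.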
pr0pz commented 6 months ago

The way it's working right now is actually intended. So parsing "The.Office.US..." to "The Office US" is correct.

But that's not what you expected, correct? How would you expect to access this specific data?

dezza commented 6 months ago

    The way it's working right now is actually intended.
    So parsing "The.Office.US..." to "The Office US" is correct.
    But that's not what you expected, correct?
    How would you expect to access this specific data?

Ah ok I see what you mean.

Well, I expect "title" to be searchable on IMDb/themoviedb, that's simply why.

I guess the release country could be a separate field.
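
One possible shape for such a field, as a minimal sketch: the trailing country tag moves out of `title` into its own `country` property, so `title` stays searchable. The function name `withCountryField`, the object layout, and the `'tvshow'` type value are assumptions for illustration, not the library's API.

```javascript
// Hypothetical sketch: split a trailing country tag into a `country` field.
// Returns a new object and leaves the input untouched.
function withCountryField(parsed) {
  const match = /^(?<title>.+) (?<country>US|UK|AU|NZ|CA)$/u.exec(parsed.title)
  if (parsed.type !== 'tvshow' || !match) {
    // Not a TV show, or no trailing country tag: keep the title as-is.
    return { ...parsed, country: null }
  }
  return { ...parsed, title: match.groups.title, country: match.groups.country }
}

console.log(withCountryField({ title: 'Wilfred US', type: 'tvshow' }))
// { title: 'Wilfred', type: 'tvshow', country: 'US' }
```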

pr0pz commented 6 months ago

Yeah, I get your expectation, that's the reason why I initially made the release parser ;D

I'm still thinking about this

dezza commented 6 months ago

I wrote some logic for this that I think makes sense. It should show what I consider the most reasonable way to handle it.

If the next-to-last word is not "the", the country code is definitely not referring to an actual country.

/**
 * Strips a trailing country tag (e.g. "US") from a TV show title.
 * @param {SceneTags} scenetags
 */
function stripTVShowCountry(scenetags) {
  const lastElement = -1
  const words = scenetags.title.split(' ')
  if (scenetags.type === 'tvshow' &&
      // A one-word title has nothing left after stripping
      words.length > 1 &&
      // Anchored so only a whole-word country code matches
      /^(?:US|UK|NZ|AU|CA)$/u.test(words.at(lastElement)) &&
      // "... the US" refers to an actual country, so keep it
      words.at(lastElement - 1).toLowerCase() !== 'the'
   ) {
    scenetags.title = words.slice(0, lastElement).join(' ')
  }
  return scenetags
}

// Ends with country
console.log("Ends with country")
console.log(stripTVShowCountry({title: 'Wilfred US', type: 'tvshow'}))
console.log(stripTVShowCountry({title: 'Oy mate Crocodile Hunter AU', type: 'tvshow'}))

console.log()

// Ends with actual country, next last is "the". Concludes it's a real title
console.log("Ends with country, next last is 'the'. Concludes its a real title")
console.log(stripTVShowCountry({title: 'Soldiers in the US', type: 'movie'}))
console.log(stripTVShowCountry({title: 'Food in the US', type: 'tvshow'}))
console.log(stripTVShowCountry({title: 'Queen of the UK', type: 'tvshow'}))

Output:

Ends with country
{ title: 'Wilfred', type: 'tvshow' }
{ title: 'Oy mate Crocodile Hunter', type: 'tvshow' }

Ends with country, next last is 'the'. Concludes its a real title
{ title: 'Soldiers in the US', type: 'movie' }
{ title: 'Food in the US', type: 'tvshow' }
{ title: 'Queen of the UK', type: 'tvshow' }
pr0pz commented 6 months ago

Thanks for the code and yes, it's a good point to catch some words (like "the") before the country code. I'll dive a little bit into it to check for other possible special words.
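
A sketch of how that check might be generalized: the words that block stripping live in a set, so more can be added as they turn up. Only "the" comes from this thread; "in" and "of" are illustrative guesses rather than a vetted list, and `endsWithCountryTag` is a hypothetical name, not part of the library.

```javascript
// Hypothetical sketch: only 'the' comes from the discussion above;
// 'in' and 'of' are placeholder guesses to show the set can grow.
const GUARD_WORDS = new Set(['the', 'in', 'of'])
const COUNTRY_TAG = /^(?:US|UK|AU|NZ|CA)$/u

function endsWithCountryTag(title, guardWords = GUARD_WORDS) {
  const words = title.split(' ')
  if (words.length < 2) return false // nothing would be left of the title
  const last = words[words.length - 1]
  const previous = words[words.length - 2].toLowerCase()
  // A tag counts only if the word before it is not a guard word.
  return COUNTRY_TAG.test(last) && !guardWords.has(previous)
}

console.log(endsWithCountryTag('Wilfred US'))      // true
console.log(endsWithCountryTag('Queen of the UK')) // false
```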

pr0pz commented 4 months ago

So, I'm finally implementing this. Gonna add "country" as a new field. Your code really helped, adapted it to JS and PHP, tests are looking good.

pr0pz commented 4 months ago

Done with latest release

https://github.com/pr0pz/scene-release-parser/releases/tag/v1.5.0