muttsnutts / mp4autotag

simple metadata for your mp4 movies and tv shows

Regexp for filename detection #2

Open muttsnutts opened 11 years ago

muttsnutts commented 11 years ago

Todo: (updated 2012.10.07)


Improve accuracy of searching based upon file name.

Currently failing on conventional scene-naming-schemes:

Notes:

popmedic commented 11 years ago

I think I got this one with: [POPMp4FileTagSearch.m]

    // set the search string by the filename.
    search_str = [[[tag filename] lastPathComponent] stringByDeletingPathExtension];
    search_str = [search_str stringByReplacingOccurrencesOfString:@"." withString:@" "];
    search_str = [search_str stringByReplacingOccurrencesOfString:@"_" withString:@" "];

any other chars you can think of that just need to be removed or replaced?
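
For comparison, the same cleanup can be sketched in Ruby. This is only an illustration of the steps above, not code from the project; the helper name `search_string_from` is made up, and collapsing repeated spaces is an extra assumption.

```ruby
# Hypothetical helper: turn a file path into a search string.
def search_string_from(path)
  # Drop the directory and the extension, like lastPathComponent plus
  # stringByDeletingPathExtension in the Objective-C version.
  base = File.basename(path, File.extname(path))
  # Replace "." and "_" with spaces, then tidy up the whitespace
  # (the tidy-up is an assumption, not part of the original snippet).
  base.gsub(/[._]/, " ").squeeze(" ").strip
end

puts search_string_from("/media/tv/Some.Show_S01E02.mp4")
```

Any further characters to strip or replace would only need to be added to the character class in `gsub`.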

muttsnutts commented 11 years ago

I'm gonna look into this more over the next week, mainly to see how XBMC, Plex, and Subler handle the various naming conventions.

I'll create a list of file names from various trackers and see how they match with each regexp.

To me, this is a VERY important step. If we can get it to match 90% of filenames correctly, the end user will be happy.

popmedic commented 11 years ago

Yes, yes, very important. Actually, I would say this is like 60% of what the application does, maybe even more. My current algorithm for this is easy to follow:

change "." and "_" to " "
Look for /\([0-9]{4}\)/
  *found: it's a movie, everything before the regex is the movie name.
  *not found: 
    Look for /E[0-9]+ /
    *found: It's a TV show!
      Look for /\-{0,1} *S[0-9]+E[0-9]+ *\-{0,1}/   
      *found: 
        use the numbers that follow S as the season, use the numbers that follow E as the episode.
        see if there is anything before the S, if so that is the show name.
      Check the directory that the file is in against "/season *[0-9]+/"
      *found: 
        if we did not get a season from the file name use the [0-9]+ as the season.
        if we did not get a show name, use the directory above this one as the show name
      *not found:
        if we did not get a show name, use this directory as the show name
Nothing matches, use the filename and search the Movie database with it.
Search the appropriate database.

Obviously this is a really simple, incomplete algorithm, but hopefully this outline gives us something to augment toward a more inclusive search.

First problem I see is: LOOK FOR THE TV SHOW STUFF FIRST. Why? Because some files will use ([0-9]{4}) in the show name to distinguish between newer versions of the show and older ones. Augmentation of the algorithm:

change "." and "_" to " "
Look for /E[0-9]+ /
*found: It's a TV show!
  Look for /\-{0,1} *S[0-9]+E[0-9]+ *\-{0,1}/
  *found:
    use the numbers that follow S as the season, use the numbers that follow E as the episode.
    see if there is anything before the S, if so that is the show name.
  Check the directory that the file is in against "/season *[0-9]+/"
  *found:
    if we did not get a season from the file name use the [0-9]+ as the season.
    if we did not get a show name, use the directory above this one as the show name
  *not found:
    if we did not get a show name, use this directory as the show name
*not found:
  Look for /\([0-9]{4}\)/
  *found:
    it's a movie, everything before the regex is the movie name.
    [0-9]{4} is the year of the movie.
  *not found:
    Nothing matches, use the filename and search the Movie database with it.
Search the appropriate database.
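
A rough Ruby sketch of the TV-first outline above, to make the ordering concrete. It is an illustration under assumptions, not the project's code: the method name `classify` and the returned hash keys are invented, the episode marker is also accepted at the very end of the name (the outline requires a trailing space), and the episode number is taken from the bare E marker when no S..E.. pair is found.

```ruby
# Hypothetical sketch of the outline; names and return shape are made up.
def classify(name, parent_dir = nil, grandparent_dir = nil)
  name = name.gsub(/[._]/, " ").strip

  # TV check comes first, so "Show (2005) S01E02" is not taken for a movie.
  if (ep = name.match(/E([0-9]+)(?: |\z)/))
    result = { type: :tv, show: nil, season: nil, episode: ep[1].to_i }

    # Season/episode pair, optionally wrapped in dashes and spaces.
    if (md = name.match(/-? *S([0-9]+)E([0-9]+) *-?/))
      result[:season]  = md[1].to_i
      result[:episode] = md[2].to_i
      show = md.pre_match.strip
      result[:show] = show unless show.empty?
    end

    # Fall back to the directory layout for a missing season or show name.
    if parent_dir && (dir = parent_dir.match(/season *([0-9]+)/i))
      result[:season] ||= dir[1].to_i
      result[:show]   ||= grandparent_dir
    else
      result[:show] ||= parent_dir
    end
    result
  elsif (md = name.match(/\(([0-9]{4})\)/))
    # Four digits in parentheses: treat as a movie with a release year.
    { type: :movie, title: md.pre_match.strip, year: md[1].to_i }
  else
    # Nothing matched: search the movie database with the whole name.
    { type: :movie, title: name, year: nil }
  end
end

p classify("Doctor.Who.(2005).S01E02")
p classify("The.Matrix.(1999).720p")
```

The patterns are case-sensitive, as in the outline, so a lowercase "s01e02" would fall through to the movie branch in this sketch.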

Kevin.

muttsnutts commented 11 years ago

Sorry I haven't gotten to this yet. It's breast cancer month, which means lots of after work non-profit events for me (not that I'm complaining).

I think that your workflow is solid. A few thoughts/questions:

  1. Does replacing the period . adversely affect searching for something like "Tosh.0"?
  2. How does the tvdb API handle other special characters: colon :, apostrophe ', dash -, ampersand &, exclamation mark !?
  3. When assuming a movie with /\([0-9]{4}\)/, I think we next need to confirm that the number is greater than 1900. It turns out some groups name their episodes like 0113 instead of S01E13 (sigh). Also, per Yahoo, the first motion picture ever created was in 1878 =P
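
A minimal Ruby sketch of the check in point 3, assuming the year appears in parentheses as in the regex above. The helper names are hypothetical; the 1900 floor comes from the point above, and the upper bound of next year is an added assumption.

```ruby
# Hypothetical helper: is this four-digit string a believable release year?
# The floor of 1900 follows the suggestion above; the ceiling is an assumption.
def plausible_year?(digits, floor: 1900, ceiling: Time.now.year + 1)
  year = digits.to_i # "0113".to_i is 113, so SSEE-style numbers fail the check
  year > floor && year <= ceiling
end

# Return the year for a movie-style name, or nil when there is no
# parenthesised four-digit number or it is not a plausible year.
def movie_year(name)
  md = name.match(/\(([0-9]{4})\)/)
  md && plausible_year?(md[1]) ? md[1].to_i : nil
end

p movie_year("The Matrix (1999)")
p movie_year("Some Show (0113)")
```

A name that fails the check could then be handed back to the TV-show branch instead of being searched as a movie.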

Also "LOOK FOR THE TV SHOW STUFF FIRST. Why? Because some files will use ([0-9]{4}) in the show name to distinguish between newer versions of the show and older ones.".... You were referring to "Teenage Mutant Ninja Turtles 2012" right ?!? hahaha

Just to prove that there are lots of shows with special characters, here are just a few:

colon : NCIS: Los Angeles, Star Wars: The Clone Wars, Anthony Bourdain: No Reservations, CSI: NY
apostrophe ' Fast N' Loud, Bob's Burgers, Kickin' It, How It's Made, Grey's Anatomy
dash - Hawaii Five-0, Ultimate Spider-Man
ampersand & Mike & Molly, Brothers & Sisters, Law & Order
exclamation mark ! American Dad!, Superjail!

popmedic commented 11 years ago

Well,

I put all the search logic into a CGI script on popmedic.com so that all searches are proxied through there. This should be faster, and it means the file detection logic can be done in Ruby. This option is on by default.

The new logic is below:

    serstr = ''
    seastr = '0'
    epistr = '0'
    is_movie = true
    #check to see if this basestr is a show
    #first check for /e([0-9]+)/i (an episode marker such as "e05")
    if((md = /e([0-9]+)/i.match(basestr)) != nil)
      is_movie = false
      epistr = md[1]
      #see if we have a series name...
      if((md = /(.+) e[0-9]+/i.match(basestr)) != nil)
        serstr = md[1].strip
      end
      #see if there is a /s([0-9]+)/i for a season...
      if((md = /s([0-9]+)/i.match(basestr)) != nil)
        seastr = md[1]
        #see if we have a series name...
        if((md = /(.+) s[0-9]+ *e[0-9]+/i.match(basestr)) != nil)
          serstr = md[1].strip
        end
      end
    #maybe we have a /([0-9]+)x([0-9]+)/i, could be a SxE...
    elsif((md = /([0-9]+)x([0-9]+)/i.match(basestr)) != nil)
      is_movie = false
      epistr = md[2]
      seastr = md[1]
      #see if we have a series name...
      if((md = /(.+) [0-9]+x[0-9]+/i.match(basestr)) != nil)
        serstr = md[1].strip
      end
    #maybe we have a /([0-9]{4})/ could be a date, or a SSEE...
    #elsif((md = /([0-9]{4})/i.match(filename_str)) != nil)
    end
    #if we don't have a movie and we have a parent dir string, check the parent for a season...
    if(is_movie == false && parentdir_str != nil)
      #see if the parentdir_str has /season ([0-9]+)/
      if((md = /season ([0-9]+)/i.match(parentdir_str)) != nil)
        if(grandparentdir_str != nil)
          parentdir_str = grandparentdir_str
        end
        if(seastr == '0')
          seastr = md[1]
        end
      end
    end
    #if we don't have a movie and we don't have a serstr and we have a parent dir string, make the serstr the parent...
    if(is_movie == false && serstr == '' && parentdir_str != nil)
      serstr = parentdir_str
    end
    rtn = []
    movstr = ''
    yearstr= ''
    #if we have a movie, do a movie search
    if(is_movie)
      if((md = /(.+) {0,1}\({0,1}([0-9]{4})\){0,1}/i.match(basestr)) != nil)
        movstr = md[1].chomp("(")
        movstr.chomp!(" ")
        yearstr = md[2].chomp
      else
        movstr = basestr
      end
      rtn = Search.movie_search(basestr, movstr, yearstr)
    #otherwise do a show search
    else
      rtn = Search.show_search(basestr, serstr, seastr, epistr)
    end
    #if we still have nothing, and we did not do a movie search...
    if(rtn.count == 0 && !is_movie)
      rtn = Search.movie_search(basestr)
    end

    #now if it is a use_itunes request, 
    if(use_itunes == 1)
      #get the images from itunes
      rtn.each do |tag|
        if(tag["Media Type"]["value"] == 'tvshow')
          img_path = SearchITunes.get_image({"serstr" => tag["TV Show"]['value'], "seastr" => tag['TV Season']['value']}, false)
        else
          img_path = SearchITunes.get_image({"movstr" => tag["TV Show"]['value'], "yearstr" => tag['Release Date']['value'].to_i().to_s()}, true)
        end
        #self.dbug(img_path)
        if(img_path != nil && img_path != "")
          tag["Image Path"] = img_path
        end
      end
    end

popmedic commented 11 years ago

So, I haven't heard from Mutts Nutts for about 3 months, so I am on my own with this project again (and now it is under his GitHub account). Anyway, I had an idea and ran with it. I figured that moving the filename search logic and web queries to a proxy CGI/server on a web host would speed the searches up and lighten the load on one's personal computer. Also, and most importantly, as I improve the logic, users will not have to update the client.

In addition, this chunk of code is written in Ruby. (My host does not allow me to run WEBrick on the server, so I made it a CGI. There is also an mp4autotag_server.rb that will run the server standalone.) Ruby is a superior language to Objective-C when it comes to string parsing and simplicity, so this makes modifying and perfecting the search logic easier and faster. I left all the original search logic in the client application and added a preference to use the popmedic search proxy. This preference is on by default because I would rather have users go through this proxy than the application logic, which can only be updated by a client update. It also leaves me a good way to add ad support if the following gets big enough and I think I could make a little $$$ off this work.

This process is finished, and after adding some slick server-side caching I have successfully sped up the searches and implemented this design. Please take a look at the code, especially the server-side Ruby scripts; I think they are slick, and the logic works well so far.