tboothman / imdbphp

PHP library for retrieving film and tv information from IMDb

Person ID length is always 7 #201

Closed acurrieclark closed 4 years ago

acurrieclark commented 4 years ago

Description

When searching for a person by name, the ID which is extracted by the regex is always 7 numbers long, even if the result returns a URL with 8 numbers.

It seems to me that the regex nm(\d{7,8}) matches 7 numbers and then moves along, skipping the last character.

Would (\d{8}|\d{7}) work instead? Happy to put together a pull request if that helps.

Type

Bug

Expected Results / What do you want to do?

It should return an ID of length 8 if the id is actually that long

jreklund commented 4 years ago

Hi, is it PersonSearch->search() that gives you the incorrect result? What did you search for? And which person doesn't get included correctly?

That regex is fine, at least as far as the tests go. It says between 7 and 8 digits.

acurrieclark commented 4 years ago

For example John Zuberek who has an ID length of 8 in the URL.

Running the code

<?php

declare(strict_types=1);

use Imdb\PersonSearch;

include('./vendor/autoload.php');

$search = new PersonSearch();

$results = $search->search('John Zuberek');

var_dump(array_map(function($person) { return $person->imdbid(); }, $results));

results in

array(3) {
  [0]=>
  string(7) "1137352"
  [1]=>
  string(7) "4570312"
  [2]=>
  string(7) "0958261"
}

The first of these results corresponds to John's URL and should be 11373523. The final digit is truncated.

This is running on PHP 7.4, but I would imagine it is the same on prior versions.

duck7000 commented 4 years ago

Just a wild guess, could it be this difference in url?

your link above: https://www.imdb.com/name/nm11373523/

This is the link when I search on the IMDb page: https://www.imdb.com/name/nm11373523/?ref_=nv_sr_srsg_0

Quick question though, why doesn't imdbphp capture all digits after nm? Or split the URL on / ?

acurrieclark commented 4 years ago

It seems to me that nm(\d{7,8}) will always match 7 digits even if 8 are present. The inclusion of the U flag is likely the issue?

Example with U flag Example without U flag
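A minimal sketch of the difference, assuming a sample href like the one IMDb returns for this person (the `$html` string is illustrative, not taken from the library). With the `U` (ungreedy) modifier, `{7,8}` takes as few digits as it can; without it, the quantifier takes the eighth digit when present.

```php
<?php

declare(strict_types=1);

// Illustrative markup; the real search page contains much more.
$html = '<a href="/name/nm11373523/?ref_=fn_al_nm_1">John Zuberek</a>';

// With the U modifier, {7,8} is lazy and stops after 7 digits.
preg_match('@nm(\d{7,8})@U', $html, $lazy);

// Without U, {7,8} is greedy and consumes the eighth digit too.
preg_match('@nm(\d{7,8})@', $html, $greedy);

var_dump($lazy[1]);   // string(7) "1137352"
var_dump($greedy[1]); // string(8) "11373523"
```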

jreklund commented 4 years ago

That's correct. With the U flag, quantifiers are lazy by default, so it stops looking after finding 7 digits; I missed the U part of it yesterday. We need to get rid of the U flag and manually add lazy quantifiers to the parts that need them. Here's a quick fix for it, but it should probably be {7,} ("seven or more") instead of just +. That needs some more testing.

Line 58: @<a href="/name/nm(\d+)[^>]*?>([^<]+)</a>\s*?(.*?)</td>@is

Line 72: @<small>\((.*),\s*<a href="/title/tt(\d+)[^>]*>(.*)</a>\s*\((\d{4})\)\)@i

We should probably rewrite all the {7,8} patterns so that we don't have this kind of problem in the future, and add some tests for this, as PersonSearch doesn't have any covering people with 8-digit IDs.
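As a rough sketch of how the two quick-fix patterns behave, here they are applied to the sample result row quoted later in this thread. The `\d+` parts are tightened to `\d{7,}`, which is the "seven or more" variant suggested above and is an assumption, not necessarily what the library ended up with. The variable names are illustrative.

```php
<?php

declare(strict_types=1);

// Sample result row, copied from the markup quoted later in this thread.
$row = '<td class="result_text"> <a href="/name/nm11373523/?ref_=fn_al_nm_1">John Zuberek</a> '
     . '<small>(Actor, <a href="/title/tt11845260/?ref_=fn_al_nm_1a">The Interview</a> (2019))</small></td>';

// No U modifier: quantifiers are greedy unless marked lazy with "?".
// \d{7,} is an assumed tightening of the \d+ in the quick fix.
$personPattern = '@<a href="/name/nm(\d{7,})[^>]*?>([^<]+)</a>\s*?(.*?)</td>@is';
$detailPattern = '@<small>\((.*),\s*<a href="/title/tt(\d{7,})[^>]*>(.*)</a>\s*\((\d{4})\)\)@i';

// First pass: person ID, name, and the trailing <small> fragment.
preg_match($personPattern, $row, $person);

// Second pass: role, title ID, title, and year from that fragment.
preg_match($detailPattern, $person[3], $detail);

var_dump($person[1]); // string(8) "11373523"
var_dump($person[2]); // string(12) "John Zuberek"
var_dump($detail[1]); // string(5) "Actor"
var_dump($detail[2]); // string(8) "11845260"
var_dump($detail[3]); // string(13) "The Interview"
var_dump($detail[4]); // string(4) "2019"
```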

@duck7000 IMDb had 7 digits "forever" until it ran out a couple of years ago. So the quick fix was to change {7} into {7,8} instead of rewriting it to work with every digit after tt or nm. It also makes sure that we don't capture a too-short (or too-long) ID from a non-valid URL, as they are always at least 7 digits. Going from 7 to 8 digits added 90 000 000 new entries (if my math is correct), going from 10 000 000 to 100 000 000 possible IDs. So it will take a while for them to be used up. We will probably be dead before that happens.

As of January 2020, IMDb has approximately 6.5 million titles (including episodes) and 10.4 million personalities in its database,[3] as well as 83 million registered users. Launched: 17 October 1990; 29 years ago https://en.wikipedia.org/wiki/IMDb

duck7000 commented 4 years ago

@jreklund perfectly clear, as always haha

But I meant it more as a suggestion to get rid of regex; I think regexes are a pain in the butt.

jreklund commented 4 years ago

I'm all ears. Personally, I can't see how we would get rid of regex for this. Sure, some parts can be rewritten to use DOMDocument and loop over the DOM instead, but that will leave us with https://www.imdb.com/name/nm11373523/?ref_=nv_sr_srsg_0, and we can't just grab all the digits, as we would get an extra zero at the end. We need to validate the pattern. :-)

And then there is the harder part of grabbing the extra information from it, as <small> doesn't contain any unique DOM elements, so we can't single those properties out (role, mid, movie, year):

<td class="result_text"> <a href="/name/nm11373523/?ref_=fn_al_nm_1">John Zuberek</a> <small>(Actor, <a href="/title/tt11845260/?ref_=fn_al_nm_1a">The Interview</a> (2019))</small></td>
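To illustrate the point about validating the pattern, here is a minimal sketch that walks the links with DOMDocument and still uses a small anchored regex on each href. It assumes the dom extension is available; the wrapping table markup and variable names are made up for the example. It also shows why stripping every non-digit from the href goes wrong.

```php
<?php

declare(strict_types=1);

// Illustrative fragment; the href carries a ref_ parameter ending in a digit.
$html = '<table><tr><td class="result_text">'
      . '<a href="/name/nm11373523/?ref_=nv_sr_srsg_0">John Zuberek</a>'
      . '</td></tr></table>';

// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

$ids = [];
$naive = '';
foreach ($doc->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');

    // Naive approach: keep every digit in the href, including the trailing 0.
    $naive = preg_replace('/\D/', '', $href);

    // Validated approach: only digits directly after /name/nm, at least 7 of them.
    if (preg_match('@^/name/nm(\d{7,})/@', $href, $m)) {
        $ids[] = $m[1];
    }
}

var_dump($naive);  // string(9) "113735230"
var_dump($ids[0]); // string(8) "11373523"
```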

duck7000 commented 4 years ago

I'm just thinking out loud very simply right now, but would this be an option? After the search for a person or movie, the URL is known, correct?

$pathFragments = explode('/', $url);
 $id = filter_var($pathFragments[2], FILTER_SANITIZE_NUMBER_INT);

For the small tag I have no clue yet. Edit: added filter_var
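One caveat with the snippet above: index 2 only holds the ID segment for a relative href such as /name/nm11373523/; for a full https:// URL, index 2 is the host name. FILTER_SANITIZE_NUMBER_INT also keeps plus and minus signs. A regex-free sketch that handles both URL forms might look like the following; `idFromUrl` is a hypothetical helper, not part of imdbphp, and the 7-digit minimum follows the rule mentioned earlier in the thread.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: extract the numeric ID from an IMDb name or title URL
 * without a regex. Returns null when the path does not look valid.
 */
function idFromUrl(string $url): ?string
{
    // parse_url() handles absolute and relative URLs and drops the query string.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null;
    }

    // "/name/nm11373523/" becomes ['name', 'nm11373523'].
    $fragments = explode('/', trim($path, '/'));
    $candidate = $fragments[1] ?? '';

    // The segment must start with the nm or tt prefix.
    if (strncmp($candidate, 'nm', 2) !== 0 && strncmp($candidate, 'tt', 2) !== 0) {
        return null;
    }

    // Everything after the prefix must be digits, at least 7 of them.
    $digits = (string) substr($candidate, 2);
    return (ctype_digit($digits) && strlen($digits) >= 7) ? $digits : null;
}

var_dump(idFromUrl('https://www.imdb.com/name/nm11373523/?ref_=nv_sr_srsg_0')); // string(8) "11373523"
var_dump(idFromUrl('/name/nm11373523/?ref_=fn_al_nm_1'));                       // string(8) "11373523"
var_dump(idFromUrl('/name/nm123/'));                                            // NULL
```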

PoLaKoSz commented 4 years ago

I'm not a regex master, but I tried with \d+? and it worked. I wrote an additional test case that searches for John Zuberek and opened a PR (#203).
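A likely reason \d+? works here: under the U modifier the meaning of a trailing ? is inverted, so \d+? becomes greedy. The sketch below shows this on a sample href; the patterns are simplified for illustration and are not the exact ones from the PR.

```php
<?php

declare(strict_types=1);

// Illustrative href, as in the search result markup quoted earlier.
$href = '/name/nm11373523/?ref_=fn_al_nm_1';

// With U, "+?" is greedy: it consumes every digit up to the slash.
preg_match('@nm(\d+?)@U', $href, $withU);

// Without U, "+?" is lazy: one digit is enough to satisfy the pattern.
preg_match('@nm(\d+?)@', $href, $withoutU);

var_dump($withU[1]);    // string(8) "11373523"
var_dump($withoutU[1]); // string(1) "1"
```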