Closed acurrieclark closed 4 years ago
Hi, is it PersonSearch->search() that gives you the incorrect result? What did you search for, and which person doesn't get included correctly?
That regex is fine, at least as far as the tests go. It says 7 or 8 digits.
For example, John Zuberek, who has an ID of length 8 in the URL.
Running the code
<?php
declare(strict_types=1);
use Imdb\PersonSearch;
include('./vendor/autoload.php');
$search = new PersonSearch();
$results = $search->search('John Zuberek');
var_dump(array_map(function($person) { return $person->imdbid(); }, $results));
results in
array(3) {
[0]=>
string(7) "1137352"
[1]=>
string(7) "4570312"
[2]=>
string(7) "0958261"
}
The first of these results corresponds to John's URL and should be 11373523. The final digit is truncated.
This is running on PHP 7.4, but I would imagine it is the same on prior versions.
Just a wild guess, could it be this difference in URL?
Your link above: https://www.imdb.com/name/nm11373523/
This is the link when I search on the IMDb page: https://www.imdb.com/name/nm11373523/?ref_=nv_sr_srsg_0
Quick question though: why doesn't imdbphp capture all digits after nm? Or split the URL on / ?
It seems to me that nm(\d{7,8}) will always match 7 digits even if 8 are present. The inclusion of the U flag is likely the issue?
That's correct. With the U flag every quantifier is lazy by default, so it stops looking after finding 7 digits; I missed the U part of it yesterday. We need to get rid of the U flag and manually add laziness to the parts that need it. Here's a quick fix for it, but it should probably be {7,} ("seven or more") instead of just +. But that needs some more testing.
Line 58: @<a href="/name/nm(\d+)[^>]*?>([^<]+)</a>\s*?(.*?)</td>@is
Line 72: @<small>\((.*),\s*<a href="/title/tt(\d+)[^>]*>(.*)</a>\s*\((\d{4})\)\)@i
We should probably rewrite all {7,8} occurrences so that we don't have this kind of problem in the future, and add some tests for this, as PersonSearch doesn't have any covering persons with 8-digit IDs.
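To make the effect of the U modifier concrete, here is a minimal sketch. The patterns are simplified stand-ins rather than the exact expressions used in imdbphp, and the subject string is just the path fragment from the URL discussed above.

```php
<?php
declare(strict_types=1);

// Simplified stand-in for the href value; not the real search page markup.
$subject = '/name/nm11373523/';

// With the U (PCRE_UNGREEDY) modifier, {7,8} is lazy and stops at the minimum.
preg_match('@nm(\d{7,8})@U', $subject, $lazy);

// Without U the same quantifier is greedy and takes all 8 digits.
preg_match('@nm(\d{7,8})@', $subject, $greedy);

// "Seven or more" without U keeps working if IDs ever grow past 8 digits.
preg_match('@nm(\d{7,})@', $subject, $open);

var_dump($lazy[1], $greedy[1], $open[1]);
// string(7) "1137352"
// string(8) "11373523"
// string(8) "11373523"
```

The first result reproduces the truncated ID from the report; the other two show that dropping U is enough to recover the full ID.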
@duck7000 IMDb had 7 digits "forever" until it ran out a couple of years ago. So the quick fix was to change {7} into {7,8} instead of rewriting it to work with any number of digits after tt or nm. It also makes sure that we don't capture a too short (or too long), non-valid URL, as IDs are always at least 7 digits. Going from 7 to 8 digits added 90 000 000 new entries (if my math is correct), going from 10 000 000 to 100 000 000. So it would take a while for them to be used up. We would probably be dead before that happens.
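The arithmetic can be checked in a couple of lines. This sketch assumes every digit combination is a usable ID, which matches the zero-padded IDs in the output above but is otherwise an assumption.

```php
<?php
declare(strict_types=1);

// n zero-padded digits allow 10^n distinct values.
$sevenDigit = 10 ** 7;               // 10 000 000
$eightDigit = 10 ** 8;               // 100 000 000
$added = $eightDigit - $sevenDigit;  // IDs gained by the extra digit

echo number_format($added, 0, '.', ' '), PHP_EOL;
// 90 000 000
```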
As of January 2020, IMDb has approximately 6.5 million titles (including episodes) and 10.4 million personalities in its database,[3] as well as 83 million registered users. Launched: 17 October 1990; 29 years ago https://en.wikipedia.org/wiki/IMDb
@jreklund perfectly clear, as always haha
But I meant it more as a way to get rid of regex; I think regexes are a pain in the butt..
I'm all ears. Personally I can't see how we would get rid of regex for this. Sure, some parts can be rewritten to use DOMDocument and loop over the DOM instead, but that will leave us with https://www.imdb.com/name/nm11373523/?ref_=nv_sr_srsg_0 and we can't just grab all digits, as we would get an extra zero at the end. We need to validate the pattern. :-)
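A small sketch of the "extra zero" problem, using the URL quoted above. The validating pattern is only an illustration of the idea, not the expression imdbphp uses.

```php
<?php
declare(strict_types=1);

$url = 'https://www.imdb.com/name/nm11373523/?ref_=nv_sr_srsg_0';

// Naive: strip everything that is not a digit. The 0 from "srsg_0" tags along.
$naive = preg_replace('/\D/', '', $url);

// Validated: require "/name/nm", at least 7 digits, then a slash or end of string.
$validated = preg_match('@/name/nm(\d{7,})(?:/|$)@', $url, $m) === 1 ? $m[1] : null;

var_dump($naive, $validated);
// string(9) "113735230"
// string(8) "11373523"
```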
And then there is the harder part of grabbing the extra information from it, as <small> doesn't have any unique DOM elements, so we can't single those properties out (role, mid, movie, year).
<td class="result_text"> <a href="/name/nm11373523/?ref_=fn_al_nm_1">John Zuberek</a> <small>(Actor, <a href="/title/tt11845260/?ref_=fn_al_nm_1a">The Interview</a> (2019))</small></td>
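As a rough sketch of what the DOM route would look like on this cell: the link and name come out cleanly, but the text inside <small> is a single string, so role and year still need a pattern. The wrapping <table> and the final regex are assumptions made for the example, and it relies on PHP's DOM extension being available.

```php
<?php
declare(strict_types=1);

// The result cell quoted above, wrapped in a table so the parser keeps the <td>.
$html = '<table><tr><td class="result_text"> <a href="/name/nm11373523/?ref_=fn_al_nm_1">John Zuberek</a> '
      . '<small>(Actor, <a href="/title/tt11845260/?ref_=fn_al_nm_1a">The Interview</a> (2019))</small></td></tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// The DOM gives us the person link and name without any regex.
$link = $xpath->query('//td[@class="result_text"]/a')->item(0);
$href = $link->getAttribute('href');
$name = $link->textContent;

// The <small> text is one undifferentiated string; role and year still need a pattern.
$small = $xpath->query('//td[@class="result_text"]/small')->item(0)->textContent;
preg_match('/^\((\w+),.*\((\d{4})\)\)$/', $small, $parts);

var_dump($href, $name, $small, $parts[1], $parts[2]);
```

The href still carries the ?ref_= suffix, so the ID has to be validated with a pattern afterwards either way.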
I'm just thinking out loud here, very simply, but would this be an option? After the search for a person or movie the URL is known, correct?
$pathFragments = explode('/', $url);
$id = filter_var($pathFragments[2], FILTER_SANITIZE_NUMBER_INT);
For the small tag I have no clue yet.. Edit: added filter_var
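One caveat with a fixed index is that it depends on whether the URL is relative or absolute. A possible variant, sketched here with a hypothetical helper idFromUrl() that is not part of imdbphp, takes the path with parse_url() first and then keeps the explode/filter_var idea, plus a minimum-length check.

```php
<?php
declare(strict_types=1);

// Hypothetical helper, not part of imdbphp: path first, then split it.
function idFromUrl(string $url): ?string
{
    $path = parse_url($url, PHP_URL_PATH);         // drops scheme, host and ?ref_=...
    if (!is_string($path)) {
        return null;
    }
    $fragments = explode('/', trim($path, '/'));   // e.g. ['name', 'nm11373523']
    if (count($fragments) < 2) {
        return null;
    }
    // Same sanitising step as in the comment above; it strips the "nm" prefix.
    $digits = (string) filter_var($fragments[1], FILTER_SANITIZE_NUMBER_INT);

    // IDs are at least 7 digits, so anything shorter is rejected.
    return strlen($digits) >= 7 ? $digits : null;
}

var_dump(idFromUrl('https://www.imdb.com/name/nm11373523/?ref_=nv_sr_srsg_0'));
var_dump(idFromUrl('/name/nm11373523/?ref_=fn_al_nm_1'));
// string(8) "11373523"
// string(8) "11373523"
```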
I'm not a regex master, but I tried with \d+? and it worked. I wrote an additional test case for searching for John Zuberek and opened a PR ( #203 ).
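A likely reason \d+? works, assuming the pattern still carries the U modifier: under U, a trailing ? makes a quantifier greedy instead of lazy. The patterns below are simplified stand-ins, not the ones from the PR.

```php
<?php
declare(strict_types=1);

$subject = '/name/nm11373523/';

// Under the U modifier a trailing ? flips the quantifier back to greedy.
preg_match('@nm(\d+?)@U', $subject, $withU);

// Without U the same ? makes it lazy, and a lazy + stops after one digit.
preg_match('@nm(\d+?)@', $subject, $withoutU);

var_dump($withU[1], $withoutU[1]);
// string(8) "11373523"
// string(1) "1"
```

So the same expression would break again if the U modifier were later removed, which is one argument for dropping U and writing laziness explicitly.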
Description
When searching for a person by name, the ID which is extracted by the regex is always 7 numbers long, even if the result returns a URL with 8 numbers.
It seems to me that the regex nm(\d{7,8}) matches 7 numbers and then moves along, skipping the last character. Would (\d{8}|\d{7}) work instead? Happy to put together a pull request if that helps.
Type
Bug
Expected Results / What do you want to do?
It should return an ID of length 8 if the id is actually that long
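The alternation suggested in the description can be tried in isolation. This is a sketch with a simplified pattern; the second subject string is assumed from the 7-digit ID 0958261 in the output above.

```php
<?php
declare(strict_types=1);

// Exact counts like {8} have no lazy/greedy choice, so U does not shorten them.
$pattern = '@nm(\d{8}|\d{7})@U';

preg_match($pattern, '/name/nm11373523/', $eight);
preg_match($pattern, '/name/nm0958261/', $seven);   // assumed URL form for a 7-digit ID

var_dump($eight[1], $seven[1]);
// string(8) "11373523"
// string(7) "0958261"
```

The order of the alternatives matters: with \d{7} first, the 8-digit ID would be cut to 7 digits again.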