tboothman / imdbphp

PHP library for retrieving film and tv information from IMDb
253 stars 84 forks source link

Make budget more reliable #246

Closed duck7000 closed 2 years ago

duck7000 commented 3 years ago

Budget used data from script, this is however not reliable. Changed it to a more reliable method that scrapes the actual page.

I do have however a few thoughts about this method:

duck7000 commented 3 years ago

@jreklund According to the method that is in the source code right now it does return int

    /*
  * Get budget
  * @return int|null null on failure / no data
  * @brief Assuming budget is estimated, and in american dollar
  * @see IMDB page / (TitlePage)
  */
    public function budget()
    {
        if (empty($this->budget)) {
            $query = $this->XmlNextJson()->xpath("//productionBudget/budget/amount");
            if (!empty($query) && isset($query[0])) {
                $this->budget = intval(str_replace(",", "", $query[0]));
            } else {
                return null;
            }
        }
        return $this->budget;
    }

I made it more reliable but your suggesting is perhaps a better option, i'll look in to that. And yes it assumes that budget always the first one, which it to my tests always is if it is available.

duck7000 commented 3 years ago

The json part is unreliable, that is what i have rewritten.

Above are 2 thoughts, i think this method does not return the right data, it returns only int but not the currency, so it is impossible to know what that number stands for?

duck7000 commented 3 years ago

I have changed your suggestion about getting the data, that works great, better than my solution, thanks

I did not adjust the PR jet, the output of this method is in version 6.5 and 6.2.2 also int. Remains the question: int or string return? string would be better that can include the currency.

jreklund commented 3 years ago

Current behavior are indeed returning int or null, but the suggestion solution are string or null. That's what I meant by "this function doesn't return an int anymore". filter_var are also stripping $ and (estimate) if I'm not mistaken.

For version 7; int or null For version 8; What we choose it to be

Only option to include both in 7 are to include parameters so that it can return the currency as well, if selected.

public function budget($currency = false)

Regarding the correctness of currency, we can never grantee it with the suggested change. Json however can:

{"budget":{"amount":93000000,"currency":"USD","__typename":"Money"}

But as all movies are currently displayed in USD, even Sweden produced ones, even if you select Sweden/Swedish in your IMDb profile, it's "safe" to assume all numbers will be USD regardless if we return $ or not.

Have you found a movie/tv-show not having "estimated" under budget?

duck7000 commented 3 years ago

this movie has New Zealand dollars https://www.imdb.com/title/tt0092610/

this one has dutch guilders https://www.imdb.com/title/tt0070842/

this one has Australian dollars: https://www.imdb.com/title/tt0087981/

this one Canadian dollars https://www.imdb.com/title/tt0123755/

I don't even have a imdb profile and i don't use language settings in imdbphp

So yes there are differences, assuming they all are in dollars is wrong, Although most movies will be in dollars i agree

duck7000 commented 3 years ago

No all tested movies have estimated.

Mm intval does roughly the same, returns int according to value of input, That didn't work when i tested it keep getting no data or 0 filter_var does the same and yes filters out $ en estimated to get only int, that is how it always has been.

Well now i'm confused, what is the supposed output of this method? $ is not an int, neither is (estimated) ?

Maybe you are talking about string with numbers versus int? I guess filter_var does return string with numbers? if so then yes technically that is not int. Is this important? I mean the data from the imdb page is technically also string? Hard to understand..

duck7000 commented 3 years ago

this the output from the last movie above First value is coming from this var_dump($budgetListItems[0]->nodeValue); Second value (0) is the actual output from the method, this is with intval. string(22) "CA$365,000 (estimated)" 0

So the actual value that is coming from imdb is string? Converting to int does not work with intval, it does with filter_var but that leaf it as a string i guess

I'm not trying to be a wise guy here as i have a hard time understanding the difference in this case.

the php manual about intval states this: Strings will most likely return 0 although this depends on the leftmost characters of the string. The common rules of integer casting apply. So in this case intval does not seems a good solution?

duck7000 commented 3 years ago

I think i'm getting to the point of understanding..

$this->budget = (int) filter_var($budgetListItems[0]->nodeValue, FILTER_SANITIZE_NUMBER_INT); Cast the output to int, output is now int. This is acceptable i guess?

Remains the currency issue? I agree that if we want to keep backwards compatibility only a parameter is a valid option. In that case check with strpos the position of the first digit and capture all until start of string? Glue this together with the found integers? The result will be a string i guess?

@tboothman what do you think about this?

jreklund commented 3 years ago

As this project are being utilized in many different projects we need to make sure to return the same data as before, so that we don't break any applications. At the moment we aren't forcing the output to be int or null, it's just applied. Changing it to a string containing numbers can break programs in case they don't validate it as an int before saving it.

intval() can only cast numbers, in the previous code we replaced comma with empty character just to make sure it only contained numbers before casting it. As we are getting a formatted string from IMDb we need to remove everything but numbers from it, currently we can use FILTER_SANITIZE_NUMBER_INT as the string only contains numbers for the budget (as of right now).

That function however take every number, + and -, ending up with this (not likely at all, just an example).

$number="NZ$200,000 (estimated) 222";
var_dump(filter_var($number, FILTER_SANITIZE_NUMBER_INT));
string(9) "200000222"

We can also use preg_replace to achieve the same thing (it removes + and - too).

$number="NZ$200,000 (estimated) 222";
var_dump(preg_replace('/[^0-9]/', '', $number));
string(9) "200000222"

In this case I would use regex to grab everything. We get currency, amount and comment that way. After that remove comma and cast it as int just as before.

$re = '/^(?<currency>[^0-9\s]+)?\s*(?<amount>[0-9,]+)\s*(?<comment>.{1,})?$/m';
$str = 'NZ$200,000 (estimated)';
preg_match($re, $str, $matches);
var_dump($matches);

array(7) {
  [0]=>
  string(22) "NZ$200,000 (estimated)"
  ["currency"]=>
  string(3) "NZ$"
  [1]=>
  string(3) "NZ$"
  ["amount"]=>
  string(7) "200,000"
  [2]=>
  string(7) "200,000"
  ["comment"]=>
  string(11) "(estimated)"
  [3]=>
  string(11) "(estimated)"
}

The problem still remain that NZ$ aren't a real currency. It's NZD. So it looks like we need to replace $ with D, and just $ with USD etc. So it's a lot to keep track of, in case we want different output. We also need to write tests for everything, the more that we put into the library, the more stuff can break. :)

duck7000 commented 3 years ago

Clearly explained, thanks!

For now the best is to output int, and leave the rest out. i agree that intval could be better but in this case it didn't work, probably because of the letters in front of the digits?

So what to do? leave it with filter_var or strip letters from beginning of the string and use intval?

jreklund commented 3 years ago

You will need to remove commas as well to be able to use intval. I think filter_val are a neat solution as I can see both filter_val and my regex fail in the future.

The regex code will accept titles without currency and comment as it grabs the first "number" string as the budget, future proofing as much as possible. So if you use that one you will need to strip commas and use intval.

e.g.

$budget = 'NZ$200,000 (estimated)';
if (preg_match('/^(?<currency>[^0-9\s]+)?\s*(?<amount>[0-9,]+)\s*(?<comment>.{1,})?$/', $budget, $match)) {
    if (isset($match['amount']) && !empty($match['amount'])) {
        $this->budget = intval(str_replace(',', '', $match['amount']));
    }
}
duck7000 commented 3 years ago

I think i go with your regex than, although i rather did use a solution without a regex. But in this case that would be very hard, hard to maintain, and not very future proof i have to admit.

duck7000 commented 3 years ago

I'm sorry, I don't know/understand why there is no more reaction on this PR or my other PR's for that matter? So Closing this one too.

duck7000 commented 2 years ago

On request by @Thomasdouscha i will reopen

duck7000 commented 2 years ago

@jreklund Is this PR something you will be continuing?

Thomasdouscha commented 2 years ago

and what about status ?

duck7000 commented 2 years ago

The last status is a discussion about how the budget is returned, in numbers only or leading text like us$ or USD. I still think that this does not cover all cases as there are different currency used, but @jreklund is absolutely right that will alter the output of this method and break backwards compatibility.

The question is what is more important, give back only numbers and forget about currency of give back the right currency with the number

Only option for now seems to be, like @jreklund suggested, is to use a parameter in this method which change the output to numbers with currency or not. the maintainability will be getting harder, like @jreklund already mentioned