tboothman / imdbphp

PHP library for retrieving film and tv information from IMDb

rewrite alternateversions #228

Closed duck7000 closed 3 years ago

duck7000 commented 3 years ago
jreklund commented 3 years ago

Do you want to remove all li tags, or keep some of them? Take this movie for example: do you want them removed there, and if so, what's the purpose?

https://www.imdb.com/title/tt0120737/alternateversions

duck7000 commented 3 years ago

@jreklund Thanks for your comment.

In the movie you refer to I want to keep those li tags, but filter out combinations like "<ul><li>" and "</li><li>" that sometimes occur. I don't remember which movie that was, though. The array("<ul>", "<li>") you suggest will filter out all li and ul tags; then there is no structure anymore and all the text will run together.

It would be better to remove all HTML tags, but I don't know how to do that in a way that keeps the text structured. I thought about replacing them with line breaks, but that didn't turn out pretty.

This movie has the same problem: https://www.imdb.com/title/tt0094651/

I use PHP 7.4 on my server and there it works without crashing?

jreklund commented 3 years ago

Sounds like a movie with an indentation inside an indentation (a nested list). If you want to keep all indentation, then my suggested settings would suffice, as you tell strip_tags which HTML tags to keep.

Like this one:

strip_tags(string $string, array|string|null $allowed_tags = null): string
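A minimal sketch of how `$allowed_tags` behaves, since the thread disagrees on whether it lists tags to keep or tags to remove. The HTML string is made up for illustration and is not taken from an IMDb page. Note that, per the PHP manual, the array form of `$allowed_tags` is only accepted from PHP 7.4 onward, which may be why a PHP 7.3 build would fail; the string form shown here works on older versions too.

```php
<?php
// Illustrative input, not real IMDb markup.
$html = '<div>Scenes cut:<ul><li><b>Opening</b> chase</li><li>Final fight</li></ul></div>';

// $allowed_tags lists the tags to KEEP: <ul> and <li> survive,
// every other tag (<div>, <b>) is stripped.
$kept = strip_tags($html, '<ul><li>');

// Without $allowed_tags every tag is removed and the text runs together.
$plain = strip_tags($html);

echo $kept, PHP_EOL;
echo $plain, PHP_EOL;
```

So passing `<ul>` and `<li>` preserves the list structure rather than removing it.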

Travis only failed on PHP 7.3; PHP 7.4 and PHP 8 worked fine.

duck7000 commented 3 years ago

Ideally I would have those alternate versions without any HTML tags, but how can the text stay structured? With a * or # in front of each line?

Maybe I should take this back to the drawing board... I rewrote this to use XPath; the output is now nodeValue, so all HTML tags are apparently already gone? strip_tags doesn't do anything..
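A small sketch of why strip_tags has nothing left to do here: `nodeValue` on a DOM element returns the concatenated text of all its descendants, with no markup. The HTML fragment and the class name in the query are assumptions for illustration only.

```php
<?php
$doc = new DOMDocument();
// Illustrative fragment; real IMDb markup may differ.
$doc->loadHTML('<div class="soda odd">Intro text:<ul><li>First <b>cut</b></li><li>Second cut</li></ul></div>');

$xpath = new DOMXPath($doc);
$cell = $xpath->query('//div[@class="soda odd"]')->item(0);

// nodeValue is plain text: the tags are gone, but so are the list boundaries.
$text = $cell->nodeValue;

// strip_tags is a no-op on text that no longer contains tags.
$unchanged = (strip_tags($text) === $text);

echo $text, PHP_EOL;
```

The list items end up glued together, which is why the structure has to be rebuilt from the individual nodes instead.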

duck7000 commented 3 years ago

@jreklund

I have brewed something together. It's not ready yet, one part is still missing.

    /**
     * Get the Alternate Versions for a given movie
     * @return array Alternate Version (array[0..n] of string)
     * @see IMDB page /alternateversions
     */
    public function alternateVersions()
    {
        if (empty($this->moviealternateversions)) {
            $xpath = $this->getXpathPage("AlternateVersions");
            if ($xpath->evaluate("//div[contains(@id,'no_content')]")->count()) {
                return array();
            }
            $cells = $xpath->query("//div[@class=\"soda odd\" or @class=\"soda even\"]");
            foreach ($cells as $cell) {
                $count = $cell->getElementsByTagName('li')->count();
                if ($count) {
                    // TODO: still need only the lead-in text of the div here, nothing else.
                    $listItems = ''; // reset per cell so items from earlier cells don't accumulate
                    $items = $cell->getElementsByTagName('li');
                    foreach ($items as $key => $value) {
                        $listItems .= '# ' . trim($value->nodeValue);
                        if ($key < $count - 1) {
                            $listItems .= '&#10;';
                        }
                    }
                    $this->moviealternateversions[] = $listItems;
                } else {
                    $this->moviealternateversions[] = trim($cell->nodeValue);
                }
            }
        }
        return $this->moviealternateversions;
    }

If the alternate version contains list items, I process them separately: add a sign in front (I used #) and close with a line break. This structures the text roughly like a normal list. Is this an acceptable solution?

The only problem I can't seem to solve is the leading text in the surrounding div, like this line for the movie Amsterdamned: "In the television version of the film, scenes deleted include:". It is included in $cell, but I can't find an XPath way to get just that part. Can you help me with this one?
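One possible way to get only the lead-in text, sketched under the assumption that it sits directly inside the div as a text node next to the `<ul>`: the XPath step `text()`, evaluated relative to `$cell`, selects only direct text children and skips the list. The markup below is a made-up stand-in, not the actual IMDb page.

```php
<?php
$doc = new DOMDocument();
// Illustrative fragment; real IMDb markup may differ.
$doc->loadHTML('<div class="soda odd">Scenes deleted include:<ul><li>First scene</li><li>Second scene</li></ul></div>');

$xpath = new DOMXPath($doc);
$cell = $xpath->query('//div[@class="soda odd"]')->item(0);

// "text()" with $cell as context node matches only the text nodes that are
// direct children of the div, so the <li> contents are not included.
$leadIn = '';
foreach ($xpath->query('text()', $cell) as $textNode) {
    $leadIn .= $textNode->nodeValue;
}
$leadIn = trim($leadIn);

echo $leadIn, PHP_EOL;
```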

duck7000 commented 3 years ago

I found a solution, finally, it works but it feels like a workaround?

Travis has a mind of its own..

jreklund commented 3 years ago

Personally I'm more for using a dash (–) instead of implementing HTML, and for using new lines (\n) instead of a non-breaking space, if we should manipulate the data at all. What does @tboothman think about the matter?

duck7000 commented 3 years ago

@jreklund Thanks for commenting.

I used a bullet sign so it looks like a list item, but any character is fine by me.

duck7000 commented 3 years ago

Line breaks are visible in my output, while the &#10; character is not; it is shown as a space. And &#10; is not a non-breaking space.. https://unicodelookup.com/#&#10;/1

jreklund commented 3 years ago

New lines (\n) are special characters and need to be written in double quotes, "\n", not '\n'. Single-quoted strings are taken almost entirely literally, so '\n' prints a backslash followed by an n. Double-quoted strings interpret variables, escape sequences, etc.
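A short illustration of the quoting difference, using two made-up list items:

```php
<?php
$items = ['First scene', 'Second scene'];

// Single quotes: '\n' is two literal characters, a backslash and an "n".
$single = implode('\n', $items);

// Double quotes: "\n" is one character, a real line feed.
$double = implode("\n", $items);

echo $single, PHP_EOL; // everything on one line, with a visible \n in it
echo $double, PHP_EOL; // the two items on separate lines
```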

duck7000 commented 3 years ago

Right... yep, you are absolutely right. I didn't know about this difference, but it makes sense. In this case it works perfectly. Thanks!

jreklund commented 3 years ago

Thanks for the code, looks promising. Will take it for a spin later.

jreklund commented 3 years ago

@duck7000 I re-wrote it to support multiple text and list blocks, in case they change the format in the future. Please check that it works out great for you too.
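The rewritten code itself is not shown in this thread. As a rough idea of what handling multiple text and list blocks could look like, here is a sketch that is not the actual commit: it walks the direct children of a cell in document order, keeps text nodes as plain lines, and prefixes list items with a hyphen. The markup, the hyphen prefix, and the variable names are all assumptions.

```php
<?php
$doc = new DOMDocument();
// Illustrative fragment with two text blocks and two lists.
$doc->loadHTML('<div class="soda odd">Intro A:<ul><li>One</li><li>Two</li></ul>Intro B:<ul><li>Three</li></ul></div>');

$xpath = new DOMXPath($doc);
$cell = $xpath->query('//div[@class="soda odd"]')->item(0);

$lines = [];
foreach ($cell->childNodes as $child) {
    if ($child->nodeType === XML_TEXT_NODE) {
        // Plain text directly inside the div (for example a lead-in sentence).
        $text = trim($child->nodeValue);
        if ($text !== '') {
            $lines[] = $text;
        }
    } elseif ($child->nodeName === 'ul') {
        // Each list item becomes its own line with a hyphen in front.
        foreach ($child->getElementsByTagName('li') as $li) {
            $lines[] = '- ' . trim($li->nodeValue);
        }
    }
}

// Double quotes so that "\n" is a real line break.
$version = implode("\n", $lines);
echo $version, PHP_EOL;
```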

duck7000 commented 3 years ago

@jreklund Thanks it works great.