duck7000 closed this 3 years ago
Do you want to remove all li tags, or do you want to keep some? Take this movie for example: do you want them removed, or what's the purpose?
@jreklund Thanks for your comment.
In the movie you refer to, I want to keep those li tags, but filter out combinations of "<ul><li>" and "</li><li>" that sometimes occur.
I don't remember in which movie that was, though.
array("<ul>", "<li>")
What you suggest will filter out all li and ul tags; then there is no structure anymore and all the text will be appended together.
It would be better to remove all HTML tags, but I don't know how that could be done in a way that keeps the text structured. I thought about replacing them with line breaks, but that didn't work out well.
This movie has the same problem: https://www.imdb.com/title/tt0094651/
I use PHP 7.4 on my server and this works without crashing?
Sounds like a movie with an indentation inside an indentation. If you want to keep all indentation, my suggested settings would suffice, as you tell strip_tags which HTML tags to keep.
Like this one:
strip_tags(string $string, array|string|null $allowed_tags = null): string
Travis only failed on PHP 7.3; PHP 7.4 and PHP 8 worked fine.
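A possible explanation for the 7.3-only failure: the array form of the second strip_tags argument is accepted only as of PHP 7.4, while the string form works on older versions too. A minimal sketch using the string form (the $html sample is made up for illustration, not taken from IMDb):

```php
<?php
// Made-up sample markup, only for illustration.
$html = '<div>Intro:<ul><li>First <b>cut</b></li><li>Second</li></ul></div>';

// String form of the allowed tags: also accepted before PHP 7.4.
// Every tag except <ul> and <li> is stripped; their text content stays.
$kept = strip_tags($html, '<ul><li>');

echo $kept, "\n";
```

Here the div and b tags are removed while the list structure is kept.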
Ideally I would have those alternate versions without any HTML tags, but how can the text stay structured? With a * or # in front of each line?
Maybe I'll take this back to the drawing board... I rewrote this to XPath, and the output is now nodeValue, so all HTML tags are apparently already gone? strip_tags doesn't do anything.
@jreklund
I have brewed something together. It's not ready yet; one part is missing.
/**
 * Get the Alternate Versions for a given movie
 * @return array Alternate Version (array[0..n] of string)
 * @see IMDB page /alternateversions
 */
public function alternateVersions()
{
    if (empty($this->moviealternateversions)) {
        $xpath = $this->getXpathPage("AlternateVersions");
        if ($xpath->evaluate("//div[contains(@id,'no_content')]")->count()) {
            return array();
        }
        $cells = $xpath->query("//div[@class=\"soda odd\" or @class=\"soda even\"]");
        foreach ($cells as $cell) {
            $count = $cell->getElementsByTagName('li')->count();
            if ($count) {
                // here I need only the text from the div, nothing else.
                // start with an empty string for every cell, so items don't carry over
                $listItems = '';
                $items = $cell->getElementsByTagName('li');
                foreach ($items as $key => $value) {
                    $listItems .= '# ' . trim($value->nodeValue);
                    if ($key < $count - 1) {
                        $listItems .= ' ';
                    }
                }
                $this->moviealternateversions[] = $listItems;
            } else {
                $this->moviealternateversions[] = trim($cell->nodeValue);
            }
        }
    }
    return $this->moviealternateversions;
}
If the alternate version contains list items, I process them separately: I add a sign in front (I used #) and close with a line break. This structures the text more or less like a normal list. Is this an acceptable solution?
The only problem I can't seem to solve is the leading text in the surrounding div, like this line in the movie Amsterdamned: "In the television version of the film, scenes deleted include:". It is included in $cell, but I can't find an XPath way to get that part. Can you help me with this one?
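One way to get only the text that sits directly inside the div is the XPath text() node test, evaluated relative to the cell. The following is a minimal sketch, not the code that ended up in the library; the HTML sample and the variable names are assumptions made for illustration:

```php
<?php
// Made-up sample that mimics one "soda" cell: leading text plus a list.
$html = '<div class="soda odd">Scenes deleted include:'
      . '<ul><li>Scene one</li><li>Scene two</li></ul></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$cell = $xpath->query('//div[@class="soda odd"]')->item(0);

// "./text()" selects only text nodes that are direct children of $cell,
// so the text inside the <li> elements is not included.
$leading = '';
foreach ($xpath->query('./text()', $cell) as $textNode) {
    $leading .= $textNode->nodeValue;
}
$leading = trim($leading);

// The list items are collected separately, relative to the same cell.
$items = [];
foreach ($xpath->query('.//li', $cell) as $li) {
    $items[] = trim($li->nodeValue);
}

echo $leading, "\n";
```

Passing $cell as the second argument to query() makes the expression relative to that node rather than to the whole document.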
I finally found a solution. It works, but it feels like a workaround?
Travis has a mind of its own.
Personally I'm more in favour of using a dash (–) instead of implementing HTML, and of using new lines (\n) instead of a non-breaking space, if we are going to manipulate the data. What does @tboothman think about the matter?
@jreklund Thanks for commenting.
I used a bullet sign so it looks like a list item, but any character is fine by me.
line breaks are visual in my output,
character is not, it is shown as space.
And
is not a non breakable space..
https://unicodelookup.com/# /1
New lines (\n) are special characters and need to be written in double quotes, "\n" instead of '\n'. Single-quoted strings are printed almost literally (only \\ and \' are treated as escapes). Double-quoted strings interpret variables, escape sequences, etc.
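The difference is easy to check with a short sketch (the variable names are only illustrative):

```php
<?php
// Single quotes: backslash and "n" stay two separate characters.
$single = 'a\nb';

// Double quotes: \n is interpreted as one newline character.
$double = "a\nb";

// Double quotes also interpolate variables; single quotes do not.
$name = 'IMDb';
$interpolated = "Hello $name";
$literal = 'Hello $name';

echo strlen($single), ' ', strlen($double), "\n";
```

The single-quoted string is four characters long, the double-quoted one three.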
Right... yep, you are absolutely right. I didn't know about this difference; it makes sense. In this case it works perfectly. Thanks!
Thanks for the code, looks promising. Will take it for a spin later.
@duck7000 I re-wrote it to support multiple text and list blocks, in case they change the format in the future. Please check that it works out great for you too.
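For readers who want a picture of what "multiple text and list blocks" could mean, here is a rough sketch. It is not the actual rewrite from this thread: flattenCell is a hypothetical helper, the sample HTML is made up, and a plain hyphen is used as the list marker.

```php
<?php
/**
 * Hypothetical helper: turn one cell into plain text, keeping the order
 * of its text blocks and list blocks. List items get a "- " prefix and
 * every block or item ends up on its own line.
 */
function flattenCell(DOMNode $cell): string
{
    $lines = [];
    foreach ($cell->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            // A text block directly inside the cell.
            $text = trim($child->nodeValue);
            if ($text !== '') {
                $lines[] = $text;
            }
        } elseif ($child->nodeName === 'ul') {
            // A list block: one output line per list item.
            foreach ($child->childNodes as $item) {
                if ($item->nodeName === 'li') {
                    $lines[] = '- ' . trim($item->nodeValue);
                }
            }
        }
    }
    return implode("\n", $lines);
}

// Made-up sample with two text blocks and two lists.
$doc = new DOMDocument();
$doc->loadHTML('<div>Intro one:<ul><li>A</li><li>B</li></ul>Intro two:<ul><li>C</li></ul></div>');
$cell = (new DOMXPath($doc))->query('//div')->item(0);

$flat = flattenCell($cell);
echo $flat, "\n";
```

Walking childNodes in document order keeps each leading text next to its own list, which a single getElementsByTagName('li') call would not do.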
@jreklund Thanks, it works great.