tboothman / imdbphp

PHP library for retrieving film and tv information from IMDb
252 stars 85 forks source link

Method quotes returns nothing #38

Closed duck7000 closed 8 years ago

duck7000 commented 8 years ago

This method is broken due to site changes i guess

page source looks like this:

<div id="qt0396883" class="quote soda sodavote odd" ><div class="sodatext">
<p>
<a href="/name/nm0000131/?ref_=tt_trv_qu"
><span class="character">Mike Enslin</span></a>:
[<span class="fine">Olin gives Enslin the room key</span>]
Most hotels have switched to magnetics. An actual key. That's a nice touch, it's antiquey. </p>

the preg_match like this

preg_match_all('!class="quote soda (odd|even)"\s*><p>\s*(.*?)\s*</p>\s*<div class=!ims',str_replace("\n"," ",$this->page["Quotes"]),$matches)

I tried to fix it myself but my skills with preg_match are slim. obviously the class name is wrong but that is as far as i get

Thanks Ed

IzzySoft commented 8 years ago

From what you've quoted, try:

preg_match_all('!<div class="sodatext">\s*<p>\s*(.*?)\s*</p>!ims',str_replace("\n"," ",$this->page["Quotes"]),$matches)

Though that might give you only the first paragraph of each quote (your quoted code doesn't show the "next element"). If you don't mind having the <p> within:

preg_match_all('!<div class="sodatext">\s*(.*?)\s*</div>!ims',str_replace("\n"," ",$this->page["Quotes"]),$matches)

might be a better fit. (not verified, just inspired by the source given in your question)

duck7000 commented 8 years ago

Thanks for the quick reply

I tested with this movie: http://www.imdb.com/title/tt0450385/quotes?ref_=ttalt_ql_4 I'm sorry if i'm not detailed enough but here is the whole code

<div id="qt0396883" class="quote soda sodavote odd" ><div class="sodatext">
<p>
<a href="/name/nm0000131/?ref_=tt_trv_qu"
><span class="character">Mike Enslin</span></a>:
[<span class="fine">Olin gives Enslin the room key</span>]
Most hotels have switched to magnetics. An actual key. That's a nice touch, it's antiquey. </p><p>
<a href="/name/nm0000168/?ref_=tt_trv_qu"
><span class="character">Gerald Olin</span></a>:
We have magnetic cards also, but electronics don't seem to work in 1408. Hope you don't have a pacemaker. </p><p>
<a href="/name/nm0000131/?ref_=tt_trv_qu"
><span class="character">Mike Enslin</span></a>:
[<span class="fine">into his tape recorder</span>]
General manager claims that the phantom in room interferes... </p><p>
<a href="/name/nm0000168/?ref_=tt_trv_qu"
><span class="character">Gerald Olin</span></a>:
I have *never* used the word "phantom." </p><p>
<a href="/name/nm0000131/?ref_=tt_trv_qu"
><span class="character">Mike Enslin</span></a>:
Oh, I'm sorry. Uh, spirit? Specter? </p><p>
<a href="/name/nm0000168/?ref_=tt_trv_qu"
><span class="character">Gerald Olin</span></a>:
No, you misunderstand. Whatever's in 1408 is nothing like that. </p><p>
<a href="/name/nm0000131/?ref_=tt_trv_qu"
><span class="character">Mike Enslin</span></a>:
Then what is it? </p><p>
<a href="/name/nm0000168/?ref_=tt_trv_qu"
><span class="character">Gerald Olin</span></a>:
It's an evil fucking room. </p></div>

I will try your second solution, if that works fine i strip those p tags off in my program Thanks Ed

duck7000 commented 8 years ago

both preg_match_all works, but give a foreach error invalid argument. I changed the foreach to index 1 and that works fine!

however the method is replacing line breaks with space so all line breaks within each paragraph are also gone..

IzzySoft commented 8 years ago

New lines are just for code formatting here. Especially with the second variant, where you get full HTML returned (including <P>aragraphs, the \newline characters should not matter.

It might just be a quick "work-around" for you. I no longer maintain the code (I did for 12 years, but a while ago Tom has taken over; currently I'm not even use IMDBPHP myself) – so for the final fix, better let @tboothman (Tom) decide.

duck7000 commented 8 years ago

Oke fair enough i'll wait for tbootman to reply

ps the first preg_match_all does indeed capture only the first paragrah, the second all inside the div for now i use the second one as a work arround

Thanks Ed

IzzySoft commented 8 years ago

Ah, it's Ed :) Yepp, I thought so when setting up the "quick-fix regexes". Tom will do the fine-tuning and provide a real solution – while you (and others finding this issue here) at least have a work-around until then.

tboothman commented 8 years ago

Ahh, I see how it is. You get the glory and I write the tests :P

tboothman commented 8 years ago

That output is really dirty. Really each quote should be split into lines and which character said them and who the actor is. I'm not particularly interested in doing the grunt work on that though and it'd change the return signature.

A 'quote' from the matrix

<p>
<a href="/name/nm0000206/?ref_=tt_trv_qu"
><span class="character">Neo</span></a>:
I thought it wasn't real </p><p>
<a href="/name/nm0000401/?ref_=tt_trv_qu"
><span class="character">Morpheus</span></a>:
Your mind makes it real </p><p>
<a href="/name/nm0000206/?ref_=tt_trv_qu"
><span class="character">Neo</span></a>:
If you're killed in the matrix, you die here? </p><p>
<a href="/name/nm0000401/?ref_=tt_trv_qu"
><span class="character">Morpheus</span></a>:
The body cannot live without the mind </p>
IzzySoft commented 8 years ago

Ahh, I see how it is. You get the glory and I write the tests :P

If you insist, I agree with that :smiley_cat:

Honestly: I just wanted to provide a "quick work-around", giving you time for a good solution. What I wrote is just a quick-and-dirty hack, and certainly needs some "fine-tuning" at least (and rather a lot of the latter, I'm afraid) …

duck7000 commented 8 years ago

So this is the solution for now? I agree that the output is rough though..

i think that , if it's possible, to somehow capture all paragraphs in the div's would be a nice solution? And then split the capture in name, imdbid and comment?

I understand that would be a lot of more work though

IzzySoft commented 8 years ago

@duck7000 as I understand Tom, it means he's still working on a "good solution" and considers mine a "usable work-around, not clean enough to be accepted as solution" – as I do. I fully trust him to come up with something much better and cleaner than my "quick hack", which certainly will have a "Wow!" effect. I know why I handed over the project to Tom: he's at least as perfectionist as I am, so no worries :stuck_out_tongue_winking_eye:

Give him a little time, though – IMDB is tricky, and all of us want something that stays working a little longer. Perfectionists' work takes a bit more – but then it's, well, perfect :smiley_cat:

tboothman commented 8 years ago

Well ... I've 'fixed' the method back to how it was before. I don't really think what it's producing is particularly good but that is what the method says it does.

If you want to make a method that does something more useful I wouldn't be opposed to writing some tests for it. I think I'd recommend using xmldom / xpath for this one though because it's quite nicely structured. http://www.imdb.com/title/tt0133093/quotes has quite a few variations including stage directions as its own line.

duck7000 commented 8 years ago

Oke fair enough, the method now works like it always did i guess, although i never used it before.. I'm upgrading my program so i thought it would be nice to get as much info from imdb as possible.

So i accept that you both wouldn't spend the time to make it better Thanks for the fix and input