nolanw / HTMLReader

A WHATWG-compliant HTML parser in Objective-C.

firstNodeMatchingSelector returns nil when looking for a node that exists #61

Open kivu opened 8 years ago

kivu commented 8 years ago

Hi, I'm trying to parse an HTML document to get two pieces of text from its tags: "Some text to display.image_name_to_display.jpg". I use this code:

HTMLDocument *document = [HTMLDocument documentWithString:self.content]; // content is the HTML above
NSString *handAndImageStr = [document firstNodeMatchingSelector:@"hand"].textContent;
if (handAndImageStr) {
    NSString *imgStr = [document firstNodeMatchingSelector:@"image"].textContent;
}

After this, imgStr is null instead of "image_name_to_display.jpg".

I'm using HTMLReader 0.9.4

nolanw commented 8 years ago

Hello! Did your HTML make it into the issue intact? If not, try surrounding it in backticks or triple-backticks to preserve the formatting.

Otherwise I'm left to guess at what's going on. Is it possible that it's an img element, not an image element, that you're looking for? And <img> doesn't generally have any text content, so I'm suspicious of that too. Is it possible you're looking to get at the src attribute? i.e. [document firstNodeMatchingSelector:@"img"][@"src"].

Let me know if any of that is helpful, or if I've misunderstood the HTML you're trying to scrape!

kivu commented 8 years ago

Ahh, sorry, I didn't notice that my tags were gone ;/ This is my HTML with tags:

<hand>Text to display <image>image to display.jpg</image></hand>

I tried [document firstNodeMatchingSelector:@"image"][@"src"] and [document firstNodeMatchingSelector:@"img"][@"src"], but neither works.

nolanw commented 8 years ago

I'm still a bit suspicious that your text is actually HTML. Is it possible it's actually XML, or something else entirely?

If you put <image> into an HTML document, it'll get parsed as if you put <img>. You'll probably notice that [document firstNodeMatchingSelector:@"image"] returns nil. This is why.

Additionally, <img> elements simply aren't allowed to have text or child elements or anything like that. Anything you try to put inside <img> gets moved outside of it.

Putting the above two points together, if your document looks (in part) like this:

Ahoy <img>there</img> sailor

it actually gets parsed as if you wrote this:

Ahoy <img />there sailor

(See how the "there" popped out of the <img>?) Unfortunately, if you just want the text that looked like it was between <image> and </image>, you probably can't do it reliably.

I hope that all made sense, I realize it's pretty confusing. Can you share the full document you're trying to parse (obfuscating any private data of course)? Maybe I can think of a more suitable tool.

kivu commented 8 years ago

Thanks for your fast answer. I had to check what exactly the app received from the server, and you were right, this is not HTML :( The app received a dictionary with some XML text:

{
    elements = (
        {
            text = "<hand>some text to display <image> file_name.jpg </image></hand>";
        }
    );
    id = "204";
    time = "2016-02-11 12:15:00";
    timeSort = "2016-02-11 12:15:00";
},

The app then parsed that text to get the contents of the "hand" and "image" tags. While checking what might be causing this issue, I found that when the app used the old version 0.5.9, the text from the "image" tag was somehow parsed without problems. I will try to get the image name out some other way.

nolanw commented 8 years ago

It looks like XML to me, so you could try using NSXMLParser (built in to iOS and OS X) or a library like KissXML (there are many, many XML libraries for iOS and OS X, that's just one I looked up).

Are you saying the current version of HTMLReader parses that text differently from version 0.5.9? If so I should take a look at that, there might be a bug there.
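
The NSXMLParser suggestion above could be sketched roughly as follows. This is only a minimal illustration, assuming the text is well-formed XML like the "hand"/"image" snippet in the dictionary; TagTextCollector and TextInsideTag are hypothetical names made up for this example, not part of Foundation or HTMLReader.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of Foundation or HTMLReader): collects the
// text found directly inside each element, keyed by element name.
@interface TagTextCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, readonly) NSMutableDictionary<NSString *, NSMutableString *> *textByTag;
@end

@implementation TagTextCollector {
    NSMutableArray<NSString *> *_openTags; // stack of currently open elements
}

- (instancetype)init {
    if ((self = [super init])) {
        _textByTag = [NSMutableDictionary dictionary];
        _openTags = [NSMutableArray array];
    }
    return self;
}

- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName
    attributes:(NSDictionary<NSString *, NSString *> *)attributeDict {
    [_openTags addObject:elementName];
    if (!_textByTag[elementName]) {
        _textByTag[elementName] = [NSMutableString string];
    }
}

- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    // Credit text only to the innermost open element, so the text of
    // <image> is not mixed into the text of <hand>.
    NSString *current = _openTags.lastObject;
    if (current) {
        [_textByTag[current] appendString:string];
    }
}

- (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName {
    [_openTags removeLastObject];
}
@end

// Hypothetical convenience function: returns the trimmed text directly inside
// the named element, or nil if parsing fails or the element never appears.
static NSString *TextInsideTag(NSString *xml, NSString *tagName) {
    NSData *data = [xml dataUsingEncoding:NSUTF8StringEncoding];
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
    TagTextCollector *collector = [TagTextCollector new];
    parser.delegate = collector;
    if (![parser parse]) {
        return nil;
    }
    return [collector.textByTag[tagName] stringByTrimmingCharactersInSet:
            [NSCharacterSet whitespaceAndNewlineCharacterSet]];
}
```

With the text value from the dictionary above, TextInsideTag(text, @"image") should give file_name.jpg and TextInsideTag(text, @"hand") should give the display text without the image name. Unlike an HTML parser, an XML parser attaches no special meaning to an element named image, so its text stays inside it.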