rushter / selectolax

Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).
MIT License

Weird issue in rendering HTML #94

Closed ManiMozaffar closed 1 year ago

ManiMozaffar commented 1 year ago

Hey, I'm writing a scraping library that uses selectolax as its parser. Our tests fail with this library because of some strange behaviour. Perhaps a bug? :)

As shown in the pictures, we're passing that HTML to the parser, but:

  1. Why is the table empty? We didn't structure the tags like that. In our input the UL sits inside the table, but in your output it doesn't. I'm not sure why the input gets modified.
  2. Why is there a tag inserted by this library? I can see you have an attribute for head, but can't you initialize it as None if there's no head in the HTML?

  [Screenshots: 2023-07-13 at 19 31 27, 2023-07-13 at 19 31 04]
    rushter commented 1 year ago

    Can you post the original HTML snippet (as text)?

    rushter commented 1 year ago

    I don't see any problems here (visually, looking at the screenshot). It's not allowed to have ul tags inside table. If you open such HTML in Chrome, it most likely will be transformed in a similar way. That's how malformed HTML is handled by parsers and browsers.
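
A rough way to see in advance which tags will be relocated: the sketch below uses only Python's standard `html.parser` (not selectolax) to list start tags that appear as direct children of a table but are not handled by the table content rules. The names `TableChildChecker`, `misplaced_table_children`, `TABLE_OK` and `VOID` are hypothetical, and `TABLE_OK` is a simplified approximation of what the HTML tree-construction rules accept inside a table; stray text inside a table is not checked.

```python
from html.parser import HTMLParser

# Simplified assumption: tags a table accepts without relocating them.
TABLE_OK = {"caption", "colgroup", "col", "thead", "tbody", "tfoot",
            "tr", "td", "th", "script", "style", "template"}
# Void elements never get an end tag, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TableChildChecker(HTMLParser):
    """Collects tags written directly inside <table> that do not belong there."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, as written in the source
        self.misplaced = []  # offending direct children of <table>

    def handle_starttag(self, tag, attrs):
        if self.stack and self.stack[-1] == "table" and tag not in TABLE_OK:
            self.misplaced.append(tag)
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def misplaced_table_children(html):
    checker = TableChildChecker()
    checker.feed(html)
    checker.close()
    return checker.misplaced


malformed = "<table><ul><li>a</li></ul><tr><td>b</td></tr></table>"
well_formed = "<table><tr><td><ul><li>a</li></ul></td></tr></table>"
print(misplaced_table_children(malformed))
print(misplaced_table_children(well_formed))
```

A non-empty result suggests that a spec-following parser or a browser will restructure the input, so the tree will not match the string that was passed in.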

    ManiMozaffar commented 1 year ago

    > I don't see any problems here (visually, looking at the screenshot). It's not allowed to have ul tags inside table. If you open such HTML in Chrome, it most likely will be transformed in a similar way. That's how malformed HTML is handled by parsers and browsers.

    Hmm, I understand what you're saying, but is there a way we can avoid it? I'm writing a library called FastCrawler. What we want to do is offer a pydantic interface for data models, like:

    from fastcrawler import CssField, SelectoLaxProcessor, BaseModel

    class Person(BaseModel):
        name: str = CssField(query="td:nth-child(1)::text", processor=SelectoLaxProcessor)
        id: int = CssField(query="td:nth-child(2)::text", processor=SelectoLaxProcessor)

    class PersonTable(BaseModel):
        persons: list[Person] = CssField(query="table tr", processor=SelectoLaxProcessor)

    So I need to iterate over the HTML elements and parse them accordingly. I wouldn't expect the string I'm passing in to be different from what the parser actually holds.

    rushter commented 1 year ago

    No, the parser works the same way as browsers, so it's not possible, unfortunately. Some old browsers won't even display part of this HTML.
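
Since the parser cannot be told to skip this normalization, a wrapping library could instead record which tags the input literally contained and treat anything else in the parsed tree as parser-inserted. This is only a sketch under that assumption: it uses the standard `html.parser` module, and `SourceTagCollector` and `tags_in_source` are hypothetical names, not part of selectolax or FastCrawler.

```python
from html.parser import HTMLParser


class SourceTagCollector(HTMLParser):
    """Records the name of every start tag written in the raw input."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def tags_in_source(html):
    collector = SourceTagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


snippet = "<table><tr><td>Ada</td><td>1</td></tr></table>"
present = tags_in_source(snippet)

# A wrapper could expose head as None when the author never wrote one.
head_was_written = "head" in present
print(sorted(present), head_was_written)
```

With such a set at hand, a wrapper can report a missing head as None even though the parsed tree contains one.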