Weird issue in rendering HTML

Hey, I'm writing a scraping library that also uses selectolax as its parser. The problem is that our tests fail with this library because of some strange behaviour. Perhaps a bug? :)

As shown in the picture, we're passing that HTML to the parser, but

Why is the table empty? We didn't structure the tags like that. In our input the UL is inside the table, but in the parsed output it isn't. I'm not sure why the input gets modified.
Why is a `<head>` tag inserted by this library? I can see you have an attribute for head, but can't you initialize it as None if there's no head in the HTML?

rushter commented 1 year ago

Can you post the original HTML snippet (as text)?

rushter commented 1 year ago

I don't see any problems here (visually, looking at the screenshot). It's not allowed to have ul tags inside a table. If you open such HTML in Chrome, it will most likely be transformed in a similar way. That's how malformed HTML is handled by parsers and browsers.

ManiMozaffar commented 1 year ago

> I don't see any problems here (visually, looking at the screenshot). It's not allowed to have ul tags inside a table. If you open such HTML in Chrome, it will most likely be transformed in a similar way. That's how malformed HTML is handled by parsers and browsers.

Hmm, I understand what you're saying, but is there a way we can avoid it? I'm writing a library called FastCrawler. What we want to do is offer a pydantic interface for data models, like this:
```
from fastcrawler import CssField, SelectoLaxProcessor, BaseModel

class Person(BaseModel):
    name: str = CssField(query="td:nth-child(1)::text", processor=SelectoLaxProcessor)
    id: int = CssField(query="td:nth-child(2)::text", processor=SelectoLaxProcessor)

class PersonTable(BaseModel):
    persons: list[Person] = CssField(query="table tr", processor=SelectoLaxProcessor)
```
So I need to iterate over the HTML elements and parse them accordingly. I wouldn't expect the parsed result to differ from the string I'm passing in.

rushter commented 1 year ago

No, the parser works the same way as browsers, so it's not possible, unfortunately. Some old browsers won't even display part of this HTML.
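
For readers who only need the tags in the order they were written, rather than a browser-style DOM, one option outside selectolax is a plain tokenizer. The sketch below uses Python's standard-library `html.parser`, which reports tags as it encounters them and does not relocate or insert elements. The `NestingRecorder` class, the `record_nesting` helper, and the sample HTML are all hypothetical (the original snippet was only shared as a screenshot). This is an illustration under those assumptions, not something selectolax or FastCrawler provides.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NestingRecorder(HTMLParser):
    """Records every start tag with the tags that were open around it, as written."""

    def __init__(self):
        super().__init__()
        self.stack = []  # tags currently open, in source order
        self.seen = []   # (tag, tuple of ancestor tags) for each start tag

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, tuple(self.stack)))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def record_nesting(html):
    """Return the (tag, ancestors) pairs for an HTML string, without any correction."""
    recorder = NestingRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder.seen


# Hypothetical input: a ul placed directly inside a table.
sample = "<table><ul><li>Alice</li></ul><tr><td>Bob</td><td>1</td></tr></table>"
for tag, ancestors in record_nesting(sample):
    print(" > ".join(ancestors + (tag,)))
```

This keeps the `ul` under `table` exactly as written, but it is not a DOM: CSS queries such as `table tr` still need a real parser, and any HTML5-compliant parser will apply the corrections described above.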

rushter / selectolax

Weird issue in rendering HTML #94