Hey,
I'm writing a scraping library that uses selectolax as its parser.
However, our tests fail with this library because of some strange behaviour. Perhaps a bug? :)
As shown in the picture, we're passing that HTML to the parser, but why is the table empty? We didn't structure the tags like that. In our input the ul is inside the table, but in your output it isn't. I'm not sure why you modify the input.
Also, why is a tag inserted by this library? I can see you have an attribute for head, but can't you initialize it as None if there's no head in the HTML?
I don't see any problems here (visually, looking at the screenshot). It's not allowed to have ul tags inside table. If you open such HTML in Chrome, it most likely will be transformed in a similar way. That's how malformed HTML is handled by parsers and browsers.
Hmm, I understand what you're saying, but is there a way to avoid it?
I'm writing a library called FastCrawler.
We want to offer a pydantic interface for data models, like:
    from fastcrawler import CssField, SelectoLaxProcessor, BaseModel

    class Person(BaseModel):
        name: str = CssField(query="td:nth-child(1)::text", processor=SelectoLaxProcessor)
        id: int = CssField(query="td:nth-child(2)::text", processor=SelectoLaxProcessor)

    class PersonTable(BaseModel):
        persons: list[Person] = CssField(query="table tr", processor=SelectoLaxProcessor)
So I need to iterate over the HTML elements and parse them accordingly. I wouldn't expect the parsed result to differ from the string I'm passing in.
Can you post the original HTML snippet (as text)?
No, the parser works the same way as browsers, so it's not possible, unfortunately. Some old browsers won't even display part of this HTML.