jamietre / CsQuery

CsQuery is a complete CSS selector engine, HTML parser, and jQuery port for C# and .NET 4.

Store parser position information in dom nodes #159

Open ejsmith opened 10 years ago

ejsmith commented 10 years ago

I have an application that needs to know where each node starts and ends relative to the source HTML content. I know that adding this information would cost some memory in the DOM structure, but it could also be quite valuable. Any chance you would consider adding it?

jamietre commented 10 years ago

There was another request for a similar feature, and I shot it down on the basis of resource use. Adding a reference to each node costs 8 bytes per node, which can make a big difference on large structures (or, more commonly, in high-volume situations, which are not uncommon in web scraping applications). So I've tried to keep the footprint of each node as minimal as possible.

However, there is no reason that you couldn't create a structure that inherits from the core CsQuery structures DomObject and DomElement. I haven't actually looked at the DOM code in a long time, so it's possible this could be difficult to do, but the HTML parser itself is completely DOM agnostic, and it would be fairly straightforward to implement a tree builder using any type of node you like. If your nodes inherit from the core CsQuery structures, then they should work just fine in CsQuery as well. The major caveat here is that you can't simply create something that implements the interfaces; you will actually need to inherit from DomObject and DomElement, since there is tight coupling to these classes in the code. But everything inherits from DomObject; if you use these as base classes and implement the appropriate interfaces for other types of DOM nodes, such as text, it should work fine.
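
To make the subclassing idea concrete, here is a rough, library-independent sketch of what a position-aware node could look like. Everything in it is an assumption for illustration: `ElementBase` is a stand-in for a library base class (it is not the real CsQuery `DomElement`, whose constructors and members may differ), and `PositionedElement`, `SourceStart`, `SourceEnd` and `SliceOf` are hypothetical names. A custom tree builder would be the piece that supplies the offsets as it creates each node.

```csharp
using System;

// Stand-in for a library node base class. This is NOT the real CsQuery
// DomElement; it only exists so the sketch compiles on its own.
public class ElementBase
{
    public string NodeName { get; }

    public ElementBase(string nodeName)
    {
        NodeName = nodeName;
    }
}

// Hypothetical subclass that carries the node's offsets in the source HTML.
public class PositionedElement : ElementBase
{
    // Index of the first character of the node in the source.
    public int SourceStart { get; }

    // Index one past the last character of the node (exclusive).
    public int SourceEnd { get; }

    public PositionedElement(string nodeName, int sourceStart, int sourceEnd)
        : base(nodeName)
    {
        if (sourceStart < 0)
            throw new ArgumentOutOfRangeException(nameof(sourceStart));
        if (sourceEnd < sourceStart)
            throw new ArgumentOutOfRangeException(nameof(sourceEnd));

        SourceStart = sourceStart;
        SourceEnd = sourceEnd;
    }

    // Returns the slice of the original HTML that this node was parsed from.
    public string SliceOf(string source)
    {
        return source.Substring(SourceStart, SourceEnd - SourceStart);
    }
}

public static class Demo
{
    public static void Main()
    {
        const string html = "<p>Hello <b>world</b></p>";

        // In a real tree builder the parser would report these offsets;
        // here they are computed by hand for the demonstration.
        int start = html.IndexOf("<b>", StringComparison.Ordinal);
        int end = html.IndexOf("</b>", StringComparison.Ordinal) + "</b>".Length;

        var bold = new PositionedElement("b", start, end);
        Console.WriteLine(bold.SliceOf(html)); // prints "<b>world</b>"
    }
}
```

Only nodes built as the subclass pay for the two extra fields, so the footprint of the core node types stays unchanged for everyone who doesn't need positions.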