postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0

Feature proposal - support Microdata #276

Open vhfmag opened 5 years ago

vhfmag commented 5 years ago

I tested mercury-parser against my website out of curiosity and found that it currently doesn't extract at least author and datePublished from Microdata. I believe this feature could extend the tool's reach, and initial support shouldn't be hard to provide, since extractors already rely on selectors to extract metadata.

I say initial support because, for example, a page could have an article and multiple comments, and those could also include metadata. In that case, multiple authors and publish dates would be found and a heuristic would be needed to choose between them.

It seems to me that the current heuristic for selectors is to use the first matched element (correct me if I'm wrong). If so, that approach should work fine here too. If a stricter version is desired, there are libraries that extract Microdata, like microdata-node, that could be used to query for the main content's author and publish date, among other information.
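To make the first-match idea concrete, here is a minimal sketch of it. It is written in Python with only the standard library so it runs on its own; Mercury itself is JavaScript, and none of the names below (`FirstItempropParser`, `extract_microdata`, `SAMPLE`) come from Mercury. It also simplifies the Microdata value rules: only `content` on `meta` and `datetime` on `time` are read from attributes, and every other element contributes its text, even when it is itself an `itemscope`.

```python
from html.parser import HTMLParser

# Simplified Microdata value rules: these tags carry the value in an attribute.
VALUE_ATTRS = {"meta": "content", "time": "datetime"}
# Tags without a closing tag; they must not affect nesting depth.
VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}


class FirstItempropParser(HTMLParser):
    """Hypothetical helper: keeps the first value seen for each wanted itemprop."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.found = {}
        self._props = None  # property names whose text is being collected
        self._depth = 0     # nesting depth inside the collecting element
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._props is not None:
            # Already collecting text: only track nesting.
            if tag not in VOID_TAGS:
                self._depth += 1
            return
        attr_map = dict(attrs)
        # itemprop may hold several space-separated names.
        names = (attr_map.get("itemprop") or "").split()
        props = [n for n in names if n in self.wanted and n not in self.found]
        if not props:
            return
        value_attr = VALUE_ATTRS.get(tag)
        if value_attr and attr_map.get(value_attr):
            for prop in props:
                self.found[prop] = attr_map[value_attr].strip()
        elif tag not in VOID_TAGS:
            self._props, self._depth, self._text = props, 1, []

    def handle_data(self, data):
        if self._props is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._props is None or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:
            value = " ".join("".join(self._text).split())
            for prop in self._props:
                self.found[prop] = value
            self._props = None


def extract_microdata(html, wanted=("author", "datePublished")):
    """Return the first value found for each wanted property, in document order."""
    parser = FirstItempropParser(wanted)
    parser.feed(html)
    parser.close()
    return parser.found


# Made-up page: an article followed by a comment that carries its own metadata.
SAMPLE = """
<article itemscope itemtype="https://schema.org/Article">
  <h1 itemprop="headline">Hello</h1>
  <span itemprop="author">Jane Doe</span>
  <time itemprop="datePublished" datetime="2019-02-01">Feb 1, 2019</time>
  <div itemprop="comment" itemscope itemtype="https://schema.org/Comment">
    <span itemprop="author">Someone Else</span>
    <time itemprop="datePublished" datetime="2019-02-03">Feb 3, 2019</time>
  </div>
</article>
"""

print(extract_microdata(SAMPLE))
# {'author': 'Jane Doe', 'datePublished': '2019-02-01'}
```

The article's values win here only because they come before the comment's in document order. A stricter version would restrict the search to properties that belong to the main item's scope.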

If this feature seems desirable for the project, I would like to work on a PR.

adampash commented 5 years ago

@vhfmag This is an interesting thought. For some context, Mercury is focused primarily on extracting metadata from articles on the web, and while we're not against supporting more content, that's certainly the main focus for now.

With that in mind, I'm completely in favor of supporting microdata for the existing generic parsers where it would make sense.

I'd also prefer doing this extraction without added dependencies, so I'd rather see it incorporated in Mercury's existing extractors vs. pulling in an external dependency like microdata-node.

So yes, if you're interested, please go ahead and work up a PR, keeping in mind that tests are paramount to getting the PR accepted. :smile:

vhfmag commented 5 years ago

> So yes, if you're interested, please go ahead and work up a PR, keeping in mind that tests are paramount to getting the PR accepted

@adampash awesome! I'll be working on it :smile:

> For some context, Mercury is focused primarily on extracting metadata from articles on the web, and while we're not against supporting more content, that's certainly the main focus for now.

I'll stick to adding Microdata support to the existing generic extractors, so that I keep it about extracting article metadata