laurengarcia / url-metadata

NPM module: Request a url and scrape the metadata from its HTML using Node.js or the browser.
https://www.npmjs.com/package/url-metadata
MIT License

"Fails" with multiple tags of the same name #44

Closed vincerubinetti closed 11 months ago

vincerubinetti commented 3 years ago

There are cases when websites have multiple <meta> tags with the same name attribute:

[Screenshot: page source with several <meta> tags sharing the same name attribute]

I'm not sure if this is valid/proper HTML, but I've seen several sites do it, especially in academia.

Since this library returns metadata as an object with the name attributes as the keys, if there are multiple tags with the same name, it only returns the last one. In the above screenshot, it only returns Cheeseman.
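
To illustrate the overwrite, here is a minimal sketch. It is not the library's actual code; the `collectLastWins` helper and the sample tag values (other than Cheeseman) are hypothetical, and the tags are assumed to arrive as `[name, content]` pairs in document order.

```javascript
// Hypothetical input: [name, content] pairs as they might be read
// from <meta> tags, in document order. Values are placeholders.
const sampleTags = [
  ['citation_title', 'Example Paper'],
  ['citation_author', 'First Author'],
  ['citation_author', 'Second Author'],
  ['citation_author', 'Cheeseman'],
];

// Naive collection keyed by name: assigning to the same key again
// replaces the earlier value, so only the last tag survives.
function collectLastWins(tags) {
  const metadata = {};
  for (const [name, content] of tags) {
    metadata[name] = content;
  }
  return metadata;
}

const lastWins = collectLastWins(sampleTags);
console.log(lastWins.citation_author); // "Cheeseman"
```

Any approach that stores one value per key will behave this way, whatever order the tags are parsed in.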

Don't know what a good way to handle this would be. Maybe if there are multiple, just return an array of them instead under the same object key?

laurengarcia commented 1 year ago

Sounds like a good idea. Will do.

laurengarcia commented 11 months ago

Just documenting what I found in terms of specs here: https://www.google.com/intl/en/scholar/inclusion.html#indexing

"The author tag, e.g., citation_author or DC.creator, must contain the authors (and only the actual authors) of the paper. Don't use it for the author of the website or for contributors other than authors, e.g., thesis advisors. Author names can be listed either as "Smith, John" or as "John Smith". Put each author name in a separate tag and omit all affiliations, degrees, certifications, etc., from this field. At least one author tag is required for inclusion in Google Scholar."

The WHATWG HTML spec says there are other kinds of meta tags that should only be included once per page. I think the best course of action here is to filter meta tags that begin with citation_ and return an array if more than one of a type is found, i.e. citation_author would return ["S. Zachary Schwartz", "Tzer Han Tan"...] in your snippet above. But this behavior would be limited to these citation_ tags.
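
A rough sketch of that rule, assuming the tags have already been read as `[name, content]` pairs in document order. The `collectMetadata` helper and the `description` values are hypothetical, and this is not necessarily how the library implements it.

```javascript
// Hypothetical helper: repeated citation_* tags are gathered into an
// array; every other name keeps the existing last-one-wins behavior.
function collectMetadata(tags) {
  const metadata = {};
  for (const [name, content] of tags) {
    const repeatable = name.startsWith('citation_');
    const seen = Object.prototype.hasOwnProperty.call(metadata, name);
    if (repeatable && seen) {
      const existing = metadata[name];
      // Promote a single string to an array on the second occurrence.
      metadata[name] = Array.isArray(existing)
        ? [...existing, content]
        : [existing, content];
    } else {
      metadata[name] = content;
    }
  }
  return metadata;
}

const collected = collectMetadata([
  ['citation_author', 'S. Zachary Schwartz'],
  ['citation_author', 'Tzer Han Tan'],
  ['description', 'first description'],
  ['description', 'second description'],
]);
console.log(collected.citation_author); // ["S. Zachary Schwartz", "Tzer Han Tan"]
console.log(collected.description); // "second description"
```

One consequence of this design: a citation_ key holds a string when the tag appears once and an array when it appears more than once, so callers would need to handle both shapes.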

laurengarcia commented 11 months ago

Now supported in v3.3.0 https://www.npmjs.com/package/url-metadata?activeTab=versions