scrapinghub / extruct

Extract embedded metadata from HTML markup
BSD 3-Clause "New" or "Revised" License

Adding twitter tags #179

Open platelminto opened 3 years ago

platelminto commented 3 years ago

Would there be any interest in adding Twitter Card tags (detailed here)? If so, I'd be willing to work on this and submit a pull request.

lopuhin commented 3 years ago

Hi @platelminto, that could be a great addition. Do you have some examples of pages with this markup, and example outputs? Do you have an estimate of how popular this markup is?

platelminto commented 3 years ago

Example pages:

https://shop-eu.kurzgesagt.org/ with markup:

<meta name="twitter:site" content="@kurz_gesagt">

  <meta name="twitter:card" content="summary">

  <meta name="twitter:title" content="kurzgesagt shop">
  <meta name="twitter:description" content="The official kurzgesagt online shop. Merch created with love. Posters, notebooks, clothes, plushies and more from the kurzgesagt universe.">

    <meta name="twitter:image" content="https://cdn.shopify.com/s/files/1/0252/6822/4088/t/64/assets/logo_twitter.png?v=14636856715189202634" />

https://store.taylorswift.com/products/i-would-die-for-you-in-secret-hoodie with markup:

<meta name="twitter:card" content="summary"><meta name="twitter:title" content="“i would die for you in secret” hoodie">
  <meta name="twitter:description" content="FOLKLORE ALBUM COLLECTION

*please note we are doing our best to deliver your order as fast as possible, however, we may experience delays somewhere along the way as we try to keep everyone safe.Black hooded sweatshirt featuring &quot;folklore album&quot; printed in copper glitter on front and photo of Taylor Swift printed on back along with &quot;All these people think love&#39;s for show, but I would die for you in secret&quot; lyrics in copper glitter. 
100% cottondepiction of this product is a digital rendering and for illustrative purposes only. actual product detailing may vary.Taylor Swift®©2021 TAS Rights Management, LLCUsed By Permission. All Rights Reserved. 
">
  <meta name="twitter:image" content="https://cdn.shopify.com/s/files/1/0011/4651/9637/products/dieforyouhoodiefront_600x600_crop_center.png?v=1627046674">

https://github.com/ with markup:

    <meta property="twitter:site" content="github">
    <meta property="twitter:site:id" content="13334762">
    <meta property="twitter:creator" content="github">
    <meta property="twitter:creator:id" content="13334762">
    <meta property="twitter:card" content="summary_large_image">
    <meta property="twitter:title" content="GitHub">
    <meta property="twitter:description" content="GitHub is where people build software. More than 65 million people use GitHub to discover, fork, and contribute to over 200 million projects.">
    <meta property="twitter:image:src" content="https://github.githubassets.com/images/modules/open_graph/github-logo.png">
    <meta property="twitter:image:width" content="1200">
    <meta property="twitter:image:height" content="1200">

https://www.billboard.com/ with markup:

<meta data-rh="true" name="twitter:site" content="@billboard" />
<meta data-rh="true" property="og:site_name" content="Billboard" />
<meta data-rh="true" property="og:url" content="https://www.billboard.com/" />
<meta data-rh="true" name="og:image" property="og:image" content="https://static.billboard.com/files/2019/07/billboard-logo-b-20-billboard-1548-1092x722-1598619661-compressed.jpg" />
<meta data-rh="true" name="og:image:width" property="og:image:width" content="1092" />
<meta data-rh="true" name="og:image:height" property="og:image:height" content="722" />
<meta data-rh="true" name="og:title" property="og:title" content="Billboard - Music Charts, News, Photos &amp; Video" />
<meta data-rh="true" name="twitter:title" property="twitter:title" content="Billboard - Music Charts, News, Photos &amp; Video" />

These are just a few I found at random. The markup looks extremely popular (on par with Open Graph and JSON-LD).

lopuhin commented 3 years ago

Thanks for the examples. It looks quite popular indeed, +1 that it's useful.

Also, it seems that this is already somewhat supported. For example, this works:

>>> extruct.extract('<!doctype html><html><head><meta property="twitter:card" content="summary">')
{'microdata': [],
 'json-ld': [],
 'opengraph': [],
 'microformat': [],
 'rdfa': [{'@id': '',
   'https://dev.twitter.com/cards#card': [{'@value': 'summary'}]}]}

But this, with name instead of property, does not:

>>> extruct.extract('<!doctype html><html><head><meta name="twitter:card" content="summary">')
{'microdata': [],
 'json-ld': [],
 'opengraph': [],
 'microformat': [],
 'rdfa': []}
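
To make the name/property difference concrete, here is a minimal sketch of a standalone extractor that accepts both attribute forms. It uses only the standard library's html.parser and is not how extruct does or would implement this; the names TwitterCardParser and extract_twitter_tags are hypothetical, and keeping only the first value per key is an assumption made for simplicity.

```python
from html.parser import HTMLParser


class TwitterCardParser(HTMLParser):
    """Collect <meta> tags whose name or property starts with 'twitter:'."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        content = attr.get("content")
        if content is None:
            return
        # Pages in the wild use either name="twitter:..." or
        # property="twitter:...", so check both attributes.
        for candidate in (attr.get("name"), attr.get("property")):
            if candidate and candidate.startswith("twitter:"):
                # Assumption: keep the first value seen for each key.
                self.tags.setdefault(candidate, content)
                break


def extract_twitter_tags(html):
    """Return a dict mapping 'twitter:*' keys to their content values."""
    parser = TwitterCardParser()
    parser.feed(html)
    parser.close()
    return parser.tags
```

Called on markup like the examples above, extract_twitter_tags returns a flat dict of the twitter:* keys regardless of which attribute carried them, and ignores og:* tags.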

stephentgrammer commented 2 years ago

👍 @platelminto anything I can do to help this along? Otherwise, I'll be building my own extraction for twitter cards. Unless anyone knows of another package that handles twitter cards already?

blackhat-7 commented 2 years ago

I have added Twitter Card support in pull request #196. Please let me know if this works.