postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0

How to `clone` the `video` portion of the HTML page in order to extract and keep it intact? #615

Open raphael10-collab opened 3 years ago

raphael10-collab commented 3 years ago

How to clone the video portion of the HTML page in order to extract and keep it intact?

For example, from this URL: https://abcnews.go.com/Politics/arizona-gov-doug-ducey-signs-law-purge-voters/story?id=77606533&cid=clicksource_4380645_1_heads_hero_live_hero_image

I would like the video to stay playable in the extracted content.

I tried to modify the abcnews.go.com extractor like this:

export const AbcnewsGoComExtractor = {
  domain: 'abcnews.go.com',

  title: {
    selectors: ['.article-header h1'],
  },

  author: {
    selectors: ['.authors'],
    clean: ['.author-overlay', '.by-text'],
  },

  date_published: {
    selectors: ['.timestamp'],
    timezone: 'America/New_York',
  },

  lead_image_url: {
    selectors: [['meta[name="og:image"]', 'value']],
  },

  video: {
    selectors: [
      'inline-video-wrapper',
      'video',
    ]
  },

  content: {
    defaultCleaner: false,

    selectors: [
      '.article-copy',
      '#player-api',
      'inline-video-wrapper',
      'video',
    ],
    // Is there anything that is in the result that shouldn't be?
    // The clean selectors will remove anything that matches from
    // the result
    clean: [],
  },
};

But this is the output:

(attached image not reproduced)
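
A possible reason the attempt above returns only the article text: as far as I understand the parser's custom-extractor README, the entries in `content.selectors` are tried in order and the first one that matches is used, so `.article-copy` wins and the video selectors after it are never consulted. Also, `'inline-video-wrapper'` without a leading dot is a tag selector rather than a class selector, and a top-level `video` key is not among the fields I know the parser to extract. Below is a minimal sketch, not a confirmed fix. It assumes that the wrapper really is a class named `inline-video-wrapper` (not verified against the live page) and that a nested array in `content.selectors` is treated as a multi-match group.

```javascript
// Sketch only: the `.inline-video-wrapper` class name and the multi-match
// (nested array) behaviour are assumptions, not verified against the page.
const AbcnewsGoComExtractor = {
  domain: 'abcnews.go.com',

  title: {
    selectors: ['.article-header h1'],
  },

  content: {
    defaultCleaner: false,

    selectors: [
      // Nested array: intended to keep both the copy and the video wrapper.
      ['.article-copy', '.inline-video-wrapper'],
      // Fallback for pages that have no video wrapper.
      '.article-copy',
    ],

    clean: [],
  },
};
```

Even if the selectors match, the player may still be missing: if the page injects the video with client-side JavaScript, the markup may not be in the HTML that the parser fetches.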

I also tried it this way, but it doesn't work either:

      'div.inline-content': $node => {
        if ($node.has('img,iframe,video').length > 0) {
          return $node;
        }
      },
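
A note on the snippet above: as far as I can tell from the parser's custom-extractor README, a function in `transforms` only has an effect if it mutates the node in place or returns a string, which is then used as the new tag name. Returning `$node` itself would change nothing. Here is a minimal sketch of the string-returning form. Renaming to `figure` is my own choice for illustration, and `fakeNode` is a hypothetical stand-in for a cheerio node so that the function can be exercised without the parser installed.

```javascript
// Sketch of a `transforms` entry for the `content` block of a custom extractor.
const transforms = {
  'div.inline-content': $node => {
    // Rename media-bearing wrappers to <figure>; the tag name is illustrative.
    if ($node.has('img,iframe,video').length > 0) {
      return 'figure';
    }
    // Returning nothing leaves the node as it is.
    return undefined;
  },
};

// Hypothetical stand-in for a cheerio node, only to exercise the function here.
const fakeNode = matchCount => ({ has: () => ({ length: matchCount }) });

console.log(transforms['div.inline-content'](fakeNode(1))); // figure
console.log(transforms['div.inline-content'](fakeNode(0))); // undefined
```

Whether this keeps the player working is a separate question, since it only helps if the video markup is present in the fetched HTML in the first place.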

How to clone the video portion of the HTML page in order to extract and keep it intact?

OS: Ubuntu 18.04

Uki19 commented 8 months ago

Are there maybe any updates regarding this?