postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0

selecting an attribute doesn't seem to work #537

Open thoraxe opened 4 years ago

Expected Behavior

Test article: https://moneymaven.io/mishtalk/economics/lie-of-the-day-this-is-not-a-pandemic-CdOIoPAmbEyglh3Ls6RXKQ

Custom extractor:

export const MoneymavenIoExtractor = {
  domain: 'moneymaven.io',

  title: {
    selectors: [
      'article h1'
    ],
  },

  date_published: {
    selectors: [
      ['meta[name="build:date"]', 'content'],
    ],
  },

  content: {
    selectors: [
      'article'
    ],
  },
}

Using ['meta[name="build:date"]', 'content'] should extract the content attribute of this tag:

<meta name="build:date" content="2020-02-22 00:49:13 +0000">

Current Behavior

In the test, the value is not extracted:

  ● MoneymavenIoExtractor › initial test case › returns the date_published

    AssertionError [ERR_ASSERTION] [ERR_ASSERTION]: null == '2020-02-22 00:49:13 +0000'

      47 |     // Update these values with the expected values from
      48 |     // the article.
    > 49 |     assert.equal(date_published, '2020-02-22 00:49:13 +0000')
         |            ^
      50 |   });
      51 |
      52 |     it('returns the content', async () => {

      at Object.equal (src/extractors/custom/moneymaven.io/index.test.js:49:12)
      at tryCatch (node_modules/regenerator-runtime/runtime.js:62:40)
      at Generator.invoke [as _invoke] (node_modules/regenerator-runtime/runtime.js:288:22)
      at Generator.prototype.<computed> [as next] (node_modules/regenerator-runtime/runtime.js:114:21)
      at asyncGeneratorStep (src/extractors/custom/moneymaven.io/index.test.js:17:103)
      at _next (src/extractors/custom/moneymaven.io/index.test.js:19:194)

Steps to Reproduce

See above

Detailed Description

Running $$('meta[name="build:date"]') in the browser console finds exactly one element, so the selector itself matches the page. It's not clear why the parser isn't picking it up (see the null in the test output above).
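One possible explanation, offered as an assumption rather than a confirmed diagnosis: the parser may normalize meta tags before custom selectors run, renaming the content attribute to value (and property to name). If so, the tag the browser shows is not the tag the extractor sees. The sketch below only models that idea with a plain attribute map; normalizeMetaAttrs and readAttr are hypothetical helper names for illustration, not functions from the parser.

```javascript
// Sketch only: models a <meta> element as a plain attribute map.
// normalizeMetaAttrs is a hypothetical stand-in for the renaming step the
// parser is ASSUMED to apply to meta tags before selectors are evaluated.
function normalizeMetaAttrs(attrs) {
  const renamed = { ...attrs };
  // Assumed renames: content -> value, property -> name.
  for (const [from, to] of [['content', 'value'], ['property', 'name']]) {
    if (from in renamed) {
      renamed[to] = renamed[from];
      delete renamed[from];
    }
  }
  return renamed;
}

// Mimics what a [selector, attribute] pair reads from a matched element:
// the attribute's value, or null when the attribute is absent.
function readAttr(attrs, attr) {
  return attr in attrs ? attrs[attr] : null;
}

// The tag from the page, as the browser console sees it.
const metaInBrowser = {
  name: 'build:date',
  content: '2020-02-22 00:49:13 +0000',
};

// The same tag after the assumed normalization step.
const metaInParser = normalizeMetaAttrs(metaInBrowser);

console.log(readAttr(metaInBrowser, 'content')); // found in the browser
console.log(readAttr(metaInParser, 'content')); // missing after renaming
console.log(readAttr(metaInParser, 'value')); // found under the new name
```

If that assumption holds, the selector to try in the extractor would be ['meta[name="build:date"]', 'value']. The returned date may also be normalized to an ISO 8601 string, in which case the assertion would need to compare against the normalized value rather than the raw attribute text.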

Is this user error?

Other

This site is pretty terrible and appears to intentionally leave things unlabeled. I'm not sure I'll ever be able to provide a valid parser that grabs everything for it. I'll probably carry something in a local fork.

I am using mercury via https://github.com/feedbin/extract