postlight / parser

📜 Extract meaningful content from the chaos of a web page
https://reader.postlight.com
Apache License 2.0

author transforms #502

Closed (nitinthewiz closed this issue 5 years ago)

nitinthewiz commented 5 years ago

I'm trying to build a custom parser for this news site - https://timesofindia.indiatimes.com/india/china-snubs-imran-says-resolve-jk-bilaterally/articleshow/71496416.cms

The author byline section also contains the date, so I thought of using the following transform to keep only the name -

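  author: {
    selectors: ['div.byline'],
    transforms: {
      'div.byline': function getAuthor($node) {
        // Declare the variable locally instead of creating an implicit global.
        const byline = $node.text();
        // Keep only the part of the byline before the first '|'.
        return byline.split('|')[0];
      },
    },
  },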

But the split doesn't seem to work. I'm wondering whether author even supports transforms. I have noticed that clean is used in some custom parsers, but I don't know whether transforms are available for author.

I'm running the code from the master branch on OS X.

In the browser, for the fixture, I can do the following -

$('div.byline').innerText.split('|')[0].trim()

and it seems to work. So just curious.
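The string handling from the browser snippet can be checked on its own, separately from the parser. This is a minimal sketch in plain JavaScript; `extractAuthor` is a hypothetical helper name and the sample byline is made up for illustration, not taken from the fixture.

```javascript
// Hypothetical helper: keep only the part of a byline before the first '|'.
function extractAuthor(byline) {
  // split('|')[0] takes everything before the first pipe (or the whole
  // string when there is no pipe); trim() drops surrounding whitespace.
  return byline.split('|')[0].trim();
}

// Made-up sample input in "name | source | date" form.
console.log(extractAuthor('  Jane Doe | Example News | Oct 9, 2019 ')); // Jane Doe
```

If this works on the raw byline text, the string logic itself is probably not the problem, and the open question is how the parser treats a transform's return value for author.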

nitinthewiz commented 5 years ago

Alternatively, I've realized that the GenericAuthorExtractor does an oddly good job of extracting the name. However, I've not found a single example where it is used as part of a customExtractor, as a sort of mix-and-match where some fields of the extractor are customized and others are not.

Is that even possible? Is it possible for me to say -

  title: {
    selectors: ['h1'],
  },
  author: {
    selectors: GenericAuthorExtractor
  },

Update: I see fallback is an option in the test.js files. Perhaps that's the answer to my woes. 🙄
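As I understand it, the fallback idea is "try the custom selectors first, and hand the field to a generic extractor only when they find nothing". The sketch below is plain JavaScript and does not use the parser's internals; `extractWithFallback`, `customAuthor`, and `genericAuthor` are hypothetical names, and the page object is made-up data.

```javascript
// Hypothetical illustration of "custom first, generic as fallback".
function extractWithFallback(doc, customExtract, genericExtract, fallback = true) {
  const custom = customExtract(doc);
  // A non-empty custom result always wins.
  if (custom) return custom;
  // Otherwise use the generic extractor, unless fallback is switched off.
  return fallback ? genericExtract(doc) : null;
}

// Stand-in extractors that read from a plain object instead of a real page.
const customAuthor = doc => doc.byline || null;
const genericAuthor = doc => doc.metaAuthor || null;

// Made-up data: no byline, so the custom extractor finds nothing.
const page = { metaAuthor: 'Jane Doe' };

console.log(extractWithFallback(page, customAuthor, genericAuthor)); // Jane Doe
console.log(extractWithFallback(page, customAuthor, genericAuthor, false)); // null
```

Under that reading, leaving author out of the custom extractor would let the generic author extraction handle it while title stays customized.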

nitinthewiz commented 5 years ago

Closing, as the primary issue is resolved, though I'd still like to know whether author can have transforms. I noticed that format and timezone were added to date_published.