This is an extractor for arstechnica.com. A few notes:
I removed the contentOnly: true option from extractorOpts in collect-all-pages.js because it resulted in next_page_url always being null on the second page of an article.
Articles from this site are often paginated, but I was unable to write a CSS selector that reliably finds the next page: on the last page, the link such a selector matches points to the previous page, so the previous page would be treated as the next one. The parser still appears to find the next page on its own, without this extractor supplying a selector, as long as the fallback option is left at its default value of true.
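
To illustrate the first note, here is a toy model of a pagination loop. It is only a sketch and not the code in collect-all-pages.js: the in-memory pages object, the extractPage helper, and the 26-page cap are all hypothetical stand-ins, and the sketch simply assumes that a content-only extraction returns no next_page_url.

```javascript
// Hypothetical in-memory "site": each URL maps to its content and next link.
const pages = {
  '/article': { content: 'Page 1. ', next: '/article/2' },
  '/article/2': { content: 'Page 2. ', next: '/article/3' },
  '/article/3': { content: 'Page 3.', next: null },
};

// Stand-in for the per-page extraction step (an assumption, not the real API).
// In this model, contentOnly means only the content is returned, so
// next_page_url is never populated.
function extractPage(url, { contentOnly = false } = {}) {
  const page = pages[url];
  if (contentOnly) {
    return { content: page.content, next_page_url: null };
  }
  return { content: page.content, next_page_url: page.next };
}

// Simplified loop: follow next_page_url until it is null.
function collectAllPages(firstUrl, extractorOpts = {}) {
  // The first page is always extracted in full.
  let result = extractPage(firstUrl);
  let content = result.content;
  let pageCount = 1;
  const seen = new Set([firstUrl]);

  // The cap of 26 is an arbitrary safety limit chosen for this sketch.
  while (
    result.next_page_url &&
    !seen.has(result.next_page_url) &&
    pageCount < 26
  ) {
    const url = result.next_page_url;
    seen.add(url);
    // Later pages are extracted with the options under test.
    result = extractPage(url, extractorOpts);
    content += result.content;
    pageCount += 1;
  }
  return { content, pageCount };
}

const withContentOnly = collectAllPages('/article', { contentOnly: true });
const withoutContentOnly = collectAllPages('/article');

console.log(withContentOnly.pageCount); // 2
console.log(withoutContentOnly.pageCount); // 3
```

In this model, passing contentOnly: true stops collection after the second page, because that page's result carries no next_page_url. Leaving the option out lets the loop reach the third page.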