rchipka / node-osmosis

Web scraper for NodeJS

content of xml <link> tag not extracted? #252

Open anneb opened 4 years ago

anneb commented 4 years ago

This RSS feed (the URL assigned to baseUrl in the code below) returns a set of <item></item> elements. Every <item> element has subelements such as <title>, <description> and <link>.

I am able to extract the content of <title> and <description>, but not that of <link>.

Am I missing something? Is this a feature or a bug? <link> is also a valid HTML element. Maybe osmosis is confusing XML with HTML?

I am using the following code:

const osmosis = require('osmosis');
const baseUrl = 'https://zoek.officielebekendmakingen.nl/rss?q=(available%3e=2020-01-14%20and%20available%3c=2020-01-14)and(((publicationName==%22Tractatenblad%22))or((publicationName==%22Staatsblad%22))or((publicationName==%22Staatscourant%22))or((publicationName==%22Gemeenteblad%22))or((publicationName==%22Provinciaal%20blad%22))or((publicationName==%22Waterschapsblad%22))or((publicationName==%22Blad%20gemeenschappelijke%20regeling%22)))';

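// Collected items, one object per <item> element
const result = [];

osmosis
    .get(baseUrl)
    .find('item')               // select every <item> in the feed
    .set({
        link: 'link',           // comes back empty, unlike the other two
        title: 'title',
        summary: 'description'
    })
    .data(res => result.push(res))
    .done(() => console.log(result));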

The output of the above code shows titles and summaries but all links are empty.
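
As a possible stopgap while the cause is unclear, the three fields could be pulled out of the raw feed text without going through osmosis at all. The sketch below is dependency-free and uses regular expressions; `childText` and `parseRssItems` are hypothetical helper names, not part of osmosis, and the approach assumes a simple, well-formed RSS feed (no nested `<item>` elements, no entity decoding).

```javascript
// Hypothetical helper (not an osmosis API): return the text content of the
// first <tag>...</tag> child found inside one <item> block, or '' if absent.
function childText(itemXml, tag) {
    const re = new RegExp('<' + tag + '(?:\\s[^>]*)?>([\\s\\S]*?)</' + tag + '>', 'i');
    const match = re.exec(itemXml);
    if (!match) return '';
    // Unwrap a CDATA section if the whole value is wrapped in one, then trim.
    // Note: XML entities such as &amp; are left as-is in this sketch.
    return match[1].replace(/^\s*<!\[CDATA\[([\s\S]*?)\]\]>\s*$/, '$1').trim();
}

// Hypothetical helper: split the feed into <item> blocks and build the same
// { link, title, summary } objects the osmosis code was meant to produce.
function parseRssItems(xml) {
    const items = [];
    const itemRe = /<item(?:\s[^>]*)?>([\s\S]*?)<\/item>/gi;
    let m;
    while ((m = itemRe.exec(xml)) !== null) {
        items.push({
            link: childText(m[1], 'link'),
            title: childText(m[1], 'title'),
            summary: childText(m[1], 'description')
        });
    }
    return items;
}
```

The feed body can be fetched with any HTTP client and passed to `parseRssItems` as a string. A real XML parser would be more robust than regular expressions; this is only meant to confirm that the <link> text is present in the feed itself.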

anneb commented 4 years ago

This is likely a duplicate of (currently still open) https://github.com/rchipka/node-osmosis/issues/204