matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License

Get <br/> from output #229

Closed (alexandre1985 closed this issue 8 years ago)

alexandre1985 commented 8 years ago


<br/> tags disappear from output

I'm getting some <p> elements from a webpage with this command: x(page, '#myid', ['p']). Sometimes the text that I want to show is separated by <p> and everything is fine. But sometimes the text is separated by <br/> tags (or can even be a single <br/> only), and I want to split on those <br/>. The problem is that with my command the output doesn't show any <br/> (even though they exist in the page), so I can't split that string. Is there any way to show the <br/> in the output?

Your environment

var xray = require('x-ray');
var x = xray();
x('http://lifestyle.sapo.pt/astral/previsoes/miguel-de-sousa?signo=carneiro', '#semanal', ['p'])(function(err, data) {
  for (var i = 0; i < data.length; i++) { console.log(data[i]); }
});

Expected behaviour

text<br/>text<br/>text<br/>text

Actual behaviour

texttexttexttext

tokhi commented 8 years ago

You can use filters; it's in the documentation.

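var Xray = require('x-ray');
var x = Xray({
  filters: {
    // Strip every <br>, <br/> or <br /> from the scraped string.
    clean: function (value) {
      // replace() returns a new string, so its result must be returned;
      // the regex with the g flag removes all occurrences, not just the first.
      return typeof value === 'string' ? value.replace(/<br\s*\/?>/gi, '') : value;
    }
  }
});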

Then you can call it as below:

x('http://mat.io', {
  content: 'content | clean'
})

alexandre1985 commented 8 years ago

@tokhi you didn't understand it right. I don't want to remove the <br/>; I want it to be shown! With x-ray as it is, it doesn't show the <br/>'s...

alexandre1985 commented 8 years ago

I'm making some progress. With x(string, '#'+duracao, ['p@html']) I'm able to show the <br/>, but the problem now is that the text isn't displayed as UTF-8. For example, ã shows up as &#xE3;. Any fixes for this problem?
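
Once a 'p@html' selector returns the inner HTML with the <br/> tags intact, splitting the string is plain string handling. The following is a minimal sketch of that step, not part of x-ray itself; the helper name splitOnBr and the sample string are hypothetical, and it assumes the separators are simple <br>, <br/> or <br /> tags without attributes.

```javascript
// Hypothetical helper: split an HTML fragment (e.g. a value returned by a
// 'p@html' selector) into text pieces at every <br>, <br/> or <br /> tag.
function splitOnBr(html) {
  return html
    .split(/<br\s*\/?>/i)                                  // cut at each <br> variant
    .map(function (part) { return part.trim(); })          // drop surrounding whitespace
    .filter(function (part) { return part.length > 0; });  // skip empty pieces (e.g. <br><br>)
}

// Sample input standing in for one scraped paragraph.
var sample = 'first line<br/>second line<br>third line<br />';
console.log(splitOnBr(sample)); // [ 'first line', 'second line', 'third line' ]
```

Filtering out empty pieces means a double <br/><br/> behaves like a single separator, which matches the case described in the original report.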

tokhi commented 8 years ago

Add this to your filter, then you should be good:

return JSON.parse(JSON.stringify(value)); // This should be able to render the ã

alexandre1985 commented 8 years ago

Thank you @tokhi. It didn't work, but I was able to decode the numeric entities by adding the 'entities' module. I didn't want a new dependency, but now it's working.
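
The exact code used with the 'entities' module was not posted in the thread. For readers who also want to avoid the extra dependency, here is a rough dependency-free sketch that decodes only numeric character references such as &#xE3; or &#227;. The function name decodeNumericEntities is hypothetical, and named entities like &amp; are deliberately left untouched, so this is not a full replacement for a real entity decoder.

```javascript
// Hypothetical helper: decode numeric HTML character references
// (hexadecimal like &#xE3; and decimal like &#227;) into characters.
function decodeNumericEntities(text) {
  return text.replace(/&#(?:x([0-9a-f]+)|(\d+));/gi, function (match, hex, dec) {
    var codePoint = hex ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave the reference as-is if it is outside the Unicode range.
    return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : match;
  });
}

console.log(decodeNumericEntities('previs&#xE3;o')); // previsão
```

The function could be applied to each string returned by the 'p@html' selector, or registered as an x-ray filter in the same way as the clean filter shown earlier in the thread.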

tokhi commented 8 years ago

gr8, plz add your solution to help others and close the ticket if you are satisfied.