html-to-text / node-html-to-text

Advanced html to text converter

Can you define options for a baseElement selector? #281

Closed pgoldweic closed 1 year ago

pgoldweic commented 1 year ago

I am trying to retrieve text for a specific selector ONLY, with specific options for such selector. I've tried doing the following:

 let text = convert(html, {baseElements: { selectors: ['h1.PageTitle']}} )

which works correctly but does NOT let me set any options for how I want the title to show up. When I try instead:

let text = convert(html, {baseElements: { selectors: ['h1.PageTitle'] }, 
        selectors: [ {selector: 'h1.PageTitle', format: 'block', options: { uppercase: false} } ] })

I get the text for the whole document and NOT just for the baseElement selectors. What am I doing wrong? Or, is there no way to specify formatting for the base element selectors?

KillyMXI commented 1 year ago

I checked it to make sure, and I can't reproduce the issue.

{
  baseElements: { selectors: ['div.foo'] },
  selectors: [
    { selector: 'div.foo', format: 'blockTag', options: { leadingLineBreaks: 5 } }
  ]
}

-- this works just fine in my experiments, elements are selected and formatted accordingly.

That's how it works in the code. Selected base elements are processed by the same rules as any children elements.

I can't see typos in your second example (block formatter doesn't have anything to do with uppercase option but that's irrelevant to the described issue). Make sure you are running what you think you are running.

pgoldweic commented 1 year ago

Thanks @KillyMXI for your prompt response! However, I continue to see the totality of the text in my tests... this is very odd. I have double checked to ensure that my syntax is correct and haven't found anything wrong yet. I've also changed to using a 'heading' format instead of 'block' to see if that causes any changes, but the output hasn't changed. Let me know if you have any other ideas. Thanks!

KillyMXI commented 1 year ago

I don't have enough information to even guess.

How do you run your code? If in Node.js, then what Node version is it? Are you using html-to-text version 9.0.3? Is there any chance you're editing one file but testing another? Are you preprocessing your html in any way before converting?

Try to make an isolated example. (npm init a separate package, npm i html-to-text, in the index.js do just the conversion, similar to the example, just with your html and options. Run it with node ./index.js) Does the issue persist this way? If yes, then I'd like to take a look at the reproduction example (code and html). If no, then you'd have to keep narrowing on the cause of the issue in your pipeline differences.

pgoldweic commented 1 year ago

OK @KillyMXI, I think I figured out how to resolve the problem, although I'm not sure I can explain it myself (most likely I misunderstood the configuration instructions for better performance, that is, the 'compile' option). This morning I had taken the line in my script that read:

const { convert } = require('html-to-text')

and replaced it with:

const { compile } = require('html-to-text')
const convert = compile({
    wordwrap: 130
})

and then used 'convert' just like I was using it before the change. However, this caused the code to break as I described earlier. When I changed it back to using the original configuration for 'convert', it started working again. From here I conclude that the 'compile' configuration is likely not appropriate for regular use.

KillyMXI commented 1 year ago

const { compile } = require('html-to-text')

const convert = compile({ ...options }) // options here

const text = convert(html) // no options here

-- this convert is different - it already has options in it. You can't add more options later when you call it. It is recommended when you have to process many documents with the same options.
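
As a rough mental model of that difference, here is a toy stand-in. It is not the library's actual code: `compileSketch` and its `uppercase` handling are made up for illustration and only mimic the calling convention described above, where options are fixed when the converter is built and are not read at call time.

```javascript
// Hypothetical stand-in, NOT html-to-text's implementation. It only mimics
// the "options at compile time, none at call time" calling convention.
function compileSketch(options = {}) {
  // Options are captured once, when the converter is built.
  const frozen = { uppercase: true, ...options };
  // The returned function declares only the input parameter, so anything
  // passed as a second argument is never treated as options.
  return function convert(text) {
    return frozen.uppercase ? text.toUpperCase() : text;
  };
}

// Built with only a wordwrap setting, as in the script that misbehaved.
const convertDefault = compileSketch({ wordwrap: 130 });
// The options object passed at call time is silently ignored.
const ignored = convertDefault('Page title', { uppercase: false });

// Built with every option up front.
const convertConfigured = compileSketch({ wordwrap: 130, uppercase: false });
const respected = convertConfigured('Page title');

console.log(ignored);   // PAGE TITLE
console.log(respected); // Page title
```

By the same logic, the earlier script would presumably behave as intended if `baseElements` and `selectors` were moved into the `compile({ ... })` call next to `wordwrap`.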

Perhaps I can improve the documentation a bit to make the difference clearer.

pgoldweic commented 1 year ago

That sounds like a good idea @KillyMXI . Thanks for your explanation!

KillyMXI commented 1 year ago

I updated the readme a bit. That will hopefully reduce the chance of such confusion.

Documentation is due for a rework. I'm not paying a lot of attention to it currently, until I get to properly organizing it.