taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License
1.12k stars 112 forks source link

Quotes around attributes are stripped #17

Closed derekbwarren closed 3 years ago

derekbwarren commented 4 years ago

After using setAttribute to update the href of a link the double quotes are stripped from the attributes when converting it back to a string using toString.

Example:

Original: <a style="width:100%;" href="https://google.com" class="button--action button-text-wrap" target="_blank" rel="noopener noreferrer">Button</a>

New: <a style=width:100%; href=http://localhost:8080/metrics/link?redirect=https://google.com&#x26;subject=TEST%20-%20TEST%20-%20test%20link&#x26;name=Button class=ods-button--action button-text-wrap target=_blank rel=noopener noreferrer>Button</a>

Notice the double quotes around the attributes are now gone.

taoqf commented 4 years ago

Thanks. I've made some change, please let me know if it still bothers you.

taoqf commented 4 years ago

Thanks. I've made some change, please let me know if it still bothers you.

derekbwarren commented 4 years ago

Thanks @taoqf!

Sharcoux commented 4 years ago

That was my bad, I forgot to put back the quotes in the raw strings, sorry. But @taoqf isn't your fix a little overcomplicating things? Or did I miss something?

Moreover, with your approach, references made on attributes by an end-user won't be kept updated after using setAttribute. There is also the issue with encoding: if you decode attributes from raw string, you should probably encode them when you write them back don't you think?

Sharcoux commented 4 years ago

My tests were bad, sorry. I'll write a better one if you want.

taoqf commented 4 years ago
  1. Both attribute and rawAttribute would be changed after setAttribute(s). You may check the test https://github.com/taoqf/node-html-parser/blob/master/test/html.js#L323
  2. I could find any encode/decode thing at rawAttributes, only decode attributes, this may comes from origin registry, and I did know why should we do the encode/decode thing at attributes. if you know some thing about this, please show me.

Thanks for your report.

Sharcoux commented 4 years ago

And what about this one:

                    it('should edit the attributes of the element', function () {
            var root = parseHTML('<p a=12></p>');
                            var attr = root.firstChild.attributes;
            root.firstChild.setAttribute('a', 13);
            attr.should.eql({
                'a': '13',
            });
            root.firstChild.toString().should.eql('<p a=13></p>');
        });
Sharcoux commented 4 years ago

About your other question @taoqf , check this

taoqf commented 4 years ago

I added your tests and fixed it. Would this be OK?

Sharcoux commented 4 years ago

If I may afford a few remarks.

  1. In this code:
            const val = JSON.stringify(attrs[name]);
            if (val === undefined || val === null) {
                return name;

return name is unreachable as val will never be null or undefined. It may only be "null" or "undefined"

  1. You fixed the issue with setAttribute, but not with setAttributes.

  2. You seem to be taking into account the abilty to handle attributes like: <input required/>, which is great. But you'll probably need to make attributes['required'] and getAttribute('required') to return true instead of undefined.

  3. With you implementation,in setAttributes and setAttribute, you consider that the attributes must be provided encoded. Is this really user friendly?

taoqf commented 4 years ago

Thanks for your report.

  1. I have fixed that.
  2. I suggest we use setAttribute, getAttribute,hasAttributeandremoveAttribute,actually there is no method named setAttributes in browser, just a getter attributes which is not same behavior in our library. and I think reference is a really bad thing, maybe we just should return a copy in getter attributes.
  3. getAttribute('required') would return '', not true.
  4. No, I thin user would never concern about encode/decode , see https://github.com/taoqf/node-html-parser/commit/193a68576623c7d0348c84c6b34e15207242ae61#diff-8e5f1331d13915fbd871f0a650422099R507
Sharcoux commented 4 years ago
  1. Yes, there are different points of view about this. Some code architectures make references acceptable and useful, and some make them prohibited and sources of bugs. Do as you wish, but personnally, I consider that getXXX will return an independant object whereas object.XXX will give me a reference.

there is no method named setAttributes in browser -> Ok, but how should we edit the attributes then? Or do you want to create a set attributes setter?

  1. You're right, my bad. However, how are we supposed to set such an attribute with setAttribute? setAttribute('required', '') will generate <p required=""></p>

  2. What I mean is that it would probably be better for users that:

            root.firstChild.setAttribute('alt', '«Sogno');
            root.firstChild.rawAttributes.alt.should.eql('&laquo;Sogno');

Don't you think?

  1. line 520:
        if (arguments.length < 2) {
            throw new Error('Failed to execute \'setAttribute\' on \'Element\'');
        }

    This is not explicit at all... Why not precise that setAttribute requires 2 parameters and received only arguments.length?

Sharcoux commented 4 years ago

Personnally, I would vote for this code:

    /**
     * Set an attribute value to the HTMLElement
     * @param {string} key The attribute name
     * @param {string} value The value to set
     */
    setAttribute(key: string, value: string) {
        //Update the attributes map
        const attrs = this.attributes;
        attrs[key] = value+'';
        //Update the raw attributes
        if(this._rawAttrs) this._rawAttrs[key] = encode(value+'');
        //Update rawString
        this.rawAttrs = Object.keys(attrs).map(attr => attr+(attrs[attr]==='' ? '' : ('="'+encode(attrs[attr])+'"'))).join(' ');
    }

    /**
     * Replace all the attributes of the HTMLElement by the provided attributes
     * @param {Attributes} attributes the new attribute set
     */
    setAttributes(attributes: Attributes) {
        //Update the attributes map
        if(this.attributes) {
            Object.keys(this.attributes).forEach(key => delete this.attributes[key]);
            Object.keys(attributes).forEach(key => this.attributes[key] = attributes[key]+'');
        }
        //Update the raw attributes map
        if(this.rawAttributes) {
            Object.keys(this.rawAttributes).forEach(key => delete this.rawAttributes[key]);
            Object.keys(attributes).forEach(key => this.rawAttributes[key] = encode(attributes[key]+''));
        }
        //Update rawString
        this.rawAttrs = Object.keys(attributes).map(attr => attr+(attributes[attr]==='' ? '' : ('="'+encode(attributes[attr]+'')+'"'))).join(' ');
    }
}
taoqf commented 4 years ago
  1. I don't know either.
  2. now it is like that. the test will passed.
  3. I did this just because browser act like this.
Sharcoux commented 4 years ago

Should I propose a PR to fix 3, 4 and 5 and propose an implementation of setAttributes that avoid doing a serie of removeAttribute() / setAttribute() when replacing the current set of attributes?

And about point 2, what is your final position?

taoqf commented 4 years ago

About 2,I will say let use setAttribute. and as you see, I have removed attributes in doc, https://github.com/taoqf/node-html-parser#htmlelementsetattributekey-string-value-string prs are always welcomed

Sharcoux commented 4 years ago

Tell me if this looks ok.

I have removed attributes in doc

Why would you do that? How are users supposed to retrieve the list of attributes of a node then?

taoqf commented 4 years ago

As a html parser, I think maybe we have gone too far. I'm afraid I cannot merge your pr for now, I want more voice about this. Users can use attributes as before, but I don't encourage them to do that.https://github.com/taoqf/node-html-parser/pull/23#pullrequestreview-355527644 First, I would the methods act like browser's behavior, second, I prefer all methods keep running as fast as before. as I said, jsdom may be a better choice when users want more ability.

Sharcoux commented 4 years ago

I don't see your point here:

  1. Adding a setAttribute and setAttributes methods (and even the removeAttribute + getAttribute methods added by minas) won't change anything to the users that don't use it. It won't make the lib slower or anything as the code is run only if we need to change an attribute

  2. Parsing html and xml and not being able to edit it is really a BIG limitation.

taoqf commented 3 years ago

I will close this becuase setAttributes is already added .