jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License
10.88k stars 2.17k forks source link

Documents with same content should be equals #455

Closed d0x closed 9 years ago

d0x commented 10 years ago

The Element implements an object reference check in equals(). Is it possible to provide another one using the content of the element? Or is that a design decision?

Example test:

    @Test
    public void documents_with_same_content_should_be_equal() throws Exception
    {
        Document docA = Jsoup.parse("<div/>");
        Document docB = Jsoup.parse("<div/>");

        assertEquals(docA, docB);
    }

Equals Implementation:

public class Element {
// ...
@Override
    public boolean equals(Object o) {
        return this == o;
    }
// ...
}
offa commented 9 years ago

This would be nice, but may cause some problems: While it's ok for small docs, it might end up in a huge iteration to visit all nodes of a document to check equality.

jhy commented 9 years ago

Thanks, I implemented this just recently. It works as expected, but as @offa the runtime could potentially be slow for large, almost equal docs. But up to the user, and at least it works as expected. Can always use == if the previous behavior is really wanted (but you would already have been doing that.)

magneticflux- commented 6 years ago

@jhy Are you sure this is implemented? I'm using 1.11.2 and it doesn't seem to have an effect. I don't see how that test case could pass currently. I have to compare the toString() values instead as a workaround.

EDIT: Ah, I see now that the change was reverted and functionality was moved to the hasSameValue method. It's a bit annoying when using Kotlin's Data Classes since they automatically use the equals method instead of your custom one. It seems a bit odd to revert it since there's already a way to test reference equality in Java == and in Kotlin ===.

jhy commented 6 years ago

It's not ideal either way but was much worse when implemented, because it blew up everyone using hashmaps / sets to store documents. At least as it is you can use the custom method.

magneticflux- commented 6 years ago

If only there was something like a Comparator<T> but for equals and hashCode 😄.

I feel like there might be more efficient ways to implement equals and hashCode. I'm also curious why people were using the Element class as a Key in a map or putting it into a set, since they won't work correctly (technically they will but they're pretty useless) unless equals and hashCode are implemented.

I'd be inclined to call it unintended behavior, bump the minor version number, and tell users that run into issues to either use a list or an IdentityHashMap/IdentityHashSet (included since Java 1.4). My reasoning is that I feel that not implementing equals and hashCode for an obvious and useful datastructure subverts developer expectations (as evidenced by the number of people using sets).

If the above is too drastic, it also may be possible to allow users to disable the new behavior by including a jsoup.properties file or something that could be read to set compatibility options. I know ical4j uses a similar method to relax certain parsing options for compatibility and specify custom factory classes.