matiskay / html-similarity

Compare html similarity using structural and style metrics
BSD 3-Clause "New" or "Revised" License
209 stars 23 forks

But I think sometimes similar pages are not scored as similar #1

Closed: thouger closed 6 years ago

thouger commented 6 years ago

I compared http://bbs.gfan.com/forum-22-1.html and http://bbs.gfan.com/android-9172442-2-1.html using the similarity function and got 0.3444178082191781. But when I compared http://bbs.gfan.com/android-9172442-2-1.html and http://bbs.gfan.com/android-9161132-1-1.html, I only got 0.358913813459268. What do you think? I think those two pages should be similar.

matiskay commented 6 years ago

Hi @thouger,

Thanks for using the package.

Similarity between http://bbs.gfan.com/forum-22-1.html and http://bbs.gfan.com/android-9172442-2-1.html

You should not rely on structural_similarity here, because the number of HTML tags on a forum page can grow dramatically: users can add p, a, and font tags as they need. You can still use it, but you should give more weight to the style similarity (k = 0.3, so 1 - k = 0.7). My suggestion is to use k = 0.3 and set a low threshold.

In [35]: from html_similarity import similarity, style_similarity, structural_similarity

In [36]: structural_similarity(html_1, html_3)
Out[36]: 0.1694300518134715

In [37]: style_similarity(html_1, html_3)
Out[37]: 0.4748603351955307

In [55]: from html_similarity.style_similarity import get_classes

In [56]: class_html_1 = get_classes(html_1)

In [57]: len(class_html_1)
Out[57]: 146

In [58]: class_html_3 = get_classes(html_3)

In [59]: len(class_html_3)
Out[59]: 118

In [60]: len(class_html_1 & class_html_3)
Out[60]: 85
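
The counts above are consistent with a Jaccard index over the two class sets (shared classes divided by all distinct classes), which matches the style_similarity value in Out[37]. Here is a minimal sketch of that arithmetic. Note that jaccard_from_counts is a hypothetical helper written for illustration; it is not part of html_similarity, and the empty-set convention is an assumption.

```python
def jaccard_from_counts(size_a: int, size_b: int, size_shared: int) -> float:
    """Jaccard index |A & B| / |A | B|, computed from set sizes alone."""
    # Inclusion-exclusion gives the size of the union.
    size_union = size_a + size_b - size_shared
    if size_union == 0:
        return 0.0  # assumed convention for two empty sets
    return size_shared / size_union


# Counts taken from Out[57], Out[59], and Out[60] above.
style_score = jaccard_from_counts(146, 118, 85)
print(style_score)  # 85 / 179, roughly 0.4749
```

With 146 and 118 classes and 85 in common, the union has 179 classes, so a little under half of all distinct class names are shared by the two pages.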

Similarity between http://bbs.gfan.com/android-9172442-2-1.html and http://bbs.gfan.com/android-9161132-1-1.html

The similarity should work for the second case:

In [11]: from html_similarity import similarity

In [12]: html_1 = open('android-9161132-1-1.html').read()

In [13]: html_2 = open('android-9172442-2-1.html').read()

In [14]: similarity(html_1, html_2)
Out[14]: 0.6515255079848381

I got 0.65. Looking at the web page, the forum lets users add their own content (multiple font and p elements, which make the structure differ). In this case I suggest putting less weight on the structure.

Using k=0.3 I got:

In [26]: similarity(html_1, html_2, 0.3)
Out[26]: 0.7067047784751134
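
To see why lowering k raises the score here, it helps to assume the combined score is a weighted sum, k * structural + (1 - k) * style. Under that assumption, the two scores reported above (k = 0.5 and k = 0.3) are enough to back out both components. The sketch below is illustrative only: combine and split_scores are hypothetical helpers, not functions from html_similarity.

```python
def combine(structural: float, style: float, k: float = 0.5) -> float:
    """Weighted sum: k weights the structure, (1 - k) weights the style."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be in [0, 1]")
    return k * structural + (1 - k) * style


def split_scores(score_a: float, k_a: float, score_b: float, k_b: float):
    """Recover (structural, style) from two combined scores at different k."""
    # score = style + k * (structural - style), a straight line in k.
    diff = (score_a - score_b) / (k_a - k_b)  # structural - style
    style = score_a - k_a * diff
    structural = style + diff
    return structural, style


# Scores from Out[14] (default k = 0.5) and Out[26] (k = 0.3).
structural, style = split_scores(0.6515255079848381, 0.5,
                                 0.7067047784751134, 0.3)
print(round(structural, 4), round(style, 4))  # 0.5136 0.7895
```

If the assumption holds, the style score (about 0.79) is well above the structural score (about 0.51) for these two pages, so shifting weight toward style raises the combined value.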

I hope it helps. Let me know if you have any other questions or doubts.

thouger commented 6 years ago

I am very grateful for your answer! Part 1 worked, but I did not get 0.6515255079848381 for part 2. I suspected the difference came from how you obtained the source for html_1 and html_2, so I tried fetching the HTML with requests, urllib.request, and PhantomJS, and also by saving the page as HTML from the browser. This is what I got:

requests

similarity(h1,h2)
0.5741937488348972

urllib.request

similarity(h3.decode('utf-8'),h4.decode('utf-8'))
0.5741937488348972

phantomjs

similarity(h9,h10)
0.5900613016825356

page saved as HTML from the browser

similarity(h5,h6)
0.5897597237261845

But I did not get a score as high as yours, so I would like to know how you opened the URLs and saved the pages. I think that is the key.

matiskay commented 6 years ago

I'm using Python 3.6 and I use wget to download the pages. Bear in mind that the website may be sending different content to me because I'm located in South America. For the last part, you can use a threshold of 0.55 to consider two pages similar.
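
As a small sketch of the threshold idea, the snippet below applies a 0.55 cutoff to the scores reported earlier in this thread. The is_similar helper is hypothetical (not part of html_similarity), and treating a score equal to the threshold as similar is an assumption.

```python
def is_similar(score: float, threshold: float = 0.55) -> bool:
    """Treat two pages as similar when their score reaches the threshold."""
    return score >= threshold


# Scores reported in this thread for the two thread pages, by fetch method.
scores = {
    "requests": 0.5741937488348972,
    "urllib.request": 0.5741937488348972,
    "phantomjs": 0.5900613016825356,
    "browser save": 0.5897597237261845,
}

for source, score in scores.items():
    print(source, is_similar(score))

# The forum index page versus a thread page scored much lower.
print("forum vs thread", is_similar(0.3444178082191781))
```

All four download methods land above 0.55, while the forum index versus thread comparison stays below it.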

matiskay commented 6 years ago

@thouger, here is the html that I'm using https://www.dropbox.com/sh/6p0f4e9k9ldei6j/AABTb-ApCNfq6cdcWVHMAx2ca?dl=0

thouger commented 6 years ago

At last I got the same answer as you. Thank you for your help; I think I was doing it the wrong way. Your being in South America is fine, because I am in no hurry, and the hope you gave me keeps me going. I am actually writing my bachelor thesis, which is on "form data extraction". Before I saw this package, I was learning simple tree matching (STM). When I saw this package, I thought: it does an excellent job, and in an easier way than STM! I think the package may help me finish my paper, so I hope I can share the idea and analyze its structure in my bachelor thesis. I will clearly credit your name and where the package comes from. Three years ago, a student in my major who was three years ahead of me presented his bachelor thesis when he graduated, and that made me love data extraction. Now that I am preparing to graduate, I hope my paper can lead more people to love this field. I would be very grateful if you agree.

matiskay commented 6 years ago

Cool. This package uses a heuristic to measure the structural similarity. I will do some experiments on my own today and check if I find something.

Great to hear about form data extraction. That sounds really interesting. I would like to know what strategy you are planning to use for the form data extraction.

Note: There is a package called Formasaurus which extracts forms from web pages. Formasaurus classifies which kind of form is on a web page (login, signup, search, mailing list, etc.).

import formasaurus
import requests

# extract_forms expects HTML text (or a parsed lxml tree), not a Response object
response = requests.get('http://github.com/')
formasaurus.extract_forms(response.text)

[Image: speaker deck 2017-11-08 09-17-14]

Formasaurus uses logistic regression to make the classification using the following features:

You can read more about it at: http://formasaurus.readthedocs.io/en/latest/

thouger commented 6 years ago

I am very sorry (;´д`)ゞ because I left out a letter: it is "forum", not "form". In my paper, I extract the next-page URL, forum user data, forum post data, and post URLs. I will be very happy to share my strategy with you, but I can't show it in GitHub issues because the original paper I refer to is 88 pages long. We could also use another chat tool. I am also very interested in your other repositories, so I will spend some time looking at them. Thank you for https://github.com/TeamHG-Memex/Formasaurus; I had never heard of it before. I hope you do not mind my mistake of leaving out the "u". Now I am going to sleep.

matiskay commented 6 years ago

Hi @thouger, sorry for the delay. I will run the experiments over the weekend. I was busy these days.

I have some ideas to solve your problem:

Have fun writing your paper. I like the problem, so feel free to ask any questions. My email is my-github-username AT gmail DOT com ;)

thouger commented 6 years ago

Thanks for your help! I thought you would never reply to me. You gave me so much information that I must spend some time analyzing it.

  1. Using a classifier to detect forums is something I had never thought of. After a quick search, I found that naive Bayes is amazing. It has profound significance: the world is uncertain because human observation has limitations, and naive Bayes forms hypotheses about what we cannot see based on what we can see. I'm sorry, my English ability is limited (;´д`)ゞ(;´д`)ゞ
  2. The second point is the focus of my research, but I have made another discovery that is specific to forums. I will share it with you after I finish.
  3. On the third point, I found the paper (but it is written in Chinese!) and I have already implemented it in Java, but I'm still curious about what else can be done in autopager.

It is wonderful to have met you while I am doing related research. I am fascinated by what you gave me above. I will study it next.