thouger closed this issue 6 years ago
Hi @thouger,
Thanks for using the package.
You should not rely on structural_similarity here, because the number of HTML tags in a forum page can grow dramatically: users can add p, a, and font tags as they need. You can still use it, but you should give more weight to the style similarity (1 - k = 0.7, i.e. k = 0.3). My suggestion is to use k = 0.3 and set a low threshold.
In [35]: from html_similarity import similarity, style_similarity, structural_similarity
In [36]: structural_similarity(html_1, html_3)
Out[36]: 0.1694300518134715
In [37]: style_similarity(html_1, html_3)
Out[37]: 0.4748603351955307
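Given those two component scores, the combined similarity is presumably a weighted average, with k weighting the structural part and 1 - k the style part. This is a sketch of that assumed weighting, not the package's actual source:

```python
# Sketch of the assumed weighting: k weights structural similarity,
# (1 - k) weights style similarity.
def weighted_similarity(structural, style, k=0.3):
    return k * structural + (1 - k) * style

# With the component scores from the session above; the result is
# dominated by the style score because 1 - k = 0.7.
combined = weighted_similarity(0.1694300518134715, 0.4748603351955307, k=0.3)
print(combined)
```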
In [55]: from html_similarity.style_similarity import get_classes
In [56]: class_html_1 = get_classes(html_1)
In [57]: len(class_html_1)
Out[57]: 146
In [58]: class_html_3 = get_classes(html_3)
In [59]: len(class_html_3)
Out[59]: 118
In [60]: len(class_html_1 & class_html_3)
Out[60]: 85
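The class counts above are consistent with style_similarity being a Jaccard index over the sets of CSS classes (an assumption inferred from these numbers, not from the package source): the union has 146 + 118 - 85 = 179 classes, and 85 / 179 ≈ 0.4749, matching Out[37].

```python
def jaccard(a, b):
    # |A intersect B| / |A union B|; treat two empty sets as fully similar.
    union = len(a | b)
    return len(a & b) / union if union else 1.0

# Dummy class sets with the same cardinalities as the session above:
# 146 classes, 118 classes, 85 shared.
class_html_1 = {f"c{i}" for i in range(146)}
class_html_3 = {f"c{i}" for i in range(61, 179)}
print(jaccard(class_html_1, class_html_3))  # 85 / 179 ≈ 0.4749
```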
The similarity should work for the second case:
In [11]: from html_similarity import similarity
In [12]: html_1 = open('android-9161132-1-1.html').read()
In [13]: html_2 = open('android-9172442-2-1.html').read()
In [14]: similarity(html_1, html_2)
Out[14]: 0.6515255079848381
I got 0.65. As I see, the forum allows you to add your own content (multiple font and p elements, which makes the structures differ). In this case I suggest using less weight on the structure.
Using k = 0.3 I got:
In [26]: similarity(html_1, html_2, 0.3)
Out[26]: 0.7067047784751134
I hope it helps. Let me know if you have any other questions or doubts.
I'm very grateful for your answer! I succeeded with part 1, but I didn't get 0.6515255079848381 for part 2. I suspected our sources for html_1 and html_2 differ, so I tried fetching the pages with requests, urllib.request, PhantomJS, and by saving the page as HTML from the browser. These are the results I got:

requests
similarity(h1,h2)
0.5741937488348972
urllib.request
similarity(h3.decode('utf-8'),h4.decode('utf-8'))
0.5741937488348972
phantomjs
similarity(h9,h10)
0.5900613016825356
saving the page as HTML from the browser
similarity(h5,h6)
0.5897597237261845
But none of my results are as high as yours, so I want to know how you opened and saved the URLs. I think that's the key.
I'm using Python 3.6 and I use wget to download the pages. Bear in mind that the website may be sending me different content because I'm located in South America. For the last part you can use a threshold of 0.55 to consider two pages similar.
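Treating that 0.55 as a simple cutoff, all four scores reported earlier in the thread already clear it, so every fetch method would classify the two pages as similar (a small check, assuming a plain threshold decision):

```python
def is_similar(score, threshold=0.55):
    # Simple cutoff decision using the threshold suggested above.
    return score >= threshold

# The four scores reported earlier in the thread, by fetch method.
scores = {
    "requests": 0.5741937488348972,
    "urllib.request": 0.5741937488348972,
    "phantomjs": 0.5900613016825356,
    "saved from browser": 0.5897597237261845,
}
print({name: is_similar(s) for name, s in scores.items()})  # all True
```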
@thouger, here is the html that I'm using https://www.dropbox.com/sh/6p0f4e9k9ldei6j/AABTb-ApCNfq6cdcWVHMAx2ca?dl=0
At last I got the same answer as you, thank you for your help. I think I had been using the wrong method. The South America point is fine, because I'm not in a hurry, and the hope you gave me keeps me going. Actually, I am writing my bachelor thesis, which is 《form data extraction》. Before I saw this package, I was learning simple tree matching. When I saw this package, I thought: it does this in an excellent and easier way than STM! I think the package may help me finish my paper, so I hope I can share the idea and analyze its structure in my bachelor thesis. I will clearly credit your name and where the package comes from. A man who showed me his bachelor thesis when he graduated got me to love data extraction three years ago. He was in my major, three years ahead of me. Now that I am preparing to graduate, I hope my paper can likewise get more people to love this field. I would be very grateful if you agree.
Cool. This package uses a heuristic to measure the structural similarity. I will do some experiments on my own today and check if I find something.
Great to hear about form data extraction. That sounds really interesting. I would like to know what strategy you are planning to use for the form data extraction.
Note: there is a package called Formasaurus which extracts forms from web pages. Formasaurus classifies which kind of form is in a web page (login, signup, search, mailing list, etc.):
import formasaurus
import requests
# requests.get returns a Response object; pass its HTML text to Formasaurus
html = requests.get('http://github.com/').text
formasaurus.extract_forms(html)
Formasaurus uses logistic regression to make the classification using the following features:
You can read more about it at: http://formasaurus.readthedocs.io/en/latest/
I am very sorry (;´д`)ゞ, I missed a letter in "form": it's forum, not form. In my paper, I extract the URL of the next page, forum user data, forum post data, and post URLs. I will be very happy to share the strategy with you, but I can't show you in GitHub issues because the original paper I refer to is 88 pages; we could use another chat tool instead. I am also very interested in your other repositories, so I will spend some time looking at them. Thank you for https://github.com/TeamHG-Memex/Formasaurus, I had never heard of it before. I hope you don't mind my mistake of missing the "u". Now I am going to sleep.
Hi @thouger, sorry for the delay. I will make the experiments over the weekend. I was busy these days.
I got some ideas to solve your problem:
Have fun doing your paper. I like the problem; feel free to ask any questions. My email is my-github-username AT gmail DOT com ;)
Thanks for your help! I thought you would never reply to me. You gave so much information that I need to spend some time analyzing it.
It is wonderful to meet you while I'm doing related research. I'm fascinated by what you gave me above; I will study it next.
The URLs I had to compare are [http://bbs.gfan.com/forum-22-1.html, http://bbs.gfan.com/android-9172442-2-1.html]. Using the similarity function I got 0.3444178082191781, but when I compare [http://bbs.gfan.com/android-9172442-2-1.html, http://bbs.gfan.com/android-9161132-1-1.html], I only get 0.358913813459268. What do you think? I think they should be similar.