minimalparts / PeARS

Archive repository for the PeARS project. Please head over to https://github.com/PeARSearch/PeARS-orchard for the latest version.
MIT License

HTML Parser #33

Closed by veesa 8 years ago

veesa commented 8 years ago

Overview

Here are the files that will create a database of pages, excluding domains from a comma-delimited, user-supplied list called .pearsignore, located in the home directory.

The reason the .pearsignore file is in the home directory is that we wanted users to be able to keep their exclude list private; this way it will never be committed to the repository. Thoughts on this strategy are appreciated.

I would suggest we at least document a generic .pearsignore, because it would save people a lot of time configuring their own. This program can take some time to run, depending on the size of the original Firefox browsing history. At the moment it prints to standard out so you can see which URLs are being committed and omitted.

.pearsignore.txt

Run retrieve_pages.py to execute.
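For illustration, the domain-exclusion check could look something like the sketch below. The function names are hypothetical, and the single-line comma-delimited format is assumed from the description above (Python 3 shown; on Python 2 the import is the urlparse module):

import os
from urllib.parse import urlparse

def load_pearsignore(path=os.path.expanduser("~/.pearsignore")):
    """Read the comma-delimited exclude list from the home directory."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {d.strip() for d in f.read().split(",") if d.strip()}

def is_excluded(url, excluded):
    """True if the URL's host matches a domain on the exclude list."""
    host = urlparse(url).netloc
    # endswith() so that "facebook.com" also excludes "www.facebook.com".
    return any(host == d or host.endswith("." + d) for d in excluded)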

See commit message for details.

This is a very rough draft, but I have run out of URLs to test it with.

Assumptions

The user is running a Linux distribution and is using the latest Firefox as their browser.

minimalparts commented 8 years ago

Hey Veesa, looks like my Firefox history might not have the same format as yours :-/

~/PeARS$ python ./retrieve_pages.py 
Linux 3.11.0-26-generic
/home/aurelie

/home/aurelie/.mozilla/firefox
/home/aurelie/.mozilla/firefox/j1hoqosk.default/places.sqlite
Traceback (most recent call last):
  File "./retrieve_pages.py", line 110, in <module>
    retrieve_pages()
  File "./retrieve_pages.py", line 42, in retrieve_pages
    cur.execute("SELECT * FROM History;")
sqlite3.OperationalError: no such table: History

I do get my URLs by doing

sqlite3 ~/.mozilla/firefox/j1hoqosk.default/places.sqlite "SELECT url FROM moz_places"

I'm running Firefox 39.0.3 on Ubuntu 12.04.
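For what it's worth, the script-side equivalent of that query might look something like this sketch (the profile glob pattern is an assumption based on my setup):

import glob
import os
import sqlite3

# Locate the default profile's places.sqlite; profile folder names vary,
# so this glob pattern is an assumption.
matches = glob.glob(os.path.expanduser("~/.mozilla/firefox/*.default/places.sqlite"))
if matches:
    conn = sqlite3.connect(matches[0])
    cur = conn.cursor()
    # Firefox stores its history in moz_places, not in a "History" table.
    cur.execute("SELECT url FROM moz_places;")
    urls = [row[0] for row in cur.fetchall()]
    conn.close()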

minimalparts commented 8 years ago

Hi, okay, I get it now. I commented out the line

if not HISTORY_DB:

Basically, create_history_db wasn't getting called; I think it was a first-time-use glitch. The rest seems to be working beautifully :)
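For what it's worth, a first-run guard along these lines might be cleaner than commenting the check out (a sketch only; HISTORY_DB and create_history_db are the names from retrieve_pages.py, and treating HISTORY_DB as a file path is my assumption):

import os

# Only build the local history database when it doesn't exist yet,
# so first-time runs create it and later runs reuse it.
if not os.path.exists(HISTORY_DB):
    create_history_db()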

minimalparts commented 8 years ago

BeautifulSoup seems to not always strip things cleanly... I printed body_str for the Ubuntu home page and I get quite a few tags included:

200 - http://www.ubuntu.com/
The leading OS for PC, tablet, phone and cloud | Ubuntu html[if lt IE 7]> <html class="no-js lt-ie10 lt-ie9 lt-ie8 lt-ie7" lang="en" dir="ltr"> <![endif][if IE 7]>    <html class="no-js lt-ie10 lt-ie9 lt-ie8" lang="en" dir="ltr"> <![endif][if IE 8]>    <html class="no-js lt-ie10 lt-ie9" lang="en" dir="ltr"> <![endif][if IE 9]>    <html class="no-js lt-ie10" lang="en" dir="ltr"> <![endif][if gt IE 8]><!<![endif][if IE]>
<meta http-equiv="X-UA-Compatible" content="IE=8">
<![endif]The leading OS for PC, tablet, phone and cloud | Ubuntu stylesheets  javascript 
 google tag manager 

 end google tag manager 

etc...

I guess if we keep a clean semantic space, it won't matter too much. The question is whether we can...
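For what it's worth, the leftover markup above looks like comment nodes (including the IE conditional comments), which get_text() keeps. A sketch of stripping them explicitly, assuming a reasonably recent BeautifulSoup 4:

from bs4 import BeautifulSoup, Comment

def clean_text(html):
    soup = BeautifulSoup(html, "html.parser")
    # Remove script/style elements, whose contents get_text() would keep.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Remove comment nodes, including IE conditionals like <!--[if lt IE 7]>.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return soup.get_text(separator=" ", strip=True)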

veesa commented 8 years ago

Have you noticed cruft in a lot of other pages as well? Or was it just this one that was particularly egregious?

Just as an FYI, I was using Wikipedia as a baseline.

We're not going to be able to get clean data from every page, but I'm sure more can be done with BeautifulSoup.


veesa commented 8 years ago

I have added a couple of BeautifulSoup elements to be extracted and tested it with ubuntu.com. It seemed to fix the problem, but please keep testing on your URLs and providing feedback!

minimalparts commented 8 years ago

Well, I'm giving it a tough test by just going through my history without a filter (well, nearly without: I'm using your .pearsignore) :) Somehow, I was still getting issues on ubuntu.com with the new BeautifulSoup elements. Things are on the whole better if I do

body = bs_obj.body.get_text()

instead of

body = bs_obj.get_text()

i.e. just consider the HTML body anyway. One bad page I've come across is http://www.cl.cam.ac.uk/ (see the HTML left at the end of the parsed page).
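One caveat with the bs_obj.body variant (an assumption on my part, not something I have actually hit): body can be None for malformed pages, so a small fallback might be safer:

from bs4 import BeautifulSoup

def body_text(html):
    bs_obj = BeautifulSoup(html, "html.parser")
    # Fall back to the whole document when a malformed page has no <body>.
    node = bs_obj.body if bs_obj.body is not None else bs_obj
    return node.get_text()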