Hey Veesa, looks like my Firefox history might not have the same format as yours :-/
~/PeARS$ python ./retrieve_pages.py
Linux 3.11.0-26-generic
/home/aurelie
/home/aurelie/.mozilla/firefox
/home/aurelie/.mozilla/firefox/j1hoqosk.default/places.sqlite
Traceback (most recent call last):
  File "./retrieve_pages.py", line 110, in <module>
    retrieve_pages()
  File "./retrieve_pages.py", line 42, in retrieve_pages
    cur.execute("SELECT * FROM History;")
sqlite3.OperationalError: no such table: History
I do get my URLs by running
sqlite3 ~/.mozilla/firefox/j1hoqosk.default/places.sqlite "SELECT url FROM moz_places"
I'm running Firefox 39.0.3 on Ubuntu 12.04.
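For reference, here is a minimal Python sketch of that same query; the profile glob is an assumption based on the path printed above, and it reads the standard moz_places table rather than a History table:

import glob
import os
import sqlite3

# Assumed default Firefox profile layout, as in the path printed by the script.
for db_path in glob.glob(os.path.expanduser("~/.mozilla/firefox/*.default/places.sqlite")):
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    # moz_places is the table that actually holds visited URLs in places.sqlite.
    cur.execute("SELECT url FROM moz_places;")
    for (url,) in cur.fetchall():
        print(url)
    conn.close()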
Hi, okay, I get it now. I commented out the line
if not HISTORY_DB:
Basically, create_history_db wasn't getting called; I think it was a first-time-use glitch. The rest seems to be working beautifully :)
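To illustrate the kind of first-run guard being discussed, here is a hedged sketch; the names HISTORY_DB and create_history_db come from the thread, but the path, schema, and the os.path.exists check are placeholders rather than the real code in retrieve_pages.py:

import os
import sqlite3

# Placeholder path; the real value lives in retrieve_pages.py.
HISTORY_DB = "history.db"

def create_history_db():
    # Hypothetical schema, just to make the sketch self-contained.
    conn = sqlite3.connect(HISTORY_DB)
    conn.execute("CREATE TABLE IF NOT EXISTS History (url TEXT, title TEXT);")
    conn.commit()
    conn.close()

def ensure_history_db():
    # On a first run the database file does not exist yet, so the check needs
    # to look at the filesystem rather than test the (always truthy) path string.
    if not os.path.exists(HISTORY_DB):
        create_history_db()

ensure_history_db()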
BeautifulSoup doesn't always seem to strip things cleanly... I printed body_str for the Ubuntu home page and quite a few tags are still included:
200 - http://www.ubuntu.com/
The leading OS for PC, tablet, phone and cloud | Ubuntu html[if lt IE 7]> <html class="no-js lt-ie10 lt-ie9 lt-ie8 lt-ie7" lang="en" dir="ltr"> <![endif][if IE 7]> <html class="no-js lt-ie10 lt-ie9 lt-ie8" lang="en" dir="ltr"> <![endif][if IE 8]> <html class="no-js lt-ie10 lt-ie9" lang="en" dir="ltr"> <![endif][if IE 9]> <html class="no-js lt-ie10" lang="en" dir="ltr"> <![endif][if gt IE 8]><!<![endif][if IE]>
<meta http-equiv="X-UA-Compatible" content="IE=8">
<![endif]The leading OS for PC, tablet, phone and cloud | Ubuntu stylesheets javascript
google tag manager
end google tag manager
etc...
I guess if we keep a clean semantic space, it won't matter too much. The question is whether we can...
Have you noticed cruft in a lot of other pages as well? Or was it just this one that was particularly egregious?
Just as an FYI I was using Wikipedia as a baseline.
We're not going to be able to get clean data from every page, but I'm sure more can be done with BeautifulSoup.
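As a starting point, here is a hedged sketch of the kind of cleanup that is possible; the tag list and Comment handling are standard bs4 idioms, not the exact code in retrieve_pages.py:

from bs4 import BeautifulSoup
from bs4.element import Comment

html = """<html><!--[if lt IE 7]> conditional cruft <![endif]-->
<head><title>The leading OS | Ubuntu</title><style>body {color: red}</style></head>
<body><script>var x = 1;</script><p>Actual page text.</p></body></html>"""

bs_obj = BeautifulSoup(html, "html.parser")

# Remove tags whose contents are never user-visible text.
for tag in bs_obj(["script", "style", "noscript"]):
    tag.decompose()

# Remove HTML comments, including IE conditional comments like the ones above.
for comment in bs_obj.find_all(string=lambda s: isinstance(s, Comment)):
    comment.extract()

body_str = bs_obj.body.get_text(separator=" ", strip=True) if bs_obj.body else ""
print(body_str)  # -> "Actual page text."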
I have added a couple of BeautifulSoup elements to be extracted and tested it with ubuntu.com. It seemed to fix the problem, but please keep testing on your URLs and providing feedback!
Well, I'm giving it a tough test by just going through my history without a filter (well, nearly without -- I'm using your .pearsignore) :) Somehow, I was still getting issues on ubuntu.com with the new BeautifulSoup elements. Things are on the whole better if I do
body = bs_obj.body.get_text()
instead of
body = bs_obj.get_text()
i.e. just consider the HTML body anyway. One bad page I've come across is http://www.cl.cam.ac.uk/ (see the HTML left at the end of the parsed page).
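For completeness, a small self-contained sketch of that change; the None guard is an addition of mine, since bs_obj.body can be missing on badly broken pages:

from bs4 import BeautifulSoup

html = "<html><head><title>t</title></head><body><p>page text</p></body></html>"
bs_obj = BeautifulSoup(html, "html.parser")

# Prefer the <body> text when it is present; fall back to the whole document
# so pages with a broken or missing <body> tag still yield something.
body = bs_obj.body.get_text() if bs_obj.body is not None else bs_obj.get_text()
print(body)  # -> "page text"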
Overview
Here are the files that create a database of pages, excluding domains from a comma-delimited, user-supplied list called .pearsignore located in the home directory.
The reason the .pearsignore file is in the home directory is that we wanted users to be able to keep their exclude lists private; this way the list will never be committed to the repository. Thoughts on this strategy are appreciated.
I would suggest we at least document a generic .pearsignore, because it would save people a lot of time configuring their own. This program can take some time to run, depending on the size of the original Firefox browsing history. At this time it prints to standard output so you can see which URLs are being committed and which are being omitted.
.pearsignore.txt
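To make the filtering behaviour concrete, here is a hedged sketch; the comma-delimited format and home-directory location come from the description above, while the exact path and the helper names are hypothetical:

import os
from urllib.parse import urlparse

def load_pearsignore(path=os.path.expanduser("~/.pearsignore")):
    # Assumed location and format: a single comma-delimited list of domains.
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {d.strip() for d in f.read().split(",") if d.strip()}

def keep_url(url, ignored_domains):
    domain = urlparse(url).netloc
    return not any(domain.endswith(d) for d in ignored_domains)

ignored = load_pearsignore()
for url in ["http://www.ubuntu.com/", "https://example.com/page"]:
    # Print what is committed and what is omitted, as the script does.
    print(("committed" if keep_url(url, ignored) else "omitted"), "-", url)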
Run retrieve_pages.py to execute.
See commit message for details.
This is a very rough draft, but I have run out of URLs to test it with.
Assumptions
The user is running a Linux distribution and is using the latest Firefox as a browser.