What steps will reproduce the problem?
1. Run metagoofil.py as expected
2. Observe that the number of documents found is always 5 more than the number requested
What is the expected output? What do you see instead?
The expected number of matches is the number requested. Instead, the number
found is always 5 more than requested. This is due to cruft at the bottom of
the Google results page that is being matched by the compiled regular
expression for "<a href=".
What version of the product are you using? On what operating system?
metagoofil-read-only from SVN dated May 16, 2011 (revision 2)
Please provide any additional information below.
The erroneous matches can be seen in metagoofil.py's output, which comes from
googlesearch.py's invocation of parser.py's fileurls call. Note the output
below, where "Searching 100 results..." is followed by "Results: 105 files
found".
-------
[-] Searching for doc files, with a limit of 10
Searching 100 results...
Results: 105 files found
Starting to download 10 of them..
-------
From inspection, the 5 extra matches are:
'/'
'/intl/en/ads/'
'/services/'
'/intl/en/privacy.html'
'/intl/en/about.html'
These come from URLs in the footer of the Google search results page.
Patch attached to address the additional (spurious) pattern matches.
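
The attached patch is not reproduced here. As a rough illustration of the idea only, the sketch below shows one way the footer links could be excluded: keep only hrefs that are absolute URLs and drop site-relative ones such as '/intl/en/ads/'. The names HREF_RE and file_urls, the exact regular expression, and the sample HTML are all assumptions made for this example; they are not taken from parser.py or from the patch.

```python
import re

# Hypothetical pattern: the report only says the compiled regex targets '<a href='.
HREF_RE = re.compile(r'<a href="([^"]*)"')


def file_urls(html):
    """Return href values that are absolute URLs, skipping site-relative links."""
    urls = []
    for href in HREF_RE.findall(html):
        # Footer links like '/' or '/services/' are site-relative, while
        # search results are assumed here to point at absolute http(s) URLs.
        if href.startswith(("http://", "https://")):
            urls.append(href)
    return urls


# Made-up page fragment: one result link plus the five footer links from the report.
sample = (
    '<a href="http://example.com/report.doc">report</a>'
    '<a href="/">Home</a>'
    '<a href="/intl/en/ads/">Ads</a>'
    '<a href="/services/">Services</a>'
    '<a href="/intl/en/privacy.html">Privacy</a>'
    '<a href="/intl/en/about.html">About</a>'
)

print(len(HREF_RE.findall(sample)))  # raw match count, footer links included
print(file_urls(sample))             # matches left after filtering
```

With this sample, the unfiltered regex matches six hrefs, and the filter leaves only the single document URL. Filtering on the requested file extension would be another option, but it may be stricter than what the real result links allow.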
Original issue reported on code.google.com by lnxp...@gmail.com on 16 Jun 2011 at 3:00