tatanus / metagoofil

Automatically exported from code.google.com/p/metagoofil
GNU General Public License v2.0

Regular expression adds false-matches to the number of documents returned in query #2

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run metagoofil.py as expected
2. Observe that the number of documents found is always 5 more than what was specified

What is the expected output? What do you see instead?

The number of matches should not exceed the number of results searched. 
Instead, the reported number is always 5 more than requested.  This is due to 
cruft at the bottom of the Google results page that is being matched by the 
compiled regular expression for "<a href=".

What version of the product are you using? On what operating system?
metagoofil-read-only from SVN dated May 16, 2011 (revision 2)

Please provide any additional information below.

The erroneous matches show up in metagoofil.py's output, which comes from 
googlesearch.py's invocation of parser.py's fileurls call.  Note the output 
below, where "Searching 100 results..." is followed by "Results: 105 files 
found".
-------
[-] Searching for doc files, with a limit of 10
        Searching 100 results...
Results: 105 files found
Starting to download 10 of them..

From inspection, the 5 extra matches are:
 '/'
 '/intl/en/ads/'
 '/services/'
 '/intl/en/privacy.html'
 '/intl/en/about.html' 

and are being matched from URLs in the footer of the Google search page.

Patch attached to address the additional (spurious) pattern matches.
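
The attached patch is not reproduced in this report. As a rough illustration only, one way to avoid the spurious matches is to keep only absolute links that end in the requested file type, which drops site-relative footer links such as '/services/'. The names HREF_RE and file_urls and the exact pattern below are assumptions made for this sketch; they are not taken from parser.py or from the patch.

```python
import re

# Assumed pattern for this sketch: capture the target of every '<a href="..."'.
HREF_RE = re.compile(r'<a href="([^"]+)"')


def file_urls(html, extension):
    """Return absolute links in html that point at files of the given type."""
    suffix = "." + extension.lower()
    urls = []
    for href in HREF_RE.findall(html):
        # Footer links such as '/' or '/intl/en/ads/' are site-relative: skip them.
        if not href.lower().startswith(("http://", "https://")):
            continue
        # Ignore any query string or fragment when checking the extension.
        path = href.split("?", 1)[0].split("#", 1)[0]
        if path.lower().endswith(suffix):
            urls.append(href)
    return urls
```

With a filter like this, the five footer links listed above would not be counted, because none of them starts with a scheme or ends in the requested extension.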

Original issue reported on code.google.com by lnxp...@gmail.com on 16 Jun 2011 at 3:00

Attachments: