Closed GoogleCodeExporter closed 8 years ago
Original comment by jbrunell...@gmail.com
on 1 May 2012 at 10:33
Breaking this into two parts:
The first is the subdirectory issue: only recovering from a particular
subdirectory and not creeping into parent directories.
The second is the ability to exclude patterns of resources, such as the
?page=... example in the feature request. I've started the first.
Original comment by jbrunell...@gmail.com
on 1 May 2012 at 10:36
Just to be clear:
1. *Crawling* parent directories may still be necessary to ensure complete
coverage of the subdirectory contents for reconstruction.
2. IMHO, excluding a pattern is the general capability, of which excluding
particular directories is a specific use case.
So I still think that the best form for these capabilities is something like
the following:
--donotcrawl PATTERN
--donotreconstruct PATTERN
...where PATTERN is a regex that is matched against the directory path starting
from the site root (that is, from the initial /, not including the domain).
Of the two, --donotreconstruct is more immediately useful, whereas --donotcrawl
is more of a performance optimization.
It might also be nice to be able to specify a file with several patterns, one
per line, rather than repeating a --donot switch several times on the command line.
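To make the proposal concrete, here is a minimal Python sketch of how the two switches could behave. It is not Warrick's code (Warrick is a Perl script), and --donotcrawl and --donotreconstruct are only proposed in this comment. The function names, the use of an unanchored search, the inclusion of the query string in the matched path, and the /forum/ pattern are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit


def load_patterns(lines):
    """Compile one regex per non-blank line of a pattern file."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


def site_path(url):
    """Path (plus query string, if any) from the initial '/', domain excluded."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return f"{path}?{parts.query}" if parts.query else path


def decide(url, donotcrawl, donotreconstruct):
    """Return (crawl, reconstruct) for one URL.

    Assumption: a URL that is never crawled is never reconstructed either.
    """
    target = site_path(url)
    crawl = not any(p.search(target) for p in donotcrawl)
    reconstruct = crawl and not any(p.search(target) for p in donotreconstruct)
    return crawl, reconstruct


# Hypothetical pattern lists, as they might be read from two files.
donotcrawl = load_patterns(["^/forum/"])
donotreconstruct = load_patterns(["^/plants/search\\?page="])

print(decide("http://myfolia.com/plants/search?page=2", donotcrawl, donotreconstruct))
```

With these lists, a search results page is still crawled (so its links can be followed) but is left out of the reconstruction, which is the distinction drawn in point 1 above.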
Original comment by mich...@urbsly.com
on 1 May 2012 at 1:09
Completed the -sd flag (indicating that Warrick should only recover content in
the specified subdirectory).
For example:
If you provide http://myfolia.com/plants/, Warrick will only recover resources
that come from the myfolia.com/plants subdirectory (no going up to the parents).
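The subdirectory test can be sketched as below. This is an illustrative Python version, not the Perl code behind -sd; the function name in_subdirectory and the exact comparison rules (case-insensitive host, prefix match on a slash-terminated path) are assumptions.

```python
from urllib.parse import urlsplit


def in_subdirectory(url, root):
    """True if url lives at or below the directory named by root."""
    u, r = urlsplit(url), urlsplit(root)
    if u.netloc.lower() != r.netloc.lower():
        return False
    # Treat the root as a directory so /plants does not match /plantsale.
    base = r.path if r.path.endswith("/") else r.path + "/"
    return u.path + "/" == base or u.path.startswith(base)


root = "http://myfolia.com/plants/"
print(in_subdirectory("http://myfolia.com/plants/3581", root))  # under the subdirectory
print(in_subdirectory("http://myfolia.com/about", root))        # a sibling of it
```

Forcing a trailing slash on the root before comparing prevents a sibling path that merely shares the prefix from being recovered.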
I also provided the -ex|--exclude <FILE> feature. This will allow you to
provide a file of regular expressions that you want to exclude from the
recovery. For example, my test file looks like this:
myfolia\.com\/plants\/3581.*
staticweb\.archive\.org\/.*
myfolia\.com\/plants\/search\?page=.*
Meaning:
1) I don't want anything that starts with myfolia.com/plants/3581
2) I don't want any stylings or JS from the archive
3) I don't want any search pages from the plants subdirectory
So I can call Warrick as:
perl warrick.pl -sd -ex /home/jbrunelle/regex.in http://myfolia.com/plants
And I won't get any of the URLs matching the regexes.
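To check an exclusion file before a long recovery, the patterns can be tried against a few sample URLs. The Python sketch below uses the three patterns from the test file above. The sample URLs are made up, and matching with an unanchored search anywhere in the full URL is an assumption about how Warrick applies the file, not something confirmed here.

```python
import re

# Contents of the exclusion file, one regular expression per line.
EXCLUDE_FILE_TEXT = r"""
myfolia\.com\/plants\/3581.*
staticweb\.archive\.org\/.*
myfolia\.com\/plants\/search\?page=.*
"""


def compile_excludes(text):
    """Compile every non-blank line as a regular expression."""
    return [re.compile(line.strip()) for line in text.splitlines() if line.strip()]


def keep(urls, excludes):
    """Return the URLs that match none of the exclusion patterns."""
    return [u for u in urls if not any(p.search(u) for p in excludes)]


# Hypothetical URLs used only to exercise the patterns.
sample = [
    "http://myfolia.com/plants/3581-basil",
    "http://myfolia.com/plants/12-tomato",
    "http://staticweb.archive.org/js/x.js",
    "http://myfolia.com/plants/search?page=3",
]
for url in keep(sample, compile_excludes(EXCLUDE_FILE_TEXT)):
    print(url)
```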
Original comment by jbrunell...@gmail.com
on 4 May 2012 at 2:51
Original issue reported on code.google.com by
mich...@urbsly.com
on 27 Apr 2012 at 3:03