Hi, thanks for an interesting and useful project; it has helped me make a start on reconstructing a site for a research project. I'm new to Scrapy, so this has also been an interesting way to start learning it.
I've been trying to make a snapshot of the whole site (or as much of it as the Wayback Machine contains) at a particular time, following the instruction here to set the from and to timestamps to the same value. However, when I do this, I only get a very incomplete snapshot of the site. If I widen the from/to range I get many more pages (but also a lot of snapshots I'm not interested in!).
I've looked at the logic in the `filter_snapshots` function and it all makes sense: essentially each snapshot before `time_range[0]` overwrites a holding variable `initial_snapshot`, and if the `filtered_snapshots` list is still empty when `time_range[1]` is reached, that held snapshot goes into `filtered_snapshots` as the only result.
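To make sure I'm reading it right, here is a rough paraphrase of that logic as I understand it. This is my reconstruction for discussion, not the project's actual code, and I'm assuming each snapshot is a `(timestamp, url)` tuple sorted by timestamp:

```python
def filter_snapshots(snapshots, time_range):
    """Paraphrase of how I read the filtering logic (assumptions noted above)."""
    filtered_snapshots = []
    initial_snapshot = None
    for snapshot in snapshots:
        timestamp = snapshot[0]
        if timestamp < time_range[0]:
            # every pre-range snapshot overwrites the holding variable,
            # so it ends up as the last snapshot before the range starts
            initial_snapshot = snapshot
        elif timestamp <= time_range[1]:
            filtered_snapshots.append(snapshot)
    # once time_range[1] is passed with nothing collected, fall back
    # to the held snapshot as the only result
    if not filtered_snapshots and initial_snapshot is not None:
        filtered_snapshots.append(initial_snapshot)
    return filtered_snapshots


# With from == to, only a snapshot whose timestamp matches exactly
# survives the range test; otherwise the fallback is all you get:
snaps = [(20100101, 'a'), (20150601, 'b'), (20190101, 'c')]
print(filter_snapshots(snaps, (20150101, 20150101)))  # [(20100101, 'a')]
```

If that reading is right, then with identical from/to values each page resolves to at most one snapshot, which seems consistent with what I'm seeing.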
Have you seen any problems like this before? Possibly related: even if I expand the time range, some pages don't get picked up, and I have to re-run with a more specific URL to retrieve some subfolders of the site. The behaviour is consistent between runs, so I don't think it's a timeout; the crawler just isn't reaching those pages for some reason. I've also tried setting `DEPTH_LIMIT` in `__main__.py`, and when I run the command line it echoes the setting back to me, but it doesn't seem to make any difference.
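In case I'm just doing it wrong, here is roughly what my edit looks like. The `CrawlerProcess` usage is my assumption about how `__main__.py` is structured, so the exact names may differ from the released version:

```python
from scrapy.crawler import CrawlerProcess

# Sketch of the edit I made: merge DEPTH_LIMIT into the settings dict
# passed to Scrapy. DEPTH_LIMIT is a standard Scrapy setting; everything
# else here is assumed from my local copy.
settings = {
    'DEPTH_LIMIT': 3,  # echoed back at startup, but seems to have no effect
}
process = CrawlerProcess(settings)
# process.crawl(...) and process.start() follow as in the original file
```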