projectdiscovery / katana

A next-generation crawling and spidering framework.
MIT License

Add `-no-clobber` to avoid overwriting files that already exist #749

Open xqbumu opened 8 months ago

xqbumu commented 8 months ago

Please describe your feature request:

Based on Katana's current logic, setting `-srd` saves the crawled content to a directory. However, when running it a second time, the content of that directory is cleared. I would like support for incremental crawling, which means:

  1. The output directory should not be cleared on a second execution;
  2. When a request has already been saved, skip crawling that link (see the sketch below this list).
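To make the idea concrete, here is a minimal Go sketch of a no-clobber write, independent of Katana's internals; the helper name and output layout are made up for illustration:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// writeNoClobber writes data to path only if the file does not already exist.
// It returns (false, nil) when the file is present, so the caller can skip
// re-crawling that URL. Hypothetical helper, not Katana's actual API.
func writeNoClobber(path string, data []byte) (written bool, err error) {
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return false, err
	}
	// O_EXCL makes the open fail atomically if the file already exists.
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o644)
	if errors.Is(err, os.ErrExist) {
		return false, nil // already saved on a previous run: no-clobber
	}
	if err != nil {
		return false, err
	}
	defer f.Close()
	_, err = f.Write(data)
	return err == nil, err
}

func main() {
	ok, err := writeNoClobber("output/example.com/index.txt", []byte("response body"))
	if err != nil {
		panic(err)
	}
	fmt.Println("written:", ok) // false on a second run
}
```

Using O_EXCL keeps the existence check and the create as one atomic operation, which also avoids races between concurrent crawler workers.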

Describe the use case of this feature:

This mirrors the `-nc` (`--no-clobber`) option in wget. Example use case:

wget -P ./output -nc -i urls.txt

Reference: https://github.com/projectdiscovery/katana/blob/main/pkg/output/output.go#L120
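If that is where the stored-response directory gets reset, a `-no-clobber` style flag could simply guard the cleanup step. A rough sketch; the Options struct and field names below are placeholders, not Katana's actual types:

```go
package output

import "os"

// Options is a placeholder for the writer configuration; the real field
// names in Katana may differ.
type Options struct {
	StoreResponseDir string
	NoClobber        bool // proposed -no-clobber behaviour
}

// prepareStoreDir wipes the previous run's directory unless no-clobber is
// requested, in which case existing files are kept for incremental runs.
func prepareStoreDir(opts *Options) error {
	if !opts.NoClobber {
		if err := os.RemoveAll(opts.StoreResponseDir); err != nil {
			return err
		}
	}
	return os.MkdirAll(opts.StoreResponseDir, 0o755)
}
```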

dogancanbakir commented 8 months ago

Thanks for opening this issue. I don't remember the specifics, but if -resume is specified, previously crawled content should not be removed. I'll look into this.

xqbumu commented 8 months ago

> Thanks for opening this issue. I don't remember the specifics, but if -resume is specified, previously crawled content should not be removed. I'll look into this.

Thank you for your reply. I have tried that switch (-resume), but it only resumes an interrupted crawl. When I modify my urls.txt file, Katana cannot perform incremental crawling on the new list.

dogancanbakir commented 8 months ago

@xqbumu, Makes sense!

@Mzack9999, Thoughts? - "Incremental Crawling" sounds good to me 💭

Mzack9999 commented 7 months ago

This is certainly an interesting feature, but I'm not sure it can be fully applied to the crawling process. While it's easy to mimic by not overwriting existing files, deciding when to skip or abandon part of the crawl needs more thought: it can't be based simply on whether the file exists, because the crawl would then end right at the start, since the root branch already exists from the previous run. Perhaps a better strategy could be adopted, for example deciding what to re-crawl based on link depth rather than file existence alone.
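As a toy illustration of that pitfall (none of this is Katana code, and the storage layout is invented): skipping only the network fetch, while still traversing from the stored copy, keeps the crawl alive even when the root response already exists on disk:

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"path/filepath"
)

// storePath maps a URL to the file where its response would be stored.
// Toy layout, not Katana's real naming scheme.
func storePath(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return filepath.Join("output", "invalid")
	}
	return filepath.Join("output", u.Host, filepath.FromSlash(u.Path)+".txt")
}

// shouldFetch decides whether to hit the network again. Skipping the fetch
// is safe; treating "file exists" as "abandon this branch" is not, because
// the root URL stored by the previous run would terminate the crawl at once.
func shouldFetch(raw string) bool {
	_, err := os.Stat(storePath(raw))
	return os.IsNotExist(err)
}

func main() {
	for _, u := range []string{"https://example.com/", "https://example.com/about"} {
		if shouldFetch(u) {
			fmt.Println("fetch + store:", u)
		} else {
			// Re-extract links from the stored copy instead of stopping,
			// so deeper, not-yet-seen pages are still discovered.
			fmt.Println("reuse stored copy, keep crawling from:", u)
		}
	}
}
```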

What do you think?

xqbumu commented 7 months ago

@Mzack9999

Thank you for your response. My initial expectation was simply to be able to continue crawling the remaining links after an interruption. The re-crawling strategy you describe here goes further and would strengthen the ability to resume crawling.

As for the re-crawling strategy, I feel that in addition to using link depth, it could also take the modification time of the stored files into account, since data freshness is easier to judge by time.
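A small sketch of that modification-time idea, assuming a user-supplied maximum age; the maxAge knob is hypothetical and not an existing Katana flag:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// isStale reports whether a stored response should be re-crawled because it
// is older than maxAge. Missing files always count as stale, so new URLs are
// still fetched. Sketch only; the maxAge option is an assumption.
func isStale(path string, maxAge time.Duration) bool {
	info, err := os.Stat(path)
	if err != nil {
		return true // never stored (or unreadable): crawl it
	}
	return time.Since(info.ModTime()) > maxAge
}

func main() {
	fmt.Println(isStale("output/example.com/index.txt", 24*time.Hour))
}
```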

The above is just my personal opinion, and I welcome your guidance.

dogancanbakir commented 6 months ago

@Mzack9999,

> My initial expectation was simply to be able to continue crawling the remaining links after an interruption.

Let's begin with this idea and then gradually develop it further. What do you say?