Headless mode missing rendered content if concurrency is more than 1

filipnyquist commented 1 year ago

katana version:

[INF] Current katana version v1.0.2 (latest)

Current Behavior:

While scanning sites in headless mode with the default concurrency of 10, some URLs are not picked up from pages because the page isn't in "view" for, for example, ReactDOM, which only seems to render content when the page is actually "viewed".

Expected Behavior:

In headless mode, each page should be fully viewed and its scripts allowed to run, so that all links are picked up before the crawler moves on to the next URL.

Steps To Reproduce:

Run katana -u https://ginandjuice.shop/catalog -hl (an example page that renders this content with ReactDOM; see the attached code under "Anything else").

Note the output:

[INF] Current katana version v1.0.2 (latest)
[INF] Started headless crawling for => https://ginandjuice.shop/catalog
https://ginandjuice.shop/resources/js/subscribeNow.js
https://ginandjuice.shop/resources/js/angular_1-7-7.js
https://ginandjuice.shop/resources/js/react.development.js
https://ginandjuice.shop/resources/css/labsScanme.css
https://ginandjuice.shop/resources/footer/js/scanme.js
https://ginandjuice.shop/resources/css/labsEcommerce.css
https://ginandjuice.shop/resources/labheader/css/scanMeHeader.css
https://ginandjuice.shop/resources/js/stockCheck.js
https://ginandjuice.shop/resources/js/xmlStockCheckPayload.js
https://ginandjuice.shop/resources/js/react-dom.development.js
https://ginandjuice.shop/catalog
https://ginandjuice.shop/catalog/product?productId=12
https://ginandjuice.shop/catalog/product?productId=10
https://ginandjuice.shop/catalog/product?productId=9
https://ginandjuice.shop/catalog/product?productId=11
https://ginandjuice.shop/catalog/product?productId=8
https://ginandjuice.shop/catalog/product?productId=7
https://ginandjuice.shop/catalog/product?productId=5
https://ginandjuice.shop/catalog/product?productId=6
https://ginandjuice.shop/catalog/product?productId=4
https://ginandjuice.shop/catalog/product?productId=2
https://ginandjuice.shop/catalog/cart
https://ginandjuice.shop/resources/js/searchLogger.js
https://ginandjuice.shop/catalog/product?productId=1
https://ginandjuice.shop/catalog/product?productId=3
https://ginandjuice.shop/my-account
https://ginandjuice.shop/resources/js/deparam.js
https://ginandjuice.shop/resources/css/labsBlog.css
https://ginandjuice.shop/about
https://ginandjuice.shop/blog
https://ginandjuice.shop/catalog
https://ginandjuice.shop/
https://ginandjuice.shop/blog/post?postId=1
https://ginandjuice.shop/blog/post?postId=5
https://ginandjuice.shop/blog/post?postId=2
https://ginandjuice.shop/blog/post?postId=6
https://ginandjuice.shop/blog/post?postId=4
https://ginandjuice.shop/blog/post?postId=3

Run the headless crawl again, but with a concurrency of one (making use of only one tab to crawl): katana -u https://ginandjuice.shop/catalog -hl -c 1

Note the output and the difference between the found items:

[INF] Current katana version v1.0.2 (latest)
[INF] Started headless crawling for => https://ginandjuice.shop/catalog
https://ginandjuice.shop/catalog
https://ginandjuice.shop/resources/footer/js/scanme.js
https://ginandjuice.shop/resources/js/subscribeNow.js
https://ginandjuice.shop/catalog?category=Juice
https://ginandjuice.shop/catalog?category=Gin
https://ginandjuice.shop/catalog?category=Books
https://ginandjuice.shop/catalog/
https://ginandjuice.shop/catalog?category=Accompaniments
https://ginandjuice.shop/catalog?category=Accessories
https://ginandjuice.shop/resources/js/angular_1-7-7.js
https://ginandjuice.shop/resources/js/react-dom.development.js
https://ginandjuice.shop/resources/js/react.development.js
https://ginandjuice.shop/resources/css/labsScanme.css
https://ginandjuice.shop/resources/css/labsEcommerce.css
https://ginandjuice.shop/resources/labheader/css/scanMeHeader.css
https://ginandjuice.shop/catalog/product?productId=12
https://ginandjuice.shop/catalog/product?productId=11
https://ginandjuice.shop/catalog/product?productId=10
https://ginandjuice.shop/resources/js/stockCheck.js
https://ginandjuice.shop/resources/js/xmlStockCheckPayload.js
https://ginandjuice.shop/catalog/product?productId=9
https://ginandjuice.shop/catalog/product?productId=8
https://ginandjuice.shop/catalog/product?productId=7
https://ginandjuice.shop/catalog/product?productId=6
https://ginandjuice.shop/catalog/product?productId=5
https://ginandjuice.shop/catalog/product?productId=4
https://ginandjuice.shop/catalog/product?productId=3
https://ginandjuice.shop/catalog/product?productId=2
https://ginandjuice.shop/catalog/product?productId=1
https://ginandjuice.shop/catalog/cart
https://ginandjuice.shop/my-account
https://ginandjuice.shop/about
https://ginandjuice.shop/blog
https://ginandjuice.shop/catalog
https://ginandjuice.shop/resources/js/searchLogger.js
https://ginandjuice.shop/resources/js/deparam.js
https://ginandjuice.shop/resources/css/labsBlog.css
https://ginandjuice.shop/blog/post?postId=1
https://ginandjuice.shop/blog/post?postId=5
https://ginandjuice.shop/blog/post?postId=2
https://ginandjuice.shop/blog/post?postId=6
https://ginandjuice.shop/blog/post?postId=4
https://ginandjuice.shop/blog/post?postId=3
https://ginandjuice.shop/
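
To make the difference between the two runs easier to see, here is a minimal sketch of a helper that compares two saved outputs. It is not part of katana; the function names `crawled_urls` and `missing_urls` are made up for illustration, and it assumes each run's output was captured as plain text with one URL per line. Applied to the two listings above, it reports the five `/catalog?category=...` links and `/catalog/` as missing from the default-concurrency run.

```python
def crawled_urls(output: str) -> set[str]:
    """Collect the unique URLs from a crawl's text output.

    Assumes one URL per line; log lines such as "[INF] ..." and blank
    lines are skipped because they do not start with a URL scheme.
    """
    return {
        line.strip()
        for line in output.splitlines()
        if line.strip().startswith(("http://", "https://"))
    }


def missing_urls(complete: str, partial: str) -> list[str]:
    """Return the URLs found in `complete` but absent from `partial`, sorted."""
    return sorted(crawled_urls(complete) - crawled_urls(partial))


if __name__ == "__main__":
    # Abbreviated excerpts of the two runs above (not the full listings).
    single_tab = """\
[INF] Started headless crawling for => https://ginandjuice.shop/catalog
https://ginandjuice.shop/catalog
https://ginandjuice.shop/catalog?category=Juice
https://ginandjuice.shop/catalog?category=Gin
https://ginandjuice.shop/catalog/cart
"""
    default_run = """\
[INF] Started headless crawling for => https://ginandjuice.shop/catalog
https://ginandjuice.shop/catalog
https://ginandjuice.shop/catalog/cart
"""
    for url in missing_urls(single_tab, default_run):
        print(url)
```

Comparing sets rather than raw lines ignores ordering and duplicates (for example, `/catalog` appears twice in each run), so only genuinely missing URLs are reported.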

Anything else:

The specific part of the code on the '/catalog' page that does not get picked up:

dogancanbakir commented 9 months ago

@filipnyquist, the issue you're experiencing is likely due to the server struggling with multiple concurrent requests. With concurrency set to 1, the server handles one request at a time, which is more manageable. To work around this, increase the timeout with the -timeout flag to give the server more time to respond to each request; -timeout 15 worked for me. The optimal configuration depends on the server you're crawling, so you may need to experiment with different settings to find what works best. Let us know if this works for you or if you have any further questions.

dogancanbakir commented 9 months ago

Closing this. Feel free to reopen if the issue persists.
