projectdiscovery / katana

A next-generation crawling and spidering framework.
MIT License
10.99k stars, 583 forks

Headless mode missing rendered content if concurrency is more than 1 #505

Closed filipnyquist closed 9 months ago

filipnyquist commented 1 year ago

katana version:

[INF] Current katana version v1.0.2 (latest)

Current Behavior:

When scanning sites in headless mode with the default concurrency of 10, some URLs are not picked up from pages because the page isn't in "view" for, for example, React DOM, which only seems to render the content when the page is "viewed".

Expected Behavior:

In headless mode, each page should be fully viewed so that all links are rendered and its functions are allowed to run before moving on to the next URL.

Steps To Reproduce:

  1. Run katana -u https://ginandjuice.shop/catalog -hl (an example page that renders this content with ReactDOM; see the attached code under "Anything else").
  2. Note the output:
    [INF] Current katana version v1.0.2 (latest)
    [INF] Started headless crawling for => https://ginandjuice.shop/catalog
    https://ginandjuice.shop/resources/js/subscribeNow.js
    https://ginandjuice.shop/resources/js/angular_1-7-7.js
    https://ginandjuice.shop/resources/js/react.development.js
    https://ginandjuice.shop/resources/css/labsScanme.css
    https://ginandjuice.shop/resources/footer/js/scanme.js
    https://ginandjuice.shop/resources/css/labsEcommerce.css
    https://ginandjuice.shop/resources/labheader/css/scanMeHeader.css
    https://ginandjuice.shop/resources/js/stockCheck.js
    https://ginandjuice.shop/resources/js/xmlStockCheckPayload.js
    https://ginandjuice.shop/resources/js/react-dom.development.js
    https://ginandjuice.shop/catalog
    https://ginandjuice.shop/catalog/product?productId=12
    https://ginandjuice.shop/catalog/product?productId=10
    https://ginandjuice.shop/catalog/product?productId=9
    https://ginandjuice.shop/catalog/product?productId=11
    https://ginandjuice.shop/catalog/product?productId=8
    https://ginandjuice.shop/catalog/product?productId=7
    https://ginandjuice.shop/catalog/product?productId=5
    https://ginandjuice.shop/catalog/product?productId=6
    https://ginandjuice.shop/catalog/product?productId=4
    https://ginandjuice.shop/catalog/product?productId=2
    https://ginandjuice.shop/catalog/cart
    https://ginandjuice.shop/resources/js/searchLogger.js
    https://ginandjuice.shop/catalog/product?productId=1
    https://ginandjuice.shop/catalog/product?productId=3
    https://ginandjuice.shop/my-account
    https://ginandjuice.shop/resources/js/deparam.js
    https://ginandjuice.shop/resources/css/labsBlog.css
    https://ginandjuice.shop/about
    https://ginandjuice.shop/blog
    https://ginandjuice.shop/catalog
    https://ginandjuice.shop/
    https://ginandjuice.shop/blog/post?postId=1
    https://ginandjuice.shop/blog/post?postId=5
    https://ginandjuice.shop/blog/post?postId=2
    https://ginandjuice.shop/blog/post?postId=6
    https://ginandjuice.shop/blog/post?postId=4
    https://ginandjuice.shop/blog/post?postId=3
  3. Run the headless crawl again, but with a concurrency of one (making use of only one tab to crawl): katana -u https://ginandjuice.shop/catalog -hl -c 1
  4. Note the output and how it differs from the first run: it additionally contains the five catalog?category=... links and https://ginandjuice.shop/catalog/:
    [INF] Current katana version v1.0.2 (latest)
    [INF] Started headless crawling for => https://ginandjuice.shop/catalog
    https://ginandjuice.shop/catalog
    https://ginandjuice.shop/resources/footer/js/scanme.js
    https://ginandjuice.shop/resources/js/subscribeNow.js
    https://ginandjuice.shop/catalog?category=Juice
    https://ginandjuice.shop/catalog?category=Gin
    https://ginandjuice.shop/catalog?category=Books
    https://ginandjuice.shop/catalog/
    https://ginandjuice.shop/catalog?category=Accompaniments
    https://ginandjuice.shop/catalog?category=Accessories
    https://ginandjuice.shop/resources/js/angular_1-7-7.js
    https://ginandjuice.shop/resources/js/react-dom.development.js
    https://ginandjuice.shop/resources/js/react.development.js
    https://ginandjuice.shop/resources/css/labsScanme.css
    https://ginandjuice.shop/resources/css/labsEcommerce.css
    https://ginandjuice.shop/resources/labheader/css/scanMeHeader.css
    https://ginandjuice.shop/catalog/product?productId=12
    https://ginandjuice.shop/catalog/product?productId=11
    https://ginandjuice.shop/catalog/product?productId=10
    https://ginandjuice.shop/resources/js/stockCheck.js
    https://ginandjuice.shop/resources/js/xmlStockCheckPayload.js
    https://ginandjuice.shop/catalog/product?productId=9
    https://ginandjuice.shop/catalog/product?productId=8
    https://ginandjuice.shop/catalog/product?productId=7
    https://ginandjuice.shop/catalog/product?productId=6
    https://ginandjuice.shop/catalog/product?productId=5
    https://ginandjuice.shop/catalog/product?productId=4
    https://ginandjuice.shop/catalog/product?productId=3
    https://ginandjuice.shop/catalog/product?productId=2
    https://ginandjuice.shop/catalog/product?productId=1
    https://ginandjuice.shop/catalog/cart
    https://ginandjuice.shop/my-account
    https://ginandjuice.shop/about
    https://ginandjuice.shop/blog
    https://ginandjuice.shop/catalog
    https://ginandjuice.shop/resources/js/searchLogger.js
    https://ginandjuice.shop/resources/js/deparam.js
    https://ginandjuice.shop/resources/css/labsBlog.css
    https://ginandjuice.shop/blog/post?postId=1
    https://ginandjuice.shop/blog/post?postId=5
    https://ginandjuice.shop/blog/post?postId=2
    https://ginandjuice.shop/blog/post?postId=6
    https://ginandjuice.shop/blog/post?postId=4
    https://ginandjuice.shop/blog/post?postId=3
    https://ginandjuice.shop/
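
The two runs above are easier to compare mechanically than by eye. Below is a minimal sketch, assuming each run's output has been saved to a file; the function name `urls_only_in_second` and the file names `c10.txt` and `c1.txt` are hypothetical, and using `-silent` to suppress the banner is an assumption about the installed katana version. For the outputs listed in this issue, the URLs found only in the `-c 1` run are the five `catalog?category=...` links and `https://ginandjuice.shop/catalog/`.

```shell
#!/usr/bin/env bash
# Print URLs that appear in the second crawl output but not in the first.
# Lines that are not URLs (for example katana's [INF] banner) are ignored.
urls_only_in_second() {
  first_sorted=$(mktemp)
  second_sorted=$(mktemp)
  # comm needs sorted input; sort -u also drops URLs reported twice.
  grep -E '^https?://' "$1" | sort -u > "$first_sorted"
  grep -E '^https?://' "$2" | sort -u > "$second_sorted"
  # -13 hides lines unique to the first file and lines common to both.
  comm -13 "$first_sorted" "$second_sorted"
  rm -f "$first_sorted" "$second_sorted"
}

# Example usage (file names are hypothetical; -silent is assumed available):
#   katana -u https://ginandjuice.shop/catalog -hl -silent > c10.txt
#   katana -u https://ginandjuice.shop/catalog -hl -c 1 -silent > c1.txt
#   urls_only_in_second c10.txt c1.txt
```

Swapping the two arguments lists anything the default-concurrency run found that the single-tab run did not.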

Anything else:

The specific part of the code on the '/catalog' page that does not get picked up: (image)

dogancanbakir commented 9 months ago

@filipnyquist, the issue you're experiencing is likely due to the server struggling with multiple concurrent requests. When you set concurrency to 1, the server handles one request at a time, which is more manageable. To resolve this, you can increase the timeout value with the -timeout flag to give the server more time to respond to each request; -timeout 15 worked for me. The optimal configuration depends on the specific server you're crawling, so you might need to experiment with different settings to find what works best. Let us know if this works for you or if you have any further questions.

dogancanbakir commented 9 months ago

Closing this. Feel free to reopen if the issue persists.