gildas-lormeau / single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)
GNU Affero General Public License v3.0

Unexpected browser behaviour when both the urls-file and max-parallel-workers flags are used #101

Open takingurstuff opened 3 days ago

takingurstuff commented 3 days ago

I was recently crawling a simple site with this software when it unexpectedly halted on the last few pages. The failure is intermittent: sometimes everything downloads properly, other times it halts completely. I was not using the built binary because I was making modifications.

[Screenshot 2024-07-02 at 3 21 57 PM] [Screenshot 2024-07-02 at 3 42 47 PM]

To reproduce, first create a URLs file with an odd number of links in it and place the file in the repo:

https://www.example.com
https://www.example2.com
https://www.example3.com

then run the command in the cloned repo:

./single-file --urls-file=./urls.txt --max-parallel-workers=[any even number here] --browser-headless=false

This issue does not occur in the compiled binaries. The only modification made to the source code before the issue appeared was the addition of smart scrolling:

async function scrollAndClick (
  Page,
  Runtime,
  primaryTargetSelector,
  secondaryTargetSelector,
  clickSelector,
  scrollPause = 2000,
  maxAttempts = 10000000
) {
  let attempts = 0

  while (attempts < maxAttempts) {
    attempts++

    try {
      // Check if the primary target element is present
      const primaryTargetResult = await Runtime.evaluate({
        expression: `!!document.querySelector('${primaryTargetSelector}')`
      })

      if (primaryTargetResult.result.value) {
        console.log('Primary target element found!')
        return { found: true, target: 'primary' }
      }

      // Check if the secondary target element is present
      if (secondaryTargetSelector) {
        const secondaryTargetResult = await Runtime.evaluate({
          expression: `!!document.querySelector('${secondaryTargetSelector}')`
        })

        if (secondaryTargetResult.result.value) {
          console.log('Secondary target element found!')
          return { found: true, target: 'secondary' }
        }
      }

      // Check if the click element is present
      const clickResult = await Runtime.evaluate({
        expression: `!!document.querySelector('${clickSelector}')`
      })

      if (clickResult.result.value) {
        console.log('Click element found, clicking it!')
        await Runtime.evaluate({
          expression: `document.querySelector('${clickSelector}').click()`
        })
        // Pause after clicking
        await new Promise(resolve => setTimeout(resolve, scrollPause))
      }

      // Scroll down
      await Runtime.evaluate({
        expression: 'window.scrollTo(0, document.body.scrollHeight)'
      })

      // Pause after scrolling
      await new Promise(resolve => setTimeout(resolve, scrollPause))
    } catch (error) {
      console.error('Error during scroll and click:', error)
      // Wait a bit before retrying

      await new Promise(resolve => setTimeout(resolve, 1000))
    }
  }

  console.log('Max attempts reached without finding either target element')
  return { found: false }
}

This function is only called after the page has fully loaded.
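
One detail in the function above may be worth guarding against: each selector is interpolated into a single-quoted string, so a selector that itself contains a single quote (for example an attribute selector) would produce an invalid expression. A minimal sketch of one way to avoid this, assuming the same Runtime.evaluate({ expression }) call shape as above; the helper names existsExpression and clickExpression are hypothetical and not part of single-file-cli:

```javascript
// Hypothetical helpers (not part of single-file-cli).
// JSON.stringify turns the selector into a valid, fully escaped
// JavaScript string literal, so quotes inside it cannot break the expression.
function existsExpression (selector) {
  return `!!document.querySelector(${JSON.stringify(selector)})`
}

function clickExpression (selector) {
  return `document.querySelector(${JSON.stringify(selector)}).click()`
}

// Example: a selector containing single quotes stays intact.
console.log(existsExpression("a[title='next']"))
```

These strings could then be passed as the expression value in the existing Runtime.evaluate calls in place of the hand-written template literals.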

gildas-lormeau commented 2 days ago

Before digging into this problem, I have 3 questions:

takingurstuff commented 2 days ago

The code was not shown in the browser because I called it in the CDP client script, not the API script. I edited the options inside the API script as well. I could provide the edited CDP client code if that would help. The screenshots show what opened by itself during the last few pages. I will try to add the context ID to my function, but I am guessing that a helper function that controls scrolling and clicking will cause the browser to open a specific local URL. Thank you for the suggestion.

gildas-lormeau commented 2 days ago

Don't hesitate to share your code (ideally a repository that I could clone via git) if possible. This would be the easiest way for me to help you debug this problem.

takingurstuff commented 2 days ago

It might also help to mention that I am making these edits on the latest version rather than the previous Puppeteer-based version.

takingurstuff commented 2 days ago

I created a fork with the edited files: https://github.com/takingurstuff/single-file-cli

gildas-lormeau commented 20 hours ago

Thank you, I've cloned the repository on my machine and formatted the code to compare it with mine.

Before debugging it, could you confirm that the presence of const LOGIN_PAGE_URL = "https://bbs.quantclass.cn"; and await CDP.createTarget(LOGIN_PAGE_URL); is intentional? I was not expecting to find this code and I'm not sure it's working as intended.

takingurstuff commented 9 hours ago

It is intentional. The code is there so I can open a new page that is not used for downloading at all; all the downloading happens on new empty targets. It is working, and I have not done any runtime evaluation on that page.

takingurstuff commented 9 hours ago

I just tested with contextId, and it opened the script straight away instead of opening it at the end of the download sequence. The scroll-and-click function:

async function scrollAndClick (
  Page,
  Runtime,
  primaryTargetSelector,
  secondaryTargetSelector,
  clickSelector,
  buttonSelector,
  scrollPause = 2000,
  maxAttempts = 10000000,
  contextId
) {
  let attempts = 0

  while (attempts < maxAttempts) {
    attempts++

    try {
      const clickResult2 = await Runtime.evaluate({
        expression: `!!document.querySelector('${buttonSelector}')`,
        contextId
      })

      if (clickResult2.result.value) {
        console.log('preparing to click')
        click(buttonSelector, contextId)
      }
      // Check if the primary target element is present
      const primaryTargetResult = await Runtime.evaluate({
        expression: `!!document.querySelector('${primaryTargetSelector}')`,
        contextId
      })

      if (primaryTargetResult.result.value) {
        console.log('Primary target element found!')
        return { found: true, target: 'primary' }
      }

      // Check if the secondary target element is present
      if (secondaryTargetSelector) {
        const secondaryTargetResult = await Runtime.evaluate({
          expression: `!!document.querySelector('${secondaryTargetSelector}')`,
          contextId
        })

        if (secondaryTargetResult.result.value) {
          console.log('Secondary target element found!')
          return { found: true, target: 'secondary' }
        }
      }

      // Check if the click element is present
      const clickResult = await Runtime.evaluate({
        expression: `!!document.querySelector('${clickSelector}')`,
        contextId
      })

      if (clickResult.result.value) {
        console.log('Click element found, clicking it!')
        await Runtime.evaluate({
          expression: `document.querySelector('${clickSelector}').click()`,
          contextId
        })
        // Pause after clicking
        await new Promise(resolve => setTimeout(resolve, scrollPause))
      }

      // Scroll down
      await Runtime.evaluate({
        expression: 'window.scrollTo(0, document.body.scrollHeight)',
        contextId
      })

      // Pause after scrolling
      await new Promise(resolve => setTimeout(resolve, scrollPause))
    } catch (error) {
      console.error('Error during scroll and click:', error)
      // Wait a bit before retrying
      await new Promise(resolve => setTimeout(resolve, 1000))
    }
  }

  console.log('Max attempts reached without finding either target element')
  return { found: false }
}

async function click (button, contextId) {
  try {
    await Runtime.evaluate({
      expression: `!!document.querySelector('${button}').click()`,
      contextId
    })
    console.log('there is a button to click')
  } catch (error) {
    console.log('there are no buttons to click')
  }
}

And the call to the function:

    if (options.scrollAndClickTarget && options.scrollAndClickButton) {
      await scrollAndClick(
        Page,
        Runtime,
        options.scrollAndClickTarget,
        options.secondaryScrollAndClickTarget,
        options.scrollAndClickButton,
        options.nonScrollButtonSelector,
        options.scrollPause || 2000,
        options.scrollMaxAttempts || 100,
        contextId
      )
    }
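
A possible pitfall in the click helper above: it uses Runtime without receiving it as a parameter, and it is called without await. If Runtime is not defined in the enclosing scope, the call would throw a ReferenceError that the catch block reports as "there are no buttons to click". A hedged sketch of an alternative, assuming Runtime.evaluate resolves to an object with result and (on page-side errors) exceptionDetails, as in the Chrome DevTools Protocol; the name clickIfPresent is hypothetical:

```javascript
// Hypothetical replacement for the click helper (not part of single-file-cli).
// Runtime is passed in explicitly and is assumed to expose
// evaluate({ expression, contextId }) as in the snippets above.
async function clickIfPresent (Runtime, selector, contextId) {
  // The page-side function returns true only if an element was found and clicked.
  const expression = `(() => {
    const element = document.querySelector(${JSON.stringify(selector)})
    if (!element) return false
    element.click()
    return true
  })()`
  const { result, exceptionDetails } = await Runtime.evaluate({
    expression,
    contextId
  })
  // Assumption: errors thrown inside the page are reported through
  // exceptionDetails rather than through a rejected promise.
  if (exceptionDetails) {
    return false
  }
  return result.value === true
}
```

Inside scrollAndClick it would be awaited, for example: if (await clickIfPresent(Runtime, buttonSelector, contextId)) { console.log('clicked the button') }. Awaiting the call keeps the click ordered before the checks that follow it.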

takingurstuff commented 8 hours ago

All the changes have been committed to the fork.