seleniumbase / SeleniumBase

📊 Python's all-in-one framework for web crawling, scraping, testing, and reporting. Supports pytest. UC Mode provides stealth. Includes many tools.
https://seleniumbase.io
MIT License

PDF loading in tab problems #718

Closed: bschollnick closed this issue 3 years ago

bschollnick commented 3 years ago

Sorry, I've been trying to figure this out over the last two days, and I'm hitting a brick wall.

Scenario:

I can't look for a visible element or text, since there is no web page. Am I missing something?

I need to take a screenshot of the loaded page. It usually loads within a few seconds, but I can't guarantee that.

Any suggestions?

mdmintz commented 3 years ago

Hi @bschollnick, There are examples that use SeleniumBase PDF methods:

The methods used are:

    def get_pdf_text(self, pdf, page=None, maxpages=None,
                     password=None, codec='utf-8', wrap=False, nav=False,
                     override=False):
        """ Gets text from a PDF file.
            PDF can be either a URL or a file path on the local file system.
            @Params
            pdf - The URL or file path of the PDF file.
            page - The page number (or a list of page numbers) of the PDF.
                    If a page number is provided, looks only at that page.
                        (1 is the first page, 2 is the second page, etc.)
                    If no page number is provided, returns all PDF text.
            maxpages - Instead of providing a page number, you can provide
                       the number of pages to use from the beginning.
            password - If the PDF is password-protected, enter it here.
            codec - The compression format for character encoding.
                    (The default codec used by this method is 'utf-8'.)
            wrap - Replaces ' \n' with ' ' so that individual sentences
                   from a PDF don't get broken up into separate lines when
                   getting converted into text format.
            nav - If PDF is a URL, navigates to the URL in the browser first.
                  (Not needed because the PDF will be downloaded anyway.)
            override - If the PDF file to be downloaded already exists in the
                       downloaded_files/ folder, that PDF will be used
                       instead of downloading it again. """

AND

    def assert_pdf_text(self, pdf, text, page=None, maxpages=None,
                        password=None, codec='utf-8', wrap=True, nav=False,
                        override=False):
        """ Asserts text in a PDF file.
            PDF can be either a URL or a file path on the local file system.
            @Params
            pdf - The URL or file path of the PDF file.
            text - The expected text to verify in the PDF.
            page - The page number of the PDF to use (optional).
                    If a page number is provided, looks only at that page.
                        (1 is the first page, 2 is the second page, etc.)
                    If no page number is provided, looks at all the pages.
            maxpages - Instead of providing a page number, you can provide
                       the number of pages to use from the beginning.
            password - If the PDF is password-protected, enter it here.
            codec - The compression format for character encoding.
                    (The default codec used by this method is 'utf-8'.)
            wrap - Replaces ' \n' with ' ' so that individual sentences
                   from a PDF don't get broken up into separate lines when
                   getting converted into text format.
            nav - If PDF is a URL, navigates to the URL in the browser first.
                  (Not needed because the PDF will be downloaded anyway.)
            override - If the PDF file to be downloaded already exists in the
                       downloaded_files/ folder, that PDF will be used
                       instead of downloading it again. """
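
As a rough illustration of what the `wrap` parameter is described as doing in the docstrings above, here is a minimal sketch in plain Python. The helper name `unwrap_lines` and the sample string are made up for this example; this is not SeleniumBase's actual implementation, only the replacement the docstring describes.

```python
def unwrap_lines(text):
    """Replace ' \\n' with ' ' so that sentences split across
    lines are joined back together (hypothetical helper)."""
    return text.replace(" \n", " ")


# Sample text, invented for illustration: the first line break falls
# mid-sentence (after a trailing space), the second ends a sentence.
raw = "This sentence was split \nacross two lines.\nNext sentence."
print(unwrap_lines(raw))
```

Only line breaks preceded by a space are joined, so a break that directly follows a period is left alone.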

bschollnick commented 3 years ago

Sorry, I wasn't clear...

The main issue is that maybe a quarter of the time, the screenshot goes off too early.

Is there a way to say wait until this javascript has completed, before returning?

mdmintz commented 3 years ago

@bschollnick You might be able to use self.wait_for_ready_state_complete() to wait for the readyState of the page to be complete, but the PDF loading could theoretically be outside that context. If you knew exactly what JS you needed to run to detect completion of the PDF loading, you might be able to use self.execute_script(JAVASCRIPT), but that's probably trickier than using self.wait_for_ready_state_complete(). There's also the primitive self.sleep(SECONDS) if absolutely necessary.
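
One way to avoid a fixed sleep is to poll a condition until it holds or a timeout expires. The sketch below is a generic helper in plain Python; `wait_until` and its parameters are hypothetical names, not part of SeleniumBase. The commented usage assumes that `document.readyState` reflects the PDF tab's load state, which, as noted above, may not be the case.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Poll `condition` until it returns a truthy value or `timeout`
    seconds have passed. Returns True on success, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical use inside a SeleniumBase test method (not run here):
#
#     loaded = wait_until(
#         lambda: self.execute_script("return document.readyState") == "complete",
#         timeout=15,
#     )
#     if loaded:
#         self.save_screenshot("pdf_tab.png")
```

The condition is checked at least once, so a timeout of zero still performs a single check rather than failing outright.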

bschollnick commented 3 years ago

I handled it differently. I think it was purely a timing issue, so I ended up using "is visible" checks on the previous page to try to ensure that we were in a stable state before moving on.