rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.
https://rajatomar788.github.io/pywebcopy/

Login before save #124

Closed tadam98s closed 4 months ago

tadam98s commented 4 months ago

Hi,

I have a website running in Docker that is accessed locally: http://localhost:9000/dashboard?id=face-animation

I need to login with two fields: login and password.

config = get_config('http://localhost:9000/dashboard?id=face-animation')
wp = config.create_page()
wp.get(config['project_url'])
form = wp.get_forms()[0]
form.inputs['login'].value = 'my_user' # etc
form.inputs['password'].value = 'my_password' # etc
wp.submit_form(form)
wp.get_links()

When I run it I get on wp.get(config['project_url']):

Exception has occurred: KeyError
Exception has occurred: UrlDisallowed
Access to ['http://localhost:9000/dashboard?id=face-animation'] disallowed by the Session rules.
  File "D:\download\tests\scripts\clone_test.py", line 10, in <module>
    wp.get(config['project_url'])
pywebcopy.session.UrlDisallowed: Access to ['http://localhost:9000/dashboard?id=face-animation'] disallowed by the Session rules.

How do I write the code to save this website? When I log in to the site it creates two cookies, JWT-SESSION and XSRF-TOKEN, which I need to carry over into pywebcopy.

rajatomar788 commented 4 months ago

You need to pass bypass_robots=True to the get_config function. The error indicates that your local website has a robots.txt rule that prohibits bot or script access. You can bypass it with that argument.
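To illustrate what "disallowed by a robots.txt rule" means, here is a minimal sketch using only the standard library's urllib.robotparser. The robots.txt content is hypothetical (the file actually served by the local site is not shown in this thread), and this is not pywebcopy's own code, only the same kind of check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the real one on localhost:9000 is not shown in the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://localhost:9000/dashboard?id=face-animation"
    print("allowed:", is_allowed(ROBOTS_TXT, url))
```

With a rule like this in place every path is refused, which is why the thread's suggestion is to call get_config(url, bypass_robots=True) so the check is skipped.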

tadam98s commented 4 months ago

config = get_config(url, bypass_robots=True)
wp = config.create_page()
wp.get(config['project_url'])
form = wp.get_forms()[0]

Exception has occurred: IndexError
list index out of range
  File "D:\download\test\scripts\clone_test.py", line 21, in <module>
    form = wp.get_forms()[0]
IndexError: list index out of range

rajatomar788 commented 4 months ago

You need to verify whether there are any forms before applying the [0] index. Common sense yaar. First check the url property of the wp object to see whether there were any redirects. Then check the available forms using the get_forms method.
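The check described above could look roughly like the sketch below. It relies only on the url property and get_forms method named in this thread. The helper name first_form_or_none and the FakePage class are hypothetical, added so the sketch runs without a live site.

```python
def first_form_or_none(page):
    """Return the first form on the page, or None when the page has no forms."""
    forms = page.get_forms()
    if not forms:
        # Report where we actually ended up; a redirect may explain the missing form.
        print("No forms found at", page.url)
        return None
    return forms[0]


class FakePage:
    """Hypothetical stand-in for the wp object, used only to make this runnable."""

    def __init__(self, url, forms):
        self.url = url
        self._forms = forms

    def get_forms(self):
        return list(self._forms)


if __name__ == "__main__":
    page = FakePage("http://localhost:9000", [])
    form = first_form_or_none(page)
    print("form:", form)
```

In the real script, wp would be passed in place of the FakePage instance, and the login fields would only be filled in when the result is not None.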

tadam98s commented 4 months ago

print(wp.url)
http://localhost:9000

tadam98s commented 4 months ago

When I open the site manually I get: [image]

tadam98s commented 4 months ago

If I log in manually, can I copy the cookies and pass them to pywebcopy? Apparently there is a JavaScript file that shows the login/password form. Is there a way to provide it with the answers programmatically, or to continue after a manual login?

It is only showing the spinner, and the JavaScript code /js/outBIMYN2XL.js that opens the login/password form does not appear to be executed.

<!DOCTYPE html>
<html lang="en">

<head>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8" charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <link rel="apple-touch-icon" href="/apple-touch-icon.png">
    <link rel="apple-touch-icon" sizes="57x57" href="/apple-touch-icon-57x57.png">
    <link rel="apple-touch-icon" sizes="60x60" href="/apple-touch-icon-60x60.png">
    <link rel="apple-touch-icon" sizes="72x72" href="/apple-touch-icon-72x72.png">
    <link rel="apple-touch-icon" sizes="76x76" href="/apple-touch-icon-76x76.png">
    <link rel="apple-touch-icon" sizes="114x114" href="/apple-touch-icon-114x114.png">
    <link rel="apple-touch-icon" sizes="120x120" href="/apple-touch-icon-120x120.png">
    <link rel="apple-touch-icon" sizes="144x144" href="/apple-touch-icon-144x144.png">
    <link rel="apple-touch-icon" sizes="152x152" href="/apple-touch-icon-152x152.png">
    <link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon-180x180.png">
    <link rel="icon" type="image/x-icon" href="/favicon.ico">
    <meta name="application-name" content="test" />
    <meta name="msapplication-TileColor" content="#FFFFFF" />
    <meta name="msapplication-TileImage" content="/mstile-512x512.png" />
    <title>test</title>

    <link rel="stylesheet" href="/js/outWHCP76XN.css" />
</head>

<body>
    <div id="content" data-base-url="" data-server-status="UP" data-instance="test" data-official="true">
        <div class="global-loading">
            <i class="spinner global-loading-spinner"></i>
            <span aria-live="polite" class="global-loading-text">Loading...</span>
        </div>
    </div>

    <script type="module" src="/js/outBIMYN2XL.js"></script>
</body>

</html>
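
The markup above explains the empty get_forms() result: the served HTML contains no form element, only a script that would build one in a browser. A minimal standard-library sketch that checks this follows. The ShellInspector class and inspect_html helper are hypothetical names, and SAMPLE is a shortened copy of the body above.

```python
from html.parser import HTMLParser


class ShellInspector(HTMLParser):
    """Counts <form> tags and collects the src of every <script> tag."""

    def __init__(self):
        super().__init__()
        self.form_count = 0
        self.script_sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_count += 1
        elif tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.script_sources.append(src)


def inspect_html(html):
    """Return (number of forms, list of script sources) for the given HTML text."""
    inspector = ShellInspector()
    inspector.feed(html)
    inspector.close()
    return inspector.form_count, inspector.script_sources


# Shortened copy of the page body shown in the thread.
SAMPLE = """<body>
<div id="content"><div class="global-loading"><span>Loading...</span></div></div>
<script type="module" src="/js/outBIMYN2XL.js"></script>
</body>"""

if __name__ == "__main__":
    forms, scripts = inspect_html(SAMPLE)
    print("forms:", forms, "scripts:", scripts)
```

Zero forms plus a module script is the typical shape of a page whose login form is rendered client-side, so a tool that does not execute JavaScript never sees the form.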

rajatomar788 commented 4 months ago

Just login with your browser and then copy the cookies to the pywebcopy session headers.
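When cookies are copied from the browser's developer tools they usually arrive as one "name=value; name=value" string. A minimal sketch for splitting that string into a dict is below. The function name parse_cookie_header and the sample values are hypothetical placeholders, not real tokens.

```python
def parse_cookie_header(raw):
    """Split a browser-style Cookie header value into a {name: value} dict."""
    cookies = {}
    for part in raw.split(";"):
        # Split on the first "=" only, since values may themselves contain "=".
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


if __name__ == "__main__":
    # Placeholder values; paste the string copied from the browser instead.
    copied = "JWT-SESSION=aaa.bbb.ccc; XSRF-TOKEN=t0ken"
    print(parse_cookie_header(copied))
```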

tadam98s commented 4 months ago

Kindly show me where these session headers are.

rajatomar788 commented 4 months ago

You can access the session using the .session attribute of the wp object that you created. Then use .headers attribute of the session to set the headers including cookies. The session object is a requests library session. You can read up online how to manage a requests.Session object.
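A minimal sketch of that step, assuming only what is stated above: the object exposes a dict-like headers attribute, as a requests.Session does. The helpers cookie_header and apply_cookies and the FakeSession class are hypothetical. In the real script wp.session would be passed instead of the stand-in.

```python
def cookie_header(cookies):
    """Join a {name: value} dict into a single Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def apply_cookies(session, cookies):
    """Set the Cookie header on any object with a dict-like .headers attribute."""
    session.headers["Cookie"] = cookie_header(cookies)
    return session


class FakeSession:
    """Hypothetical stand-in for wp.session so the sketch runs without pywebcopy."""

    def __init__(self):
        self.headers = {}


if __name__ == "__main__":
    # Placeholder values; use the cookies copied from the logged-in browser.
    my_cookies = {"JWT-SESSION": "some value", "XSRF-TOKEN": "some value"}
    session = apply_cookies(FakeSession(), my_cookies)
    print(session.headers["Cookie"])
    # In the real script this would be: apply_cookies(wp.session, my_cookies)
```

Whether the server accepts cookies replayed this way depends on how it validates them, which the thread leaves open.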

tadam98s commented 4 months ago

URLmain = "http://localhost:9000/"
session = requests.session()
my_cookies = {'JWT-SESSION': 'some value', 'XSRF-TOKEN': 'some value'}
r = requests.post(URLmain, cookies=my_cookies)

This sets the cookies correctly. But the next save_website call is not tied to this session, as it takes the url as a parameter, not the session. How do I connect this session to the following save_website call?

rajatomar788 commented 4 months ago

Use the wp object style approach as you did at the start. Use the wp.get method to open pages. The session then remains the same for all requests.

tadam98s commented 4 months ago

I have built a url that sets cookies:

url_cookies = f"{url}/cookies/set?JWT-SESSiON={my_cookies['JWT-SESSiON']}?XSRF-TOKEN={my_cookies['XSRF-TOKEN']}"

Then I used get_config to start a session:

config = get_config(url, project_folder, project_name=projectName, bypass_robots=True, debug=True, delay=None, threaded=False)

Then I was not sure how to use url_cookies to set the cookies. I tried:

crawler = config.create_crawler()
crawler.get(url_cookies)
crawler.get(url)
crawler.save_complete(pop=True)

But this did not set the cookies. I am not sure how to use the wp object: if I already have the cookies, I am not sure I need to open a form. Please advise.
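One thing worth noting about the url_cookies string in this attempt: it joins the two parameters with a second "?" instead of "&", and spells the key "JWT-SESSiON" with a lowercase i. A minimal sketch of building the query string with urllib.parse.urlencode follows. The helper name build_cookie_set_url is hypothetical, and nothing in the thread confirms that the local site serves a /cookies/set endpoint at all, so this only shows the URL construction.

```python
from urllib.parse import urlencode


def build_cookie_set_url(base_url, cookies):
    """Build <base>/cookies/set?name=value&... with properly escaped parameters."""
    # The /cookies/set path is taken from the attempt above; it is an assumption
    # that the target server implements it.
    return f"{base_url.rstrip('/')}/cookies/set?{urlencode(cookies)}"


if __name__ == "__main__":
    my_cookies = {"JWT-SESSION": "some value", "XSRF-TOKEN": "some value"}
    print(build_cookie_set_url("http://localhost:9000/", my_cookies))
```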

tadam98s commented 4 months ago

Anyway, copying the cookies may not work, as they are JWTs and could be tied to some seed in each instance. I may be back to the original question: how to log in.

rajatomar788 commented 4 months ago

You may have to proceed by trial and error. pywebcopy has no JavaScript support, so each JavaScript-based site requires a different approach to get around that. At the moment I can only point you to the requests.Session usage and documentation, because cookies and auth are handled by that quite capable library.