j0k3r / graby

Graby helps you extract article content from web pages
MIT License

Support for websites with login page in two steps #326

Open Merinorus opened 1 year ago

Merinorus commented 1 year ago

Hello,

I'm trying to make a site config for the website https://elucid.media/. So far I have written this custom config elucid.media.txt:

title: //h1[contains(@class, 'single-title')]
body: //section[contains(@id, 'article-content')]
author: substring-after(//a[contains(@class, 'article-meta_author')], 'Par ')

# wallabag-specific login directives (not supported in FTR)
requires_login: yes
login_uri: https://compte.elucid.media/elucid/connexion/password
login_username_field: email
login_password_field: password

not_logged_in_xpath: //a[contains(@class, 'call-to-subscribe-content_button')]|//button[text()[contains(.,"S'identifier")]]|//a[contains(@href, 'compte.elucid.media/elucid/connexion')]

test_url: https://elucid.media/environnement/transition-energetique-la-chimere-de-l-hydrogene-vert/

The problem with this website is that, unlike every other website I have seen, its login page works in two steps:

  1. https://compte.elucid.media/elucid/connexion: asks for email address only, then click on "Continue",
  2. https://compte.elucid.media/elucid/connexion/password: asks for password only (email field is already filled and greyed out), then click on "Login".

I put the second URL directly in the config, since it automatically redirects to the first one when no email address has been entered yet. Graby could then handle the login with a sort of loop:

  1. Detect that the user is not logged in
  2. Go to the login URL; the website redirects to URL 1, where Graby enters the email address (the password field is absent)
  3. Redirect to the article
  4. Detect again that the user is not logged in
  5. Go to the login URL again; this time the website doesn't redirect, since the email address is already filled in (probably stored in the user session, I didn't check), so Graby can enter the password
  6. The user should now be logged in.
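The loop described above can be sketched as a small simulation. This is only an illustration under stated assumptions, not Graby code: `FakeTwoStepSite` and `login_with_retries` are hypothetical names, the fake site stands in for the real one, and its `logged_in` flag stands in for the `not_logged_in_xpath` check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeTwoStepSite:
    """Hypothetical stand-in for a site that asks for email, then password."""
    email: str
    password: str
    session_email: Optional[str] = None  # email remembered between visits
    logged_in: bool = False

    def login_form_fields(self):
        # With no remembered email the site shows step 1 (email only);
        # otherwise it shows step 2 (password only).
        return {"email"} if self.session_email is None else {"password"}

    def submit(self, values):
        if "email" in values and self.session_email is None:
            self.session_email = values["email"]
        elif "password" in values and self.session_email == self.email:
            self.logged_in = values["password"] == self.password


def login_with_retries(site, credentials, max_rounds=3):
    """Fill in whatever fields the login page offers until logged in.

    Returns the number of login rounds used; raises if max_rounds is reached.
    """
    rounds = 0
    while not site.logged_in:  # stands in for not_logged_in_xpath matching
        if rounds == max_rounds:
            raise RuntimeError("still not logged in after %d rounds" % max_rounds)
        offered = site.login_form_fields()
        site.submit({name: credentials[name] for name in offered})
        rounds += 1
    return rounds


site = FakeTwoStepSite(email="user@example.org", password="secret")
creds = {"email": "user@example.org", "password": "secret"}
rounds_used = login_with_retries(site, creds)
print(rounds_used)
```

The cap on rounds matters: with wrong credentials the "not logged in" check would keep matching, so an unbounded loop would never finish.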

Unfortunately, I guess this behavior requires changes to the code, and the site config file alone is not enough to handle this particular case. I was able to access Graby's logs in Wallabag, but they contain nothing related to the login.

If by any chance someone knows of other websites with two-step login pages that work with Graby, I would be happy to have a look. Otherwise, if it's not possible, I might try to help, but my PHP skills are very limited.

Thank you anyway for this nice project!

j0k3r commented 1 year ago

More and more websites are doing this. That's a good point, and it's not supported at the moment.

briankaemingk commented 9 months ago

Yes, indeed. I landed here because https://www.wsj.com/ now has this two-step behavior as well.

Thank you for bringing this up, @Merinorus, and thank you for your work, @j0k3r.