haroldtreen / epub-press-clients

📦 Clients for building books with EpubPress.
https://epub.press
GNU General Public License v3.0

How Can I Change The User-Agent of EpubPressJs? #74

Open marcobosch1 opened 3 years ago

marcobosch1 commented 3 years ago

I'm trying to use EpubPressJs (talking to a local server hosted on my machine) to download articles, but I'm getting blocked right off the bat on some websites with super aggressive firewalls.

The same thing was happening when I was using Beautiful Soup (Python scraper), and just changing the user-agent to

Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36

solved the problem. How can I do the same on EpubPress? Please bear in mind that I'm not a programmer, so if you could ELI5, it'd be very very much appreciated!

P.S. @haroldtreen thank you for building such an awesome application! I can't put into words how awesome it is to read properly formatted articles offline :')

haroldtreen commented 3 years ago

Hey @marcobosch1 - so happy to hear you're enjoying EpubPress!

I think this should be easy to fix if you're running things locally. The requests are being made from the server (not from EpubPressJS), so the change needs to be made there.

Here's a link to where the User Agent is defined on the server side: https://github.com/haroldtreen/epub-press/blob/master/lib/content-downloader.js#L186-L187

You should be able to change that to whatever you want. Hope that helps! 🤞
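
As a rough illustration of what changing the User Agent amounts to, here is a minimal sketch of request options carrying a custom `User-Agent` header. The name `buildRequestOptions` is hypothetical and is not taken from the EpubPress server; the actual code in `content-downloader.js` may be structured differently.

```javascript
// Hypothetical helper, not part of EpubPress: builds options for an HTTP
// request that identifies itself with a browser-like User-Agent string.
const BROWSER_USER_AGENT =
  'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 ' +
  '(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36';

function buildRequestOptions(url, userAgent = BROWSER_USER_AGENT) {
  return {
    url,
    headers: { 'User-Agent': userAgent },
  };
}

// Usage with Node's built-in fetch (Node 18+); performs a real network request:
// const { url, headers } = buildRequestOptions('https://example.com');
// fetch(url, { headers }).then((res) => console.log(res.status));
```

Whatever HTTP library the server uses, the idea is the same: the header value is just a string attached to the outgoing request.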

marcobosch1 commented 3 years ago

Wait, so then I think what I messed up was the "linking EpubPressJS to the local server" part. How exactly do I make EpubPressJS talk to a local server and how do I test it?

Here's what I've done.

First, I followed the instructions to get the EpubPress server up and running, navigated to /epub-press/ on Terminal and ran docker-compose up.

I can see it's working because if I turn off Wi-Fi and navigate to http://localhost:3000/ it shows me a page identical to https://epub.press/. And I can also see the requests being made on the Terminal instance running the server when I access http://localhost:3000/ on my browser.

Then, for troubleshooting purposes, I modified the Chrome extension as specified here to talk to my local server.

I can see that it's talking to the local server because when I use it to download any epub I can also see the requests being made on the Terminal window running the server (and Chrome tells me that it was downloaded from http://localhost:3000/).

However, when I run EpubPressJS, it downloads the epub but I don't see any requests being made on Terminal. I tried blocking http://epub.press by adding the following to my hosts file:

0.0.0.0     epub.press
0.0.0.0     www.epub.press

And the extension continues to work just fine. But if I try to run EpubPressJS it gives me an error:

Unhandled rejection Error: Server is down. Please try again later.
<plus a list of a bunch of files>

So I don't think EpubPressJS is talking to the local server (but I could be wrong). Here's what I've done to make me think EpubPressJS was talking to the local server:

I changed EpubPress.BASE_URL = packageInfo.baseUrl; to EpubPress.BASE_URL = "http://localhost:3000"; on this line on epub-press.js.

I then created a book.js on /epub-press-clients/packages/epub-press-js/ to run this:

const EpubPress = require('epub-press-js');

const ebook = new EpubPress({
    urls: ['https://example.com']
});

ebook.publish().then(() => { ebook.download() });

But I guess that's not it then... What should I do?

haroldtreen commented 3 years ago

Thanks for all this context - I think I understand what is going on!

To revisit the original problem, it sounds like there's some websites you want to download that require a special User Agent? To solve that problem, you're going to want to:

That should technically solve the original problem.

Alternatively, if you want to use the JS library with your local server you're on the right track. Two options that could work:

Hopefully one of those two options works!

marcobosch1 commented 3 years ago

Thank you for all the help, @haroldtreen! The modified extension is working. But I'm only using the extension for testing purposes, what I really want to do is to use the JS library on the Terminal.

I've made both changes you suggested to EpubPressJS (one at a time, then both at the same time), but I don't think 'EpubPressJS' is talking to my server, for three reasons (correct me if I'm wrong):

First, when I use the modified Chrome extension and generate an epub, things like this:

server_1         | POST /api/v1/books 202 9.379 ms - 18
server_1         | GET /api/v1/books/7jHKNfaNU/status 200 2.303 ms - 49
server_1         | GET /api/v1/books/7jHKNfaNU/status 200 1.413 ms - 46
server_1         | Executing (default): INSERT INTO "Books" ("title","sections","uid","createdAt","updatedAt") VALUES ($1,$2,$3,$4,$5) RETURNING *;
server_1         | verbose: Book Published id=7jHKNfaNU
server_1         | GET /api/v1/books/7jHKNfaNU/status 200 1.651 ms - 34
server_1         | Executing (default): SELECT "id", "title", "sections", "uid", "createdAt", "updatedAt" FROM "Books" AS "Book" WHERE "Book"."uid" = '7jHKNfaNU' LIMIT 1;
server_1         | GET /api/v1/books/7jHKNfaNU/download?filetype=epub 200 5.756 ms - 115265

pop up on the Terminal window running the server. When I use EpubPressJS the epub gets downloaded, but I don't see any of the above on Terminal.

Second, monitoring port 3000 on Wireshark doesn't show any requests being made when using EpubPressJS, but I get many when using the extension.

Third, blocking access to https://epub.press/ via the hosts file doesn't affect the modified Chrome extension, but running EpubPressJS gives me the error I mentioned in my previous message.

What else could be wrong? Maybe something during installation? Here's exactly what I've done:

cd ~
git clone https://github.com/haroldtreen/epub-press-clients.git
cd ~/epub-press-clients/packages/epub-press-js
npm install --save epub-press-js

This creates a /node_modules folder inside /epub-press-js. Then I try to run node book.js (previous message) on ~/epub-press-clients/packages/epub-press-js.

Do I have to modify files elsewhere as well, perhaps on /node_modules/epub-press-js?

haroldtreen commented 3 years ago

Ah - it looks like you are cloning the project but also installing the module? You should be able to just use the package from npm and get this to work - no cloning required.

Here's what I just got working on my machine:

npm install epub-press-js
touch index.js
# paste into index.js
node index.js

index.js

const EpubPress = require('epub-press-js');

EpubPress.BASE_API = 'http://localhost:3000/api/v1';
EpubPress.checkForUpdates().then(console.log).catch(console.error); // Queries localhost

Does that work?
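
One way to check which copy of the library Node actually loads (the cloned source you edited, or the copy that `npm install` placed under `node_modules`) is `require.resolve`, which returns the path a `require` call would use. A minimal sketch; the helper name `locateModule` is made up for illustration:

```javascript
// Hypothetical helper: reports where Node would load a module from,
// or null if the module cannot be found from this directory.
function locateModule(name) {
  try {
    return require.resolve(name);
  } catch (err) {
    if (err.code === 'MODULE_NOT_FOUND') return null;
    throw err;
  }
}

// Prints the resolved path of epub-press-js (or null if it is not installed).
console.log(locateModule('epub-press-js'));
```

If the printed path points into `node_modules`, edits made to the cloned source files will not affect the script.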

marcobosch1 commented 3 years ago

It worked! Now the requests are being made from localhost:3000. You rock, @haroldtreen!!! :-)

However, one thing still doesn't make sense: I'm still getting firewall errors when using the JS library, but not when using the extension.

Why would they be different? Aren't both requests being made from the same server (my local server) and with the same user-agent and everything, just with different "triggers"?

See if you can reproduce it. Using the extension on this page works perfectly. But when using EpubPressJS it generates an epub like this:

Image link

I've tried different user-agents but this still persists. Do you know how I can fix this?

haroldtreen commented 3 years ago

Huzzah! Good to hear.

Not sure what would cause the website to return that response, but the difference might be the result of how the pages are sent?

With the Chrome Extension - it's basically reaching into all the tabs and taking the content that is currently rendered and sending that to the server. This is great because it means you can publish anything that you're able to load into a tab.

With the Javascript library in Node, you don't get the benefits of a browser that's pre-rendered the page. You can either send links, which will be fetched by the server, or you can send your own blob of HTML.

const EpubPress = require('epub-press-js');

// Option 1: send links; the server fetches each page itself
const linkBook = new EpubPress({
    title: 'Best of HackerNews',
    description: 'Favorite articles from HackerNews in May, 2016',
    urls: [
        'http://medium.com/@techBlogger/why-js-is-dead-long-live-php'
    ]
});

// Option 2: send your own HTML for each section
const htmlBook = new EpubPress({
    title: 'Best of HackerNews',
    description: 'Favorite articles from HackerNews in May, 2016',
    sections: [
        {
            url: 'http://medium.com/@techBlogger/why-javascript-is-dead-long-live-php',
            html: '<html><body><p>Lulz.</p></body></html>',
        }
    ]
});

When you send a link, the server needs to request all the content. This is less reliable because the page could just be a bunch of scripts that run and build the page client side, or the website might be able to tell you're not a browser and reject you. When requesting that page I see there are some other headers sent besides just the User-Agent. I imagine EpubPress doesn't send all this and maybe that's causing you to be blocked:

image
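
To see which headers a browser sends that a script does not, the two sets can be compared by name. A minimal sketch, assuming you have copied the browser's request headers from the developer tools into an object; `missingHeaders` and the sample values are illustrative, not something EpubPress provides:

```javascript
// Hypothetical helper: lists header names present in the browser's request
// but absent from the client's request. Header names are case-insensitive.
function missingHeaders(browserHeaders, clientHeaders) {
  const sent = new Set(Object.keys(clientHeaders).map((h) => h.toLowerCase()));
  return Object.keys(browserHeaders)
    .map((h) => h.toLowerCase())
    .filter((h) => !sent.has(h));
}

// Illustrative values only; copy the real ones from your browser.
const browserHeaders = {
  'User-Agent': 'Mozilla/5.0 ...',
  Accept: 'text/html',
  'Accept-Language': 'en-US',
};
const clientHeaders = { 'user-agent': 'Mozilla/5.0 ...' };

console.log(missingHeaders(browserHeaders, clientHeaders));
```

Which of the missing headers a given site actually checks varies, so this only narrows down the candidates.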

If you go the HTML blob route, you can get creative with how you get the HTML and then just send it to be packaged to EpubPress. For example, you could use puppeteer to visit the page, pull the HTML and then send it with EpubPress JS.
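
A minimal sketch of that idea, assuming puppeteer is installed (`npm install puppeteer`). The helper name `fetchRenderedHtml` is hypothetical; `launch`, `newPage`, `setUserAgent`, `goto`, `content`, and `close` are puppeteer methods. The puppeteer module is passed in as an argument so the function itself has no hard dependency:

```javascript
// Hypothetical helper: loads a page in a headless browser and returns the
// rendered HTML. `puppeteer` is the module from require('puppeteer').
async function fetchRenderedHtml(puppeteer, url, userAgent) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    if (userAgent) await page.setUserAgent(userAgent);
    // Wait until network activity settles so client-side scripts can run.
    await page.goto(url, { waitUntil: 'networkidle2' });
    return await page.content(); // full document, including <html>
  } finally {
    await browser.close(); // always release the browser process
  }
}

// Usage (requires puppeteer and epub-press-js to be installed):
// const puppeteer = require('puppeteer');
// const EpubPress = require('epub-press-js');
// fetchRenderedHtml(puppeteer, 'https://example.com').then((html) => {
//   const ebook = new EpubPress({
//     title: 'Example',
//     sections: [{ url: 'https://example.com', html }],
//   });
//   return ebook.publish().then(() => ebook.download());
// });
```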

Websites are all very different, so hard to have something that works for everything. Hopefully that gives some more insight and ideas for jumping off points!

marcobosch1 commented 3 years ago

Thanks for all the help again, @haroldtreen.

I've spent the last week or so trying to make puppeteer work, and I've managed to download the entire HTML code from a webpage. I have two questions, though:

  1. I'm saving the output from puppeteer to a webpage.html file. How do I import the file instead of pasting the code on index.js? And if I want to create, say, 10 different ebooks from 10 different files, how would I do it?
  2. Can I pass the entire HTML code, or do I have to pass only the content inside <body> tags? I've tried pasting the entire HTML from the page above into index.js but it gave me an error...

EDIT: I think the error above is caused by the formatting/line breaks on the HTML. I assumed it'd be a bad idea to just paste it and that's why I wanted to import from a file... 😅 What would be the best way to import the code into index.js?

I hope this is not too annoying for you. Sorry to keep bothering!

haroldtreen commented 3 years ago

Ah - congrats on making progress. That's nice you've managed to run puppeteer and save HTML files locally.

Once you have the HTML downloaded to a file, you could probably use something like fs.readFileSync('webpage.html', 'utf8') to read the file and pass it to EpubPressJS.

const fs = require('fs');
const EpubPress = require('epub-press-js');

const ebook = new EpubPress({
    title: 'Best of HackerNews',
    description: 'Favorite articles from HackerNews in May, 2016',
    sections: [
        {
            url: 'http://medium.com/@techBlogger/why-javascript-is-dead-long-live-php',
            // Read the saved page as a UTF-8 string
            html: fs.readFileSync('webpage.html', 'utf8'),
        }
    ]
});

If you wanted to do this for multiple files, you could maybe do something like this:

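const fs = require('fs');
const EpubPress = require('epub-press-js');

const FILES = [
    { path: 'webpage1.html', url: 'http://webpage1.com' },
    { path: 'webpage2.html', url: 'http://webpage2.com' },
    // etc.
];

FILES.forEach((file, index) => {
    const ebook = new EpubPress({
        title: 'Webpage ' + index, // index is the file's position in FILES
        // Pass 'utf8' so the HTML is read as a string rather than a Buffer
        sections: [{ url: file.url, html: fs.readFileSync(file.path, 'utf8') }]
    });
    ebook.publish().then(() => ebook.download());
});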

As for what to pass, it looks like the chrome extension gets the document outerHTML: https://github.com/haroldtreen/epub-press-clients/blob/master/packages/epub-press-chrome/scripts/browser.js#L56. That's the whole thing including the <HTML>. Pasting straight HTML sounds messy - so definitely think you'll have more luck reading it in. Fingers crossed! 🤞
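
A variation on the multi-file loop that publishes one book at a time, so each download finishes before the next one starts. This is a sketch under assumptions: `publishAll` is a made-up name, the `EpubPress` class is passed in as an argument, and it relies only on the constructor options, `publish()`, and `download()` as used earlier in this thread:

```javascript
const fs = require('fs');

// Hypothetical helper: builds and downloads one ebook per saved HTML file,
// waiting for each book to finish before starting the next.
async function publishAll(EpubPress, files) {
  for (const [index, file] of files.entries()) {
    const ebook = new EpubPress({
      title: 'Webpage ' + (index + 1),
      sections: [
        // Read as UTF-8 so `html` is a string containing the whole document.
        { url: file.url, html: fs.readFileSync(file.path, 'utf8') },
      ],
    });
    await ebook.publish();
    await ebook.download();
  }
}

// Usage (requires epub-press-js and the listed files to exist):
// const EpubPress = require('epub-press-js');
// publishAll(EpubPress, [
//   { path: 'webpage1.html', url: 'http://webpage1.com' },
//   { path: 'webpage2.html', url: 'http://webpage2.com' },
// ]).catch(console.error);
```

Running the books in sequence keeps the server logs easier to follow and means a failure stops the run at the file that caused it.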