superfly / fly

Deploy app servers close to your users. Package your app as a Docker image, and launch it in 17 cities with one simple CLI.
https://fly.io
JS renderer on fly #291

Open geshan opened 4 years ago

geshan commented 4 years ago

I wrote a side project which I think is a great fit to try on fly.io.

JS renderer is an online puppeteer service to render pages with javascript (js). Mainly useful for web scraping (not using splash).

At times while scraping web pages you will come across websites or web pages that only render in a browser that runs the loaded JavaScript. If you curl them or use something like Scrapy, you just end up with HTML that is not useful.

This project aims to solve that issue with Puppeteer. With Scrapy you can use Splash but it is Scrapy specific and not easy to configure.

This would be a great example for fly.

mrkurt commented 4 years ago

This sounds super cool. It's basically a service that executes JS and then returns the resulting DOM? I think with a README about why that's interesting and how it's better if you run it close to certain cities, that's a pretty great example.

I've actually wanted examples of Puppeteer for other stuff:

  1. Screenshots/thumbnails
  2. Lighthouse tests

Could be good for a second example. ;)

geshan commented 4 years ago

Hey @mrkurt , appreciate your fast reply.

"It's basically a service that executes JS and then returns the resulting DOM?" - Yes, you are right :)

Here is the repo with steps on how to get this app running on fly.io - https://github.com/geshan/js-renderer-fly . Let me know what would be the next step(s) to get it on fly-examples. I am open to editing the Readme too.

I can do a screenshot as a service example as the next one. Thanks!

codepope commented 4 years ago

Hi, this is a good start - I have some notes for you -


The opener is talking to an audience that already knows what the problem is, and even what the most common tool is.

Suggest that it might start up something like

"JavaScript is the bane of a web scraper's life. Scraping is all about extracting data from a web page and JavaScript is there adding content, hiding blocks, moving the DOM around and just reading the HTML from the server is just not enough. What you ideally want is a way to run all that JavaScript on the page so you can see what's left after that. Then you can get down to some serious scraping.

There are tools to do this out there but most have their own complications or restrictions that stop them from being used out on the edge. Js-renderer-fly has none of those problems and with Fly, you can deploy close to your users too."

(Roll in the Uses section here, with a practical example - maybe scrape Instagram data and produce a list of pics)

How to deploy it on Fly - move 1 and 2 into prerequisites...

3 - only works if you are logged in with SSH support enabled

5/6 - Run flyctl init - hit return for an app name to be generated (unless there's a name you really want)

You can add flyctl init --dockerfile to skip the picking of the builder

Also, re orgs - first one on the list will be your own org now

Not sure the deploy screenshot adds much - maybe explain the stages briefly? The details aren't added to the fly.toml file, they come from the fly.toml file

11 Not sure what you are saying there.

A tour of commands might be good at this point.... status, restart, pause? Leading into the scale commands and a regions command to put an instance on every continent

A script to do something fun with IG or similar to wrap up completing the task from the start?

geshan commented 4 years ago

@codepope I have made the suggested changes here: https://github.com/geshan/js-renderer-fly/pull/6/files let me know if it is ok, thanks!

codepope commented 4 years ago

Make a branch and merge the changes into that branch. It's difficult to review article/readme content with just patch files.

Some quick notes though. Explain what puppeteer is, or at least link to it. (see previous note on audience).

geshan commented 4 years ago

@codepope merged to master, it can be seen here: https://github.com/geshan/js-renderer-fly . I will add a bit more detail about puppeteer soon. If anything else needs to be added, please let me know, thanks!

codepope commented 4 years ago

"Scraping is all about extracting data from a web page and JavaScript is there adding content, hiding blocks, moving the DOM around and just reading the HTML from the server is just not enough." change to "Scraping is all about extracting data from a web page and JavaScript is there adding content, hiding blocks and moving the DOM around. Just reading the HTML from the server is just not enough."

"There are tools to do this out there but most have their own compliacations or restrictions that stop them from being used out on the edge. Js-renderer-fly has none of those problems and with Fly, you can deploy to close to your users too. This is an online puppeteer service to render pages with javascript (js) very useful for web scraping." - pull together to make one para... something like "There are tools to do this out there but most have their own complications or restrictions that stop them from being used out on the edge. Js-renderer-fly has none of those problems and with Fly, you can deploy close to your users too. At its core, js-renderer-fly is a Puppeteer-based service. Puppeteer is a package which renders pages using a headless Chrome instance, executing the JavaScript within the page."
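As a rough illustration of the render step described in that paragraph, here is a minimal sketch. It assumes `browser` is the object that Puppeteer's `puppeteer.launch()` resolves to. The function name `renderPage` and the choice of `networkidle0` as the wait condition are illustrative assumptions, not taken from the js-renderer-fly source.

```javascript
// Sketch only: load a URL in a Puppeteer-style browser and return the DOM
// as HTML after the page's JavaScript has run.
// `renderPage` is a hypothetical name, not the project's actual function.
async function renderPage(browser, url) {
  const page = await browser.newPage();
  try {
    // Assumption: waiting for the network to go idle is a good enough signal
    // that client-side scripts have finished filling in the page.
    await page.goto(url, { waitUntil: 'networkidle0' });
    // page.content() serializes the current DOM, including script-added nodes.
    return await page.content();
  } finally {
    // Always release the tab, even if navigation fails.
    await page.close();
  }
}
```

With Puppeteer installed, this would be called with a launched browser, for example `renderPage(await puppeteer.launch(), 'https://example.com')`, and the browser closed afterwards.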

Uses section seems redundant. Maybe blend with the Quick Try....

Explain that a typical YouTube page adds the view count in JavaScript and to get that value, we're going to use js-renderer-fly to pull out that value after the JavaScript has run.
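To make that point concrete, the sketch below pulls a view count out of an HTML string. The `view-count` class name and the helper name `extractViewCount` are assumptions for illustration; the real page markup and the repo's own script may differ. The helper returns `null` for HTML fetched without running JavaScript, where the count is absent, and a number once the rendered HTML contains it.

```javascript
// Hypothetical helper: extract a view count from an HTML string.
// Assumption: the count sits in an element whose class list includes
// "view-count", with text such as "1,234,567 views".
function extractViewCount(html) {
  const match = html.match(/class="[^"]*\bview-count\b[^"]*"[^>]*>([^<]*)</);
  if (!match) return null;
  // Keep only the digits so thousands separators and the word "views" drop out.
  const digits = match[1].replace(/[^0-9]/g, '');
  return digits === '' ? null : Number(digits);
}

// HTML as served, before any script has filled in the count.
const rawHtml = '<span class="view-count"></span>';
// HTML after rendering, with the count added by JavaScript.
const renderedHtml = '<span class="view-count style-scope">1,234,567 views</span>';

console.log(extractViewCount(rawHtml));
console.log(extractViewCount(renderedHtml));
```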

The quick-try ideally should prompt the user at that point to clone the github repo.

Will have to go over it for typos and things like "Then select and org"....

Pull the resources section into the "More Fly Commands" section so you cover lifecycle, vertical scaling and global scaling.

That's about it for now.

geshan commented 4 years ago

Hi @codepope , I have done most of the changes: https://github.com/geshan/js-renderer-fly.

I have done a quick typo and grammar fix with grammarly, thanks for the ping.

"The quick-try ideally should prompt the user at that point to clone the github repo." - This is the part I am not clear about. So this node script should clone this repo and try to deploy it for the user?

Let me know if more changes are required, thanks!

codepope commented 4 years ago

At that point, reading through, the user will not have downloaded anything. So, you'd likely want to suggest they either grab the script from the repo or clone the repo before discussing the script.

geshan commented 4 years ago

@codepope Just fixed that part too, thanks for the ping it made sense. Latest changes are here: https://github.com/geshan/js-renderer-fly . Open to suggestions.

geshan commented 4 years ago

@codepope feedback is welcome :)

codepope commented 4 years ago

generally youtube -> YouTube

The installer line "git clone git@github.com:geshan/js-renderer-fly.git && cd js-renderer-fly && npm install && node yt-views.js" is a bit daunting and probably better broken down into separate lines

Also the scraper app is a bit of a black box - maybe a line or two about what it does? (and a mention for the axios library) so people could use it to kickstart actually writing a scraper.

Run Locally now repeats those instructions too... Suggest make "Quick Try" a "Quick Start" and add a sub section for installing, and a sub section for "Your first scraping" or something. "Use it as a service" should point out that the instructions later will show you how to deploy it as a service and that you are just showing how it works with an already deployed version.

"if you are logged in the SSH support enabled else try" else->otherwise

" I tried with: " -> "I ran it with js-renderer-fly as the app name for the examples"

"Subsequently, you can select an organization. Generally, it will be your first name-last name on the prompt" Generally, it->Usually, this will...

Step 9 is confusing. I think you are trying to get two things over at once. Also, it's not clear what the command line should be: flyctl open /api/render?url=<your-url>, I assume.
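One detail the `/api/render?url=<your-url>` path glosses over is that the target URL should be percent-encoded, otherwise its own query string can be misread as parameters of the render request. A minimal sketch, assuming a hypothetical deployment address and a hypothetical helper name `buildRenderUrl`:

```javascript
// Sketch: build the request URL for the /api/render endpoint.
// `serviceBase` is a placeholder for wherever the service is deployed.
function buildRenderUrl(serviceBase, targetUrl) {
  const endpoint = new URL('/api/render', serviceBase);
  // searchParams.set percent-encodes the value, so "?" and "&" inside the
  // target URL stay part of the `url` parameter.
  endpoint.searchParams.set('url', targetUrl);
  return endpoint.toString();
}

// Hypothetical host; substitute the hostname of your own deployment.
console.log(buildRenderUrl('https://my-renderer.example', 'https://example.com/watch?v=abc&t=1'));
```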

You may want to give a reason why you would want to suspend the service

"So I wanted to check how much resources were allocated to this app on fly by default. It was easy to know with the following commands" -> "I wanted to see what resources were allocated to the App on Fly. The scale commands allowed me to find out"

"Now your service is running well in one data center for me it was iad which is Ashburn, Virginia (US). Now let's add some more:"

Each sentence starts with Now. "Our service is now running in one data center. For me, it's iad (Ashburn, Virginia) but yours will likely be different based on where you are working from. We can add instances around the world to speed up responses..."

geshan commented 4 years ago

@codepope appreciate you taking the time for such detailed feedback, all of it has been updated: https://github.com/geshan/js-renderer-fly . Let me know if it needs any more improvements, thanks!

geshan commented 4 years ago

Just a ping for this @codepope :).

geshan commented 4 years ago

Thanks for the PRs @codepope , both have been merged. What would the next step be?

mrkurt commented 4 years ago

@geshan Have you been emailing with @KittyBot? This looks great, next steps are:

  1. Transfer the repository to me (mrkurt)
  2. We'll work out payment over email.

geshan commented 4 years ago

@mrkurt Yes I am in comms with @KittyBot . I have transferred the repo to you, thanks a lot!

mrkurt commented 4 years ago

Good deal! It's here now, I'll get it in our docs soon: https://github.com/fly-examples/puppeteer-js-renderer

mrkurt commented 4 years ago

Alrighty, now in the docs. I renamed the project and did a little cleanup of project names in the README text, but check it over and see if I missed anything: http://fly.io/docs/app-guides/puppeteer-js-renderer/

Feel free to submit PRs to the project for any changes you want to make.

geshan commented 4 years ago

Looks really nice. There were some references to the old project name and Github URL, I have changed them in this PR, thanks!