ubiquity / scraper-kernel

A Puppeteer-based scraping platform with modular, page-level scraping logic.

Convert into NPM Package #1

Closed 0x4007 closed 8 months ago

0x4007 commented 2 years ago

Can't figure out how to call this through an HTTP interface. I tried making a simple Express project and including this repo as a submodule, but all of the relative imports break and I can't figure out how to make this portable.

0x4007 commented 1 year ago

If we pass the pages/ controller logic directory as a param to a Scraper constructor, and import it as an npm package, we should be able to solve the portability issue.

e.g.

import Scraper from "@ubiquity/scraper";
import path from "path";

(async () => {
    const controllers = path.resolve("pages");
    const scraper = new Scraper(controllers);
    const url = "https://github.com/ubiquity/scraper/issues/1";
    const result = await scraper.scrape(url);
    return result;
})();
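Internally, the kernel would need to map each URL onto a controller inside the caller-supplied pages/ directory. A minimal sketch of that mapping, assuming the convention of mirroring host and pathname on disk (the helper name `controllerPathFor` is hypothetical, not a published API):

```typescript
import * as path from "path";

// Hypothetical helper (sketch): mirror the URL's host and pathname on disk,
// so https://github.com/ubiquity maps to pages/github.com/ubiquity/index.ts.
export function controllerPathFor(pagesDir: string, url: string): string {
  const { hostname, pathname } = new URL(url);
  const segments = pathname.split("/").filter(Boolean);
  return path.join(pagesDir, hostname, ...segments, "index.ts");
}
```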
0x4007 commented 1 year ago

I just want to elaborate that the scraper page controller logic should be a separate module so that the scraper core logic can be clean without needing to worry about relative imports.

There's a lot of code I put together to scaffold a solution, and it will probably need to be adjusted as part of excising the pages/ page controllers and placing them in a folder side by side with the scraper core.

We should make a new repo as a parent project to combine the pages and scraper logic as two separate modules.

The scraper should be a proper npm module, while the pages logic can be a simple directory within the parent project.
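Under that split, a parent project might be laid out like this (a hypothetical layout, not an existing repo):

```
parent-project/
├── package.json        # depends on @ubiquity/scraper from npm
├── pages/              # site-specific page controllers
│   └── github.com/
│       └── index.ts
└── src/
    └── index.ts        # imports the kernel, passes path.resolve("pages")
```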

b4s36t4 commented 1 year ago

@pavlovcik I would love to work on this — is it open? The major goal is to convert this repo into a module that can be used anywhere, right?

0x4007 commented 12 months ago

I think you should just do all the setup necessary to make it work as an npm package, and then I can host it under ubiquity etc.

gitcoindev commented 8 months ago

/start

ubiquibot[bot] commented 8 months ago

Deadline Sat, 21 Oct 2023 22:08:04 UTC
Registered Wallet 0x7e92476D69Ff1377a8b45176b1829C4A5566653a

gitcoindev commented 8 months ago

I did a few experiments and was able to publish my fork to npmjs.com (https://www.npmjs.com/package/@korrrba/scraper-kernel-fork). I will of course delete it soon; I am leaving it up temporarily for QA.

PR: https://github.com/ubiquity/scraper-kernel/pull/11

In order to consume it from npmjs.com, I was able to build with the following:

package.json

{
  "name": "scraper-kernel-typescript-express",
  "version": "1.0.0",
  "description": "Minimal use case for scraper-kernel",
  "main": "src/index.js",
  "author": "Korrrba",
  "type": "module",
  "scripts": {
    "start": "ts-node src/index.ts",
    "build": "tsc",
    "serve": "node dist/index.js"
  },
  "license": "MIT",
  "devDependencies": {
    "@types/express": "^4.17.20",
    "@types/node": "^20.8.7",
    "ts-node": "^10.9.1",
    "typescript": "^5.2.2"
  },
  "dependencies": {
    "@korrrba/scraper-kernel-fork": "^0.15.0",
    "eslint-import-resolver-custom-alias": "^1.3.2",
    "express": "^4.18.2"
  }
}

code (src/index.ts)

import express, { Request, Response } from 'express';
import scrape from '@korrrba/scraper-kernel-fork';
import path from 'path';

const app = express();
const port = process.env.PORT || 3000;

app.get('/', async (req: Request, res: Response) => {
  const url = "https://github.com/orgs/surfDB/repositories";
  const controllers = path.resolve("pages");

  const userSettings: any = { urls: url, pages: controllers };

  try {
    const response = await scrape(userSettings);
    res.send(response);
  } catch (error) {
    console.log(error);
    res.status(500).send("scrape failed");
  }
});

app.listen(port, () => console.log(`listening on port ${port}`));
$ npm run build

> scraper-kernel-typescript-express@1.0.0 build
> tsc

$ npm run serve

> scraper-kernel-typescript-express@1.0.0 serve
> node dist/index.js

I do not have access to the controllers, so it later complained:

    at new Promise (<anonymous>)
    at __async (file:///home/korrrba/express-test/scraper-kernel-typescript-express/node_modules/@korrrba/scraper-kernel-fork/dist/scrape.mjs:21:10)
    at _searchForImport (file:///home/korrrba/express-test/scraper-kernel-typescript-express/node_modules/@korrrba/scraper-kernel-fork/dist/scrape.mjs:179:10)
    at file:///home/korrrba/scraper-kernel-typescript-express/node_modules/@korrrba/scraper-kernel-fork/dist/scrape.mjs:175:18
    at Generator.next (<anonymous>)
          "/home/korrrba/scraper-kernel-typescript-express/pages/github.com/orgs/surfDB/index.ts" not found
          "/home/korrrba/express-test/scraper-kernel-typescript-express/pages/github.com/orgs/surfDB/*" not found
Trace:  × requested: /home/korrrba/express-test/scraper-kernel-typescript-express/pages/github.com/orgs/*
...

but this is not in scope of this bounty anyway; I hope you will be able to continue from here. The npm package build and deployment look fine.

gitcoindev commented 8 months ago

A message for the bot: waiting for PR review and merge. A message for the team: this can go in after the big refactoring merge to the ubiquibot repository by @pavlovcik.

0x4007 commented 8 months ago

Hey, thanks for working on this! Crazy that you have 680 downloads?

Since creating this issue, I made it so that the "pages" directory can be passed into the kernel from a parent project. Here's an example, with probably some outdated submodule code of this kernel:

https://github.com/pavlovcik/scraper-parent-test/tree/main/src/pages

It was designed to

  1. start Chrome
  2. install metamask
  3. load the metamask cache from another submodule (restore consistent state)
  4. interact with uad.ubq.fi for an end-to-end test with metamask connected.
  5. return anything which will be received by a console.log or fs.write at the program top level.

In the pages controllers, the function signature is something like:

export default async function (browser: Browser, page: Page)
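A minimal controller following that signature might look like the sketch below. The real `Browser` and `Page` types come from puppeteer; they are stubbed structurally here so the example stays self-contained, and the file path in the comment is hypothetical:

```typescript
// Structural stand-ins for puppeteer's Browser and Page types
// (assumption: only page.title() is needed for this sketch).
type Browser = unknown;
type Page = { title(): Promise<string> };

// In a real project this would be the default export of a file such as
// pages/github.com/index.ts (hypothetical path).
const controller = async function (browser: Browser, page: Page) {
  // Whatever a controller returns is what the top-level scrape() call
  // resolves with, ready for console.log or fs.write.
  return { title: await page.title() };
};
```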

I already use this kernel across several private projects with their own custom pages controllers.

gitcoindev commented 8 months ago

@pavlovcik all right, thank you for showing me the pages location and describing the workflow. Yes, 680 downloads looks very suspicious — or the package simply generates interest. I could have generated two or three dozen downloads myself during tests, but not more.

gitcoindev commented 8 months ago

Perhaps this week I will check the integration with https://github.com/pavlovcik/scraper-parent-test/tree/main/src/pages

gitcoindev commented 8 months ago

@pavlovcik good news: I was able to verify this with scraper-parent-test. It revealed an issue with the logging exports, which I fixed. Below is a log / demo showing that it works with the npm module:

$ yarn start https://github.com/gitcoindev
yarn run v1.22.19
$ tsx src -h --chromium="--user-data-dir=cache" --table sandbox https://github.com/gitcoindev
        ⚙️ {
             "headful": true,
             "chromium": [
               "--user-data-dir=cache"
             ],
             "table": "sandbox",
             "urls": [
               "https://github.com/gitcoindev"
             ],
             "pages": "src/pages/"
           }
        ✓ /home/korrrba/work/scraper-parent-test/src/metamask found!
        >> https://github.com/gitcoindev
          "/home/korrrba/work/scraper-parent-test/src/pages/github.com/gitcoindev/index.ts" not found
          "/home/korrrba/work/scraper-parent-test/src/pages/github.com/gitcoindev/*" not found
          writing to database table sandbox
        ✓ "/home/korrrba/work/scraper-parent-test/src/pages/github.com/*/index.ts" module loaded successfully
          this is a personal profile
        ⚠ "[aria-label^="Organization:"]" not found
        ⚠ "[data-test-selector="profile-website-url"]" not found
        ⚠ "[href*=twitter]" not found
          Trying to upsert
          {
            "tableName": "sandbox"
          }
        << [
             {
               login: 'gitcoindev',
               type: 'User',
               name: 'korrrba',
               company: null,
               blog: null,
               location: 'Web3 / Europe',
               email: null,
               bio: null,
               twitter_username: null,
               _public_repos: '28',
               _followers: '2',
               _following: '3',
               _created_at: '2021',
               contributions: '263',
               percent_commits: null,
               percent_issues: null,
               percent_pull_requests: null,
               percent_code_reviews: null,
               recruited_by: null
             }
           ]
Done in 4.06s.

I will update the PR today and also open an integration pull request against scraper-parent-test. You will not need to merge that pull request as is; it is just to show an example.
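The log above shows the kernel trying an exact controller path first and then falling back to a wildcard (`.../gitcoindev/index.ts` not found, `.../github.com/*/index.ts` loaded). A sketch that approximates that lookup order (the helper `controllerCandidates` is hypothetical, not the kernel's actual code):

```typescript
import * as path from "path";

// Hypothetical sketch of the fallback order seen in the log: the exact
// controller path first, then progressively wilder "*" fallbacks.
export function controllerCandidates(pagesDir: string, url: string): string[] {
  const { hostname, pathname } = new URL(url);
  const segments = pathname.split("/").filter(Boolean);
  const candidates: string[] = [];
  for (let depth = segments.length; depth >= 0; depth--) {
    const parts = segments.slice(0, depth);
    if (depth < segments.length) parts.push("*"); // wildcard for the trimmed tail
    candidates.push(path.join(pagesDir, hostname, ...parts, "index.ts"));
  }
  return candidates;
}
```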

gitcoindev commented 8 months ago

I updated the pull request and provided an npm usage demo on a fork: https://github.com/pavlovcik/scraper-parent-test/pull/4

gitcoindev commented 8 months ago

Pull request https://github.com/ubiquity/scraper-kernel/pull/11 is ready for review.

ubiquibot[bot] commented 8 months ago

Permit generation skipped since this issue didn't qualify as a bounty

If you've enjoyed your experience in the DevPool, we'd appreciate your support. Follow Ubiquity on GitHub and star this repo. Your endorsement means the world to us and helps us grow!
We are excited to announce that the DevPool and UbiquiBot are now available to partners! Our ideal collaborators are globally distributed crypto-native organizations, who actively work on open source on GitHub, and excel in research & development. If you can introduce us to the repository maintainers in these types of companies, we have a special bonus in store for you!

ubiquibot[bot] commented 8 months ago

Task Assignee Reward

[ CLAIM 100 WXDAI ]

0x7e92476D...A5566653a


ubiquibot[bot] commented 8 months ago

Task Creator Reward

pavlovcik: [ CLAIM 12.6 WXDAI ]

0x4007 commented 8 months ago

Hey @gitcoindev if you're interested in less structured work, please send me a message on Telegram. We could use extra hands on some things!

gitcoindev commented 8 months ago

> Hey @gitcoindev if you're interested in less structured work, please send me a message on Telegram. We could use extra hands on some things!

Hi @pavlovcik thank you! I will join the Telegram group and message you.