Closed by 0x4007 8 months ago
If we pass the pages/ controller logic directory as a parameter to a Scraper constructor, and import the scraper as an npm package, we should be able to solve the portability issue.
e.g.
import Scraper from "@ubiquity/scraper";
import path from "path";
(async () => {
  const controllers = path.resolve("pages");
  const scrape = new Scraper(controllers);
  const url = "https://github.com/ubiquity/scraper/issues/1";
  const result = await scrape(url);
  return result;
})();
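A minimal sketch of how that constructor injection could work, assuming a factory shape (createScraper and the controller-path layout below are hypothetical, not the package's real API): the returned function closes over the injected pages directory, so the kernel itself never needs relative imports.

```typescript
import path from "path";

// Hypothetical factory, not the real @ubiquity/scraper API: it closes over
// the caller-supplied pages directory instead of hard-coding relative paths.
function createScraper(pagesDir: string) {
  return async function scrape(url: string): Promise<string> {
    const { hostname, pathname } = new URL(url);
    // Map the URL onto a controller path inside the injected directory,
    // e.g. <pagesDir>/github.com/ubiquity/scraper/issues/1
    return path.join(pagesDir, hostname, pathname);
  };
}
```

A class whose constructor returns a bound function would give the `new Scraper(...)` spelling above; a plain factory keeps the sketch shorter.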
I just want to elaborate that the scraper page controller logic should be a separate module so that the scraper core logic can be clean without needing to worry about relative imports.
There's a lot of code I put together to scaffold a solution that will probably need to be adjusted as part of excising the pages/ controllers and placing them in a folder side by side with the scraper core.
We should make a new repo as a parent project to combine the pages and scraper logics together as two separate modules.
scraper should be a proper npm module, while the pages logic can be a simple directory within a parent project.
@pavlovcik would love to work on this, is this open? The major goal is to convert this repo into a module which can be used anywhere, right?
I think you should just do all the setup necessary to make it work as an npm package and then I can host it under ubiquity etc.
/start
Deadline: Sat, 21 Oct 2023 22:08:04 UTC
Registered Wallet: 0x7e92476D69Ff1377a8b45176b1829C4A5566653a
Tip: /wallet 0x0000...0000 if you want to update your registered payment wallet address.
I did a few experiments and was able to publish my fork to npm.js (https://www.npmjs.com/package/@korrrba/scraper-kernel-fork). I will of course delete it soon; just leaving it up temporarily for QA.
PR: https://github.com/ubiquity/scraper-kernel/pull/11
To consume the package from npmjs.com, I was able to build with:
package.json
{
"name": "scraper-kernel-typescript-express",
"version": "1.0.0",
"description": "Minimal use case for scraper-kernel",
"main": "src/index.js",
"author": "Korrrba",
"type": "module",
"scripts": {
"start": "ts-node src/index.ts",
"build": "tsc",
"serve": "node dist/index.js"
},
"license": "MIT",
"devDependencies": {
"@types/express": "^4.17.20",
"@types/node": "^20.8.7",
"express": "^4.18.2",
"ts-node": "^10.9.1",
"typescript": "^5.2.2"
},
"dependencies": {
"@korrrba/scraper-kernel-fork": "^0.15.0",
"eslint-import-resolver-custom-alias": "^1.3.2"
}
}
code
import express, { Request, Response } from 'express';
import scrape from '@korrrba/scraper-kernel-fork';
import path from 'path';

const app = express();
const port = process.env.PORT || 3000;

app.get('/', async (req: Request, res: Response) => {
  const url = "https://github.com/orgs/surfDB/repositories";
  const controllers = path.resolve("pages");
  const userSettings: any = { urls: url, pages: controllers };
  try {
    const response = await scrape(userSettings);
    res.send(response);
  } catch (error) {
    console.log(error);
    res.status(500).send("scrape failed");
  }
});

app.listen(port, () => console.log(`listening on ${port}`));
$ npm run build
> scraper-kernel-typescript-express@1.0.0 build
> tsc
$ npm run serve
> scraper-kernel-typescript-express@1.0.0 serve
> node dist/index.js
I did not have the pages controllers available, so it later complained:
at new Promise (<anonymous>)
at __async (file:///home/korrrba/express-test/scraper-kernel-typescript-express/node_modules/@korrrba/scraper-kernel-fork/dist/scrape.mjs:21:10)
at _searchForImport (file:///home/korrrba/express-test/scraper-kernel-typescript-express/node_modules/@korrrba/scraper-kernel-fork/dist/scrape.mjs:179:10)
at file:///home/korrrba/scraper-kernel-typescript-express/node_modules/@korrrba/scraper-kernel-fork/dist/scrape.mjs:175:18
at Generator.next (<anonymous>)
"/home/korrrba/scraper-kernel-typescript-express/pages/github.com/orgs/surfDB/index.ts" not found
"/home/korrrba/express-test/scraper-kernel-typescript-express/pages/github.com/orgs/surfDB/*" not found
Trace: × requested: /home/korrrba/express-test/scraper-kernel-typescript-express/pages/github.com/orgs/*
...
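The "not found" lines followed by the wildcard request suggest a most-specific-first lookup. A sketch of that candidate order (assumed from the log, not the kernel's actual resolver; controllerCandidates is a hypothetical name): try the exact directory's index.ts, then its wildcard, then walk back up substituting "*" for path segments.

```typescript
import path from "path";

// Hypothetical reconstruction of the lookup order implied by the log:
// most-specific index.ts first, then wildcard fallbacks.
function controllerCandidates(pagesDir: string, url: string): string[] {
  const { hostname, pathname } = new URL(url);
  const segments = [hostname, ...pathname.split("/").filter(Boolean)];
  const exact = path.join(pagesDir, ...segments);
  const candidates = [path.join(exact, "index.ts"), path.join(exact, "*")];
  // Walk back up, substituting "*" for the last segment at each level,
  // e.g. pages/github.com/*/index.ts for https://github.com/gitcoindev
  for (let i = segments.length - 1; i > 0; i--) {
    candidates.push(path.join(pagesDir, ...segments.slice(0, i), "*", "index.ts"));
  }
  return candidates;
}
```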
but this is not in scope of this bounty anyway; I hope you will be able to continue from here. The npm package build and deployment look fine.
A message for the bot: waiting for PR review and merge. A message for the team: this can go in after the big refactoring is merged to the ubiquibot repository by @pavlovcik.
Hey thanks for working on this! Crazy that you have 680 downloads?
Since creating this issue, I made it so that "pages" directory can be passed into the kernel from a parent project. Here's an example with probably some outdated submodule code of this kernel.
https://github.com/pavlovcik/scraper-parent-test/tree/main/src/pages
It was designed to fs.write at the program top level. In the pages controllers the function signature is something like:
export default async function (browser: Browser, page: Page)
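For illustration, a hypothetical pages/github.com/*/index.ts controller with that signature. The interfaces below are minimal structural stand-ins for puppeteer's Browser and Page (the real kernel passes actual puppeteer instances), and in the real file the function would be the default export.

```typescript
// Minimal stand-ins for puppeteer's types, just enough for the sketch.
interface Page {
  $eval(selector: string, fn: (el: { textContent: string | null }) => string | null): Promise<string | null>;
}
interface Browser {}

// In the real controller file this would be `export default async function`.
async function githubProfileController(browser: Browser, page: Page) {
  // The kernel has already navigated `page` to the matched URL; the
  // controller only extracts data and returns it.
  const name = await page.$eval('[itemprop="name"]', (el) => el.textContent);
  return { name };
}
```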
I already use this kernel across several private projects with their own custom pages controllers.
@pavlovcik all right, thank you for showing me the pages location and describing the workflow. Yes, 680 downloads looks very suspicious, or the package simply generates interest. I could have generated two or three dozen downloads myself during tests, but not more.
Perhaps I will check this week integration with https://github.com/pavlovcik/scraper-parent-test/tree/main/src/pages
@pavlovcik good news: I was able to verify this with scraper-parent-test. It revealed an issue, which was to fix the logging exports. Below is a log/demo showing that it works as an NPM module:
$ yarn start https://github.com/gitcoindev
yarn run v1.22.19
$ tsx src -h --chromium="--user-data-dir=cache" --table sandbox https://github.com/gitcoindev
⚙️ {
"headful": true,
"chromium": [
"--user-data-dir=cache"
],
"table": "sandbox",
"urls": [
"https://github.com/gitcoindev"
],
"pages": "src/pages/"
}
✓ /home/korrrba/work/scraper-parent-test/src/metamask found!
>> https://github.com/gitcoindev
"/home/korrrba/work/scraper-parent-test/src/pages/github.com/gitcoindev/index.ts" not found
"/home/korrrba/work/scraper-parent-test/src/pages/github.com/gitcoindev/*" not found
writing to database table sandbox
✓ "/home/korrrba/work/scraper-parent-test/src/pages/github.com/*/index.ts" module loaded successfully
this is a personal profile
⚠ "[aria-label^="Organization:"]" not found
⚠ "[data-test-selector="profile-website-url"]" not found
⚠ "[href*=twitter]" not found
Trying to upsert
{
"tableName": "sandbox"
}
<< [
{
login: 'gitcoindev',
type: 'User',
name: 'korrrba',
company: null,
blog: null,
location: 'Web3 / Europe',
email: null,
bio: null,
twitter_username: null,
_public_repos: '28',
_followers: '2',
_following: '3',
_created_at: '2021',
contributions: '263',
percent_commits: null,
percent_issues: null,
percent_pull_requests: null,
percent_code_reviews: null,
recruited_by: null
}
]
Done in 4.06s.
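The ⚠ lines and the null fields in the row above come from probing optional selectors. A sketch of that pattern (assumed, not the kernel's actual helper; textOrNull and the Page interface are stand-ins): a missing selector warns and yields null instead of failing the scrape.

```typescript
// Structural stand-in for puppeteer's Page; $eval rejects when the
// selector matches nothing, which is what we catch below.
interface Page {
  $eval(selector: string, fn: (el: { textContent: string | null }) => string | null): Promise<string | null>;
}

// Probe an optional selector: warn and return null when it is absent,
// so a missing profile field becomes a null column instead of an error.
async function textOrNull(page: Page, selector: string): Promise<string | null> {
  try {
    return await page.$eval(selector, (el) => el.textContent);
  } catch {
    console.warn(`⚠ "${selector}" not found`);
    return null;
  }
}
```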
I will update the PR today and also open an integration pull request to scraper-parent-test. You will not need to merge the scraper-parent-test pull request as is; it is just to show an example.
I updated the pull request and provided an NPM usage demo on a fork: https://github.com/pavlovcik/scraper-parent-test/pull/4
Pull request https://github.com/ubiquity/scraper-kernel/pull/11 ready for the review.
Permit generation skipped since this issue didn't qualify as bounty
0x7e92476D...A5566653a
Hey @gitcoindev if you're interested in less structured work, please send me a message on Telegram. We could use extra hands on some things!
Hi @pavlovcik thank you! I will join the Telegram group and message you.
Can't figure out how to call this through an HTTP interface. I tried making a simple Express project and including this repo as a submodule, but all of the relative imports break and I can't figure out how to make this portable.