ulixee / hero

The web browser built for scraping
MIT License
805 stars 41 forks source link

API to send logs to core database (edited title) #157

Open GlenDC opened 2 years ago

GlenDC commented 2 years ago

As far as I understand the Secret-Agent code, and I mind you that I'm still in the process of fully grasping the entire codebase, logs from all log levels end up at:

https://github.com/ulixee/secret-agent/blob/db3f3535636707ed2fd9c7f5dcc29f7d4bd07cc7/commons/Logger.ts#L47-L103

The problem with this is that from a light client there is no way for me to really capture those logs and send them back for my own purposes. Let me show you, if you will, how I currently make use of Secret-Agent.

I create my core as follows:

import Core from '@secret-agent/core';

await Core.start({
        coreServerPort: port,
});

For each use of the core I create a new agent as follows:

this.agent = new Agent({
        name: 'brorun-playground',
        userAgent,
        showReplay: false,
        connectionToCore: new RemoteConnectionToCore({
                host: coreAddress,
        }),
});

And then I make use of the agent directly (e.g. to go to a page, wait for stuff, interact with the dom) by using that this.agent.

At the end I also close the agent as follows:

await this.agent.close();

Now the problem is that the Core, which I imagine is started as a child process (haven't checked to be sure) is doing this logging but as we can see from the Logging code it logs directly to the console. The brute-force way would be to capture the logs, but this would not restrict itself to logs just for that one client but capture it all. I could than filter on sessionID but that seems dirty?

What would you recommend on getting these logs per agent? I don't think there is a way currently to do so, but would you be open to think about how we could handle that? Would it be something you would be willing to support (with the default still being that we just log it all directly to the console. E.g. a first initial kick-off for the proposal could be:


Related to this question I also wonder what one can really do with await agent.sessionId. It seems to me that he code to actually get the replay via the official code is not really exposed (e.g. getSession etc). Is one expected to work directly with the undocumented SQL files? Not that it would help me for these purposes as it doesn't seems that the logs are stored in there?

blakebyrnes commented 2 years ago

Good questions. So... the database has a SessionLogs table that already does include all logs for your session, and the DevtoolsMessages table includes all Devtools api/events. So all your logs are already there. We will definitely have features for exporting/streaming the database for a sessionId, but like you said, nothing is documented/official yet.

await agent.sessionId

There isn't a ton to do with this yet beyond knowing which database is associated with a given session. It's mostly used under the covers, and we (in our own use) occasionally log it on the client side so we know which DB to got get


I think you can probably cover what you're trying to do by simply fetching the appropriate DB, but if you wanted to enhance the logs, there's currently an ability to "inject" your own logger (some details on this page): https://secretagent.dev/docs/help/troubleshooting/#debugging-logs

GlenDC commented 2 years ago

Ok for now I'll get the raw data from the DB for now, might be enough. However I guess it might be good if we can work out better access to that session data from the light client, should be possible, especially when on same machine, no?

I mean than such that it can have official support.

blakebyrnes commented 2 years ago

Can you help me understand the use case of needing logs in the client side? It should be possible, but I don't yet see why you want that capability. It seems like a lot of extra unnecessary logging to be piping over the wire.

GlenDC commented 2 years ago

Things can always go wrong, and so it's useful for us to be able to know when it goes wrong that we can look at the logs and see why it went wrong. Having it not stored per session is going to be a nightmare otherwise.

In that sense I would like to have both the logs I output myself from the agent script as well as the logs of the secret-agent service itself.

So nothing special, your classic needs for logs really.

blakebyrnes commented 2 years ago

Ok, I'm with you. Another option here might be to provide a logger API to your client that would store into the database. That way you're not streaming everything (which can be a lot)

GlenDC commented 2 years ago

Can you elaborate further on that, as in the beginning of your sentence I thought I followed you, but the second half confused me.

The logger API gave me one idea, but than the not streaming everything and store into the database parts confused me a bit.

blakebyrnes commented 2 years ago

Let's say you have 1 Client connected to a remote Core. If you are thinking of subscribing to log events from the Client, you're going to be streaming ALL of the logs over the web socket connection (Core -> Client). I think for most cases of a Client library, you don't actually want that. If we were to provide a logger api that allows you to "record" logs from Client -> Core, then they would all be collected into the Core database stored with a single session's entire record. Does that make more sense?

blakebyrnes commented 2 years ago

If you already have a logger, you might just want to log from client to your log "service" with your sessionId, and then inject a logger into Core that logs to your "service" with the sessionId. Ie, you might not need to send your client logs to Core or vice-versa.

GlenDC commented 2 years ago

True, but how would you let the core make use of your Client-defined logger?

And indeed, I didn't think of that, but you would indeed also want to be able to log from client to that same session log, as that's indeed where your agent script logs would come from.

blakebyrnes commented 2 years ago

I think the feature would entail adding a "command" that you can use from client (probably something like agent.createLogger()) and that logger would send commands to Core just how SecretAgent sends commands. They would then get stored in the session db (logger would be per-session).

What log service are you trying to use?

GlenDC commented 2 years ago

No log service, logging to stdout is all I'll do for now, that or show it directly in an app. The other things which uses the stdout captures it from there.

GlenDC commented 2 years ago

Managed to make it work the other day, thanks for your help @blakebyrnes, here is a summary of how I did it in the end:

export enum SessionLogLevel {
    stats = 'stats',
    info = 'info',
    warning = 'warn',
    error = 'error',
}

export interface SessionLog {
    timestamp: number;
    level: SessionLogLevel;
    action: string;
    data: string;
}

const sessionId = await agent.sessionId;
const db = sql(path.join(coreSessionsDir, `${sessionId}.db`), {
    readonly: true,
    fileMustExist: true,
});
const logs = db
    .prepare('SELECT timestamp, level, action, data FROM SessionLogs')
    .all() as SessionLog[];

if (verboseLog) {
    return logs;
}

return logs.filter(
    (logRow) =>
        logRow.level === SessionLogLevel.warning ||
        logRow.level === SessionLogLevel.error,
);

Works for me. I think this might also be good enough for most Session Data purposes, including logging. Only thing that would be nice if there is an official function to get to the core data (can go directly over the FS basically, locally or remotely), just somehow officially would already be nice enough for what is me concerned.

Only thing that is still missing is that direction from client to server you were mentioning. Would indeed be good if somehow we could log to our session logs indeed, so that the custom logs are also part of it.

An easy fix for me can be to just do it manually and merge them, that is good enough for me, but I guess we might as well do it properly within Secret-Agent if you're on board.

GlenDC commented 1 year ago

This approach I am still taking more or less ,where I read my logs via SQLite and than send them. Would be nice to somehow get this integrated somehow. Same for the cleaning process. I feel that could be 1 epic. I mean I get the use case where you do want to keep it, but might be cool to offer the featue to have it.

I think we can just implement this as a separate ulixee server sub module or w/e which allows you to manage the SQL Session Data.

For the retrieving bits we can for now keep it as simple as only being able to retrieve the logs. This way you do not need to expose the entire data structure of the entire DB, yet.

For the wiping we can already allow to wipe an entire session, as it doesn't require you to expose anything, it's essentail a single blackbox function.

So in short, I think we can provide submodule somewhere for this which provides a class that you can use to interact with the DB.

How does that sound @blakebyrnes ?

blakebyrnes commented 1 year ago

I think we're talking about two pretty different things here. I was imagining an ability to simply "add" to the logs database. We do have an API layer in Core already that exposes a good bit of this kind of information. Maybe exposing logs in there is the right approach for that piece? It's not a documented feature at the moment ... so we'd need to document it if we decide this is going to become first class. I'm on the fence about making that a "stable" api that we stick to though. It continues to move around a good bit..

Your second part about managing sessions and removing the databases is a very valid item we get a lot of requests for. I think a pattern for this would be very nice... however, I'd have to give it a good bit of thought. I think it might be simply subscribing to a 'session-complete' event in Core that exposes a db path, or something to that effect. Then the process can choose whether they want to zip it, send it off, or just wipe it. I think this should probably go in a different ticket (or the "log to session database" api can be a different ticket)

blakebyrnes commented 1 year ago

See #166 for a feature supporting shutdown