hackbg / undexer

🎉 The Undexer 🎉 Namada network indexer powering https://shielded.live/. See also: https://github.com/hackbg/undexer-node
5 stars 6 forks source link

More trouble #8

Closed opsecx closed 2 weeks ago

opsecx commented 3 months ago

@egasimus now I try again, fresh complete repull.

It seems everything has been altered with how db is handled, not much to my liking.

first, when I run node src/config.js, I get:

"file:///home/(user)/undexer/src/config.js:87 VALIDATOR_FETCH_PARALLEL, ^

ReferenceError: VALIDATOR_FETCH_PARALLEL is not defined at file:///home/(user)/undexer/src/config.js:87:5 at ModuleJob.run (node:internal/modules/esm/module_job:222:25) at async ModuleLoader.import (node:internal/modules/esm/loader:316:24) at async asyncRunEntryPointWithESMLoader (node:internal/modules/run_main:123:5)

Node.js v20.15.0"

So I can't check if it parses the env variables correctly.

When I run ./undexer index, I get the following:

"./undexer index ⏳ Starting @hackbg/undexer 3.0.0... ⏳ Compiling TypeScript... ⌛ Compiled TypeScript in 0.032s

✔ DB Creating database "housefire-envelope.b8f955720ab"... × DB error: permission denied to create database at /home/(user)/undexer/node_modules/.pnpm/pg@8.12.0/node_modules/pg/lib/client.js:526:17 at process.processTicksAndRejections (node:internal/process/task_queues:95:5) at async file:///home/(user)/undexer/src/db.js:24:3 { length: 93, severity: 'ERROR', code: '42501', detail: undefined, hint: undefined, position: undefined, internalPosition: undefined, internalQuery: undefined, where: undefined, schema: undefined, table: undefined, column: undefined, dataType: undefined, constraint: undefined, file: 'dbcommands.c', line: '332', routine: 'createdb' } × DB Failed to create database "housefire-envelope.b8f955720ab". See above for details."

I'm not about to give undexer the master-password for my entire postgres installation, so it can create databases at will.(!)

The previous setup I gave it username and password for a specific database which I expected it to use (may I recommend using separate schemata like Namadexer did for the different network ids?), now it seems to want to create databases for each network? Unless it doesn't parse the .env which I halfways suspect is the issue here, but again can't test.

Please advise.

opsecx commented 3 months ago

Behaves the same way regardless of whether .env is present, so starting to think the issue is in parsing that.

egasimus commented 3 months ago

ReferenceError: VALIDATOR_FETCH_PARALLEL is not defined

Should be fixed in bd54544

I'm not about to give undexer the master-password for my entire postgres installation

Makes sense

, so it can create databases at will.(!)

It only creates a database corresponding to the chain id - maybe give it a subuser that can create DBs but not access other stuff?

(may I recommend using separate schemata like Namadexer did for the different network ids?),

Unfortunately Sequelize's support for that isn't too great, but maybe something can be figured out.

Thanks for the feedback!

opsecx commented 3 months ago

It only creates a database corresponding to the chain id - maybe give it a subuser that can create DBs but not access other stuff?

Why was the model changed from the more obvious one of simply providing the db to use?

opsecx commented 3 months ago

Rather major change imo (and also one most db admins would be super uncomfortable with..)

opsecx commented 3 months ago

Not sure if this helps re schemata, but just a suggestion of course. (other options could be table names prefixes for different networks etc, but using schemata seems a lot more structurally clean to me)

https://stackoverflow.com/questions/42497254/sequelize-schema-for-postgresql-how-to-accurately-define-a-schema-in-a-model

egasimus commented 3 months ago

Yes, this is somewhat context-dependent. For example, we run a separate Postgres instance just for Undexer - so this change makes it 1 step easier to get things running when testnet is upgraded.

On the other hand, If you're using the same PG to back various things, evidently the change is suboptimal. That kind of thing is exactly why your feedback is so valuable!

Commit https://github.com/hackbg/undexer/commit/aee7a73f8beefd52dbf3e80b2b5210fc3ab500b1 should make it work both ways, please let me know if it fixes it for you.

There seem to be various opinions on how schemas and DBs should be used, see https://stackoverflow.com/questions/1152405/is-it-better-to-use-multiple-databases-with-one-schema-each-or-one-database-wit. Those all make sense to me, but it seems like they depend on everyone's use case. Personally, I'm partial to the following data architecture:

And now that you've helped me articulate it, we might even get there - eventually :grin:

opsecx commented 3 months ago

Commit https://github.com/hackbg/undexer/commit/aee7a73f8beefd52dbf3e80b2b5210fc3ab500b1 should make it work both ways, please let me know if it fixes it for you.

Exactly how will the configuration pick up which world we are operating in? ie how do I tell it to do it the way I want in config?

There seem to be various opinions on how schemas and DBs should be used, see https://stackoverflow.com/questions/1152405/is-it-better-to-use-multiple-databases-with-one-schema-each-or-one-database-wit. Those all make sense to me, but it seems like they depend on everyone's use case.

I very much agree with the person who says this in your link: "A PostgreSQL "schema" is roughly the same as a MySQL "database". Having many databases on a PostgreSQL installation can get problematic; having many schemas will work with no trouble. So you definitely want to go with one database and multiple schemas within that database." (you will see this is by far the most upvoted answer. over and against the second answer taking the opposite position, they're describing a different usecase. in addition having everything in one db gives you some interesting possibilities in using queries to compare data between networks, something that won't be possible if completely separating the databases)

Personally, I'm partial to the following data architecture:

  • 1 database per undexer instance

  • 1 schema per schema version, to facilitate upgrades

I think this is a slight confusion of terminology. Just because both are called "schema", it's really two separate concepts you're discussing imo. I don't see how it benefits anyone to retain older db-schematics layouts when upgrading (except for creating redundant data in the db). (again I refer to the quote above - "schema" in postgres isn't just the schematic layout but data containers)

  • add chainId column in each table to disambiguate between networks

the solution with chainID prefixes is workable but imo using the postgres schema concept is much closer to logically separating what you want to achieve. (one "schema" (postgres terminology) per network.) also makes it a lot handier to delete a network from the db (DROP SCHEMA CHAINID). I see why one could think multiple databases are just as good, but that creates the issue of the software wanting some level of superuser access to work, as well as potentially cluttering the postgres installation if used for multiple purposes and not just undexer. Postgres schemas are perfect for the usecase at hand.

egasimus commented 3 months ago

Exactly how will the configuration pick up which world we are operating in? ie how do I tell it to do it the way I want in config?

I don't see how it benefits anyone to retain older db-schematics layouts when upgrading (except for creating redundant data in the db).

Currently, the simplest way for us to be certain that the indexed data is complete and consistent after a schema change is to... just reindex everything from scratch :grimacing: Which is slow, and quickly becomes unworkable.

Using PG schema objects to represent different versions of the database's schema would make it somewhat easier to upgrade the database schema incrementally; and, on the meta level, to make incremental progress towards implementing proper schema migrations :grin:

For example:

Just because both are called "schema", it's really two separate concepts you're discussing imo.

I absolutely discern them as two separate concepts; but I don't think the PG devs would name them the same by accident. That's why I suspect that the intended use case for "PG schema" objects is to represent the evolution of the schema. Of course, this is just conjecture; imho it would make more sense to be looking at the documentation, not so much at StackOverflow. But SO appears first in search results :man_shrugging:

the solution with chainID prefixes is workable but I see why one could think multiple databases are just as good, but

Neither of those responds to just having the chain ID as a field in the data: and not needing multiple databases, nor multiple schemas (except, possibly, to facilitate schema migrations), nor table name prefixes - in the first place.

Historical tidbit: Undexer v1 stored everything as directories of JSON files :melting_face: This avoided a lot of the finer points of "the one true way to use Postgres correctly"... on the other hand it was a pretty batshit solution :bat: It was a compromise solution - the original idea was to index directly to HTML files and do away with even more of the complexity. (After all, indexers are a stopgap solution - all of the data should be trivially queriable from the node's built-in database, except reasons.)

opsecx commented 3 months ago

Exactly how will the configuration pick up which world we are operating in? ie how do I tell it to do it the way I want in config?

  • If you don't specify DATABASE_URL, by default the DB name will equal CHAIN_ID

  • If you do specify DATABASE_URL, you can make the DB name whatever you want, as before.

  • If your user does not have rights to CREATE DATABASE, you will have to create the database manually.

    • Come to think of it, in this case it might actually crash even if the DB exists; you're welcome to contribute the fix.

Alright, I'll give it a go and get back with results.

I don't see how it benefits anyone to retain older db-schematics layouts when upgrading (except for creating redundant data in the db).

Currently, the simplest way for us to be certain that the indexed data is complete and consistent after a schema change is to... just reindex everything from scratch :grimacing: Which is slow, and quickly becomes unworkable.

Using PG schema objects to represent different versions of the database's schema would make it somewhat easier to upgrade the database schema incrementally; and, on the meta level, to make incremental progress towards implementing proper schema migrations :grin:

For example:

  • we have v1.blocks and v1.txs

  • we want to change the schema of how blocks are stored in a backwards-incompatible way

  • we create v2.blocks, migrate the data to that

  • once we're sure all is good, we drop v1.blocks

  • meanwhile, v1.txs has been left alone throughout (there has been no opportunity to accidentally drop it or corrupt it or anything)

I don't see how this precludes using schemata as network containers. Having an upgrade facility is good, and I can see how it makes sense using schemata for this (though temporary tables do the same job), but nothing precludes doing this while using schemata/schemas for networks and renaming them to temporary during upgrade procedure..

Just because both are called "schema", it's really two separate concepts you're discussing imo.

I absolutely discern them as two separate concepts; but I don't think the PG devs would name them the same by accident.

in this case I think they did a bad call on the naming. One search results I can't access fully says "Specifically, schemas are the implementation in postgres of the namespace concept.", which makes a lot of sense to me.

That's why I suspect that the intended use case for "PG schema" objects is to represent the evolution of the schema. Of course, this is just conjecture; imho it would make more sense to be looking at the documentation, not so much at StackOverflow. But SO appears first in search results :man_shrugging:

Documentation does not hint at this understanding. In fact it states:

"There are several reasons why one might want to use schemas:

-To allow many users to use one database without interfering with each other.

-To organize database objects into logical groups to make them more manageable.

-Third-party applications can be put into separate schemas so they do not collide with the names of other objects.

Schemas are analogous to directories at the operating system level, except that schemas cannot be nested."

So personally I think they used a bad nomenclature for "schema" in this regard, more apt would be "containers" or "spaces" or such. Anyways here's the doc (I maintain it would be a much cleaner design sticking to one db and using one schema per network, with temporary renames for upgrades etc)

https://www.postgresql.org/docs/current/ddl-schemas.html

the solution with chainID prefixes is workable but

I see why one could think multiple databases are just as good, but

Neither of those responds to just having the chain ID as a field in the data: and not needing multiple databases, nor multiple schemas (except, possibly, to facilitate schema migrations), nor table name prefixes - in the first place.

Sure that would be workable, missed this option if you described it above - thought you only talked about prefixes. Was about to say I like this a lot, but I wonder if it would not be an inefficient model for queries if multiple networks have data tables that grow sufficiently large. Would also make for easy mistakes when querying - using separate schemata would make any such mistakes impossible. But yes it's certainly an idea, but for the reasons stated I don't favor this (primarily due better segregation of data in the other models).

Historical tidbit: Undexer v1 stored everything as directories of JSON files :melting_face: This avoided a lot of the finer points of "the one true way to use Postgres correctly"... on the other hand it was a pretty batshit solution :bat: It was a compromise solution - the original idea was to index directly to HTML files and do away with even more of the complexity. (After all, indexers are a stopgap solution - all of the data should be trivially queriable from the node's built-in database, except reasons.)

:)

opsecx commented 3 months ago

Come to think of it, in this case it might actually crash even if the DB exists; you're welcome to contribute the fix.

This is exactly what happens now. It attempts to create the named database I supplied (which already exists) with a "failed to create database". For now solved with giving user access to create databases, nbd (but should probably be checked at some point). Thanks for the fix on selection of database!

opsecx commented 3 months ago

Excited to see if it will keep indexing after last block now.. Currently syncing

opsecx commented 3 months ago

I got another crash between then and now. What's more worrying is it seems when it picks up again it starts from scratch and truncates (or recreates?) the data tables, so it actually does start syncing from scratch. Can't say if it made it all the way up to final block as unfortunately didn't query the db before I started reindexing. So now it's reindexing again. Will keep an eye on it and report back here. (the specific error was something like not being able to query validator set for epoch ~4200 if memory serves. looking at testnet blockheights - housefire - that could well correspond to last epoch at the time)

egasimus commented 2 weeks ago

Closing, let's continue in https://github.com/hackbg/undexer/issues/12 (and there is another retention issue in https://github.com/hackbg/undexer/issues/13). I don't see yet the moving part which causes these, but evidently there's something going on.