Kingfisher personal SQL user account

kindly commented 3 years ago

[If you work at/with OCP and the account is for yourself, simply click "Submit new issue". Otherwise, please share the full name of the person whom the account is for, briefly describe their use cases, and indicate which schema ("views") they need access to.]

Does this account have write access in order to create temporary tables? As that would be useful.

jpmckinney commented 3 years ago

This account would be for a read-only user that isn't granted the TEMPORARY privilege.

What do you want to do, and what privileges do you need to which schemas?

kindly commented 3 years ago

I only really require CREATE, INSERT table access to schema view_data_collection_1697. TEMPORARY would be fine but no ideal.

My only worry is that I may recreate the schema at some point so would have to ask again in that case.

jpmckinney commented 3 years ago

Hmm, I don't know if there's a way to solve that using groups or other features: https://github.com/open-contracting/deploy/issues/170

If we changed Kingfisher Summarize to allow overriding the view_data_ prefix, then you could create a schema that matches the username, for which it's easier to have specific privileges.

kindly commented 3 years ago

I think it depends on how many privelages that would be acceptable to give away. We could just hand over ownership of the schema to an (optional) user specified in Kingfisher Summarize. Or have a seperate command for that. The issue with this is the the ocdskfp user will not be able to delete the schema later and the user would have to specify their own connection string in order to delete it.

Any schema owned by the user (even named the same as the user) would be good for my use case. Just any space to create tables (to allow multiple other queries to reuse a long running query) is what I need.

jpmckinney commented 3 years ago

I'll look into it. By the way, is the red flag analysis tied to a CRM issue, or is it an independent project?

jpmckinney commented 3 years ago

By the way, the main server's PostgreSQL port is only open to the replica, so you would need to tunnel or SSH into the server to perform write operations.

jpmckinney commented 3 years ago

~~Actually, I think the user would get the TEMPORARY privilege, since it's a default that hasn't been revoked.~~

Update: This privilege has now been revoked. Update 2: Restored, as low risk since only available to users who can connect to the main server over SSH.

kindly commented 3 years ago

I'll look into it. By the way, is the red flag analysis tied to a CRM issue, or is it an independent project?

As far as I understand it is a separate project, with separate funding.

By the way, the main server's PostgreSQL port is only open to the replica, so you would need to tunnel or SSH into the server to perform write operations.

Oh I misunderstood the replication. I thought it would be possible to add tables (potentially only temporary) on the replica without it effecting the process. However, it is hot standby physical replication so that everything on the replica is completely read only. In any case it would still be possible to run anything on that server using the ocdskfp user by sshing in, but I will try and avoid that as it defeats the point of the replica, and means that all code is not explicitly run in the notebook.

Regardless a personal account is still good practice as I am using the generic one in the notebook currently.

The only solutions I can see to this is to use logical replication (would be very tricky to implement) or have a completely separate database for writing data using a postgres fdw. It would be a lot slower though.

jpmckinney commented 3 years ago

as it defeats the point of the replica

I'm not sure I follow?

Yes, we do hot standby physical replication, since our use case is availability. We don't presently have a use case for logical replication. SQL users are provided to third parties for data access only. Complex analyses that require writing tables should instead be performed in a user's own database. Otherwise, we end up managing a "database as a service" which isn't our intention (setting quotas, managing load, etc.). For example, OCCRP uses a server-side cursor to process all available data in the public schema; they store the results in their own databases.

I've created the user draznick and will send the password via chat.

kindly commented 3 years ago

I've created the user draznick and will send the password via chat.

Thanks

I'm not sure I follow?

The point of the replica (I thought) was to take load of the main server as well. I was thinking logical replication for the main kingfisher process database and not have replication on the summarize tables.

Complex analyses that require writing tables should instead be performed in a user's own database

This seems a high barrier for the helpdesk and may limit useful work.

jpmckinney commented 3 years ago

I sent the password earlier via WhatsApp.

This seems a high barrier for the helpdesk and may limit useful work.

When I asked "is it an independent project?", I was asking whether the project was for OCP or not. Anyhow, I found the TOR that I assume Ben Fernz prepared, so you are working for OCP. As such, we can explore options.

Unless we change the replication setup, there is only one existing place where you can write. I don't see this change happening on a short timeline. I can create a draznick schema with draznick as the owner, so that you can write tables there. Not all code will be run by the notebook, but this seems like a minor inconvenience. I recommend copying the SQL to a file on the main server, and running it with psql -U draznick -W -f YYYY-MM-DD-description.sql ocdskingfisherprocess

The point of the replica (I thought) was to take load of the main server as well. I was thinking logical replication for the main kingfisher process database and not have replication on the summarize tables.

If the replica went down, we'd currently change DNS to point the replica (postgres-readonly) to the main server and open the port. That way, analysts have uninterrupted access to the summaries. If the summaries were only created on the replica (making it essentially a Kingfisher Summarize server), they'd need to wait for the replica to come back to continue analysis.

To achieve the same availability in that scenario, we'd need to use pglogical for either bi-directional or multi-master replication, but we prefer to stick to just PostgreSQL for easier management and upgrading. We can consider that, but right now we have only one request to do so.

So, yes, another goal is to move non-application queries off the main server, so that each activity (processing vs analysis) doesn't interfere with the other. (Another goal is security – the list goes on.)

jpmckinney commented 3 years ago

Alternately, we have created separate databases, like for https://github.com/open-contracting/oc4ids_database. We could do the same and set up a foreign data wrapper as suggested.

I don't think it will be that slow, and there are simple ways to make it faster:

Set fetch_size https://www.cybertec-postgresql.com/en/foreign-data-wrapper-for-postgresql-performance-tuning/
Set use_remote_estimate https://blog.imaginea.com/foreign-data-wrapper-query-performance/

Analysts seem accustomed to waiting minutes/hours for other operations, but with FDWs we're talking differences in milliseconds or at most seconds.

kindly commented 3 years ago

Thank you for the login and sorry for the confusion about the work, just I did not find the issue on the CRM as it is separate from the normal helpdesk contract.

I found an interim (fairly hacky) solution to the problem in the meanwhile. I installed a postgres database in the notebook container, which surprisingly works fairly well (but will be wiped when the notebook container is).

From there I used FDWs in the notebook instance, which can mirror the schema I am working on. This means I can make local tables and they actually give you about 60GB of disk. This will be a fairly similar setup to the external database solution above (except for the more temporary nature of the data in the notebook and slower performace).

The main speed issue is that my use case is when making local tables and then a need to join back to the server, which will be inevitably slow I think. Will look at the tips in the links above to see if they help in this case. However, I just copied the data (as its not large) to the local server for now for the cases that needed to be faster.

Happy to close this issue for now, and explore the fdw way as it will be a good way to explore the pros and cons of it. . We could consider a database if this method works and at some point need more persistence/speed.

jpmckinney commented 3 years ago

Sounds good! Happy to create an AWS RDS database for persistence/speed.

Alternately, you can use the same RDS database as used for OC4IDS analysis. That database doesn't have a strict 'identity'; it's just designed for analysis work, until we have clear and significant needs that would motivate a change to our servers.

It'll be interesting to document the FDW setup in case it is relevant to future analysis (which I guess will be auto-documented if it's all in the notebook).

open-contracting / deploy

Kingfisher personal SQL user account #236