Developer documentation: relation postgres <--> Ceph #2658

VannTen commented 2 years ago

It's not very clear from the code or the documentation what Ceph is used for:

I can see here that we store task results, and here how to access them, but there should be a high-level description of its role, comparable to the Postgres schema that we can generate with generate-schema.

Currently it's a bit hard to get a definite idea of what is inside (and consequently, what should be deleted, see #2657)

Also, is it only used as an S3 store? In that case, why reference Ceph in particular?

/kind documentation

VannTen commented 2 years ago

@thoth-station/devs I'd appreciate it if anyone has some insight or pointers regarding this.

I see a number of purge_* methods in thoth/storages/graph/postgres.py with a ceph.delete call inside. Does that mean that deleting data from Ceph is already taken care of whenever the corresponding data is purged from Postgres?

VannTen commented 2 years ago

/cc @harshad16

harshad16 commented 2 years ago

Let's gather what kind of details would help everyone. To start, let's try to understand the Ceph and Postgres databases with the current problem statement in hand, i.e. deleting an existing package index:

Ceph is document storage: we keep a lot of documents in our Ceph services, produced by various components. For example, when we run a solver, we keep its results in Ceph and sync the result into the Postgres db with the help of the storages module. The document is kept in Ceph so that we can replicate the result to the Postgres db again if ever needed. Other examples are adviser runs and package-extract runs.
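To make the flow above concrete (a component run writes its result document to Ceph, then the result is synced into Postgres), here is a minimal sketch. It is not the actual thoth-storages code: a dict stands in for the Ceph bucket, sqlite3 stands in for Postgres, and `store_document`, `sync_solver_document`, the table layout and the document fields are all hypothetical.

```python
import json
import sqlite3

# Stand-in for the Ceph bucket: object key -> serialized JSON document.
object_store: dict[str, str] = {}

# Stand-in for Postgres (hypothetical table layout).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE solved_package ("
    "document_id TEXT, package_name TEXT, package_version TEXT)"
)


def store_document(prefix: str, document_id: str, document: dict) -> str:
    """Keep the original result document in the object store."""
    key = f"{prefix}/{document_id}"
    object_store[key] = json.dumps(document)
    return key


def sync_solver_document(key: str) -> int:
    """Replicate the content of one stored document into the database."""
    document = json.loads(object_store[key])
    document_id = key.rsplit("/", 1)[-1]
    rows = [
        (document_id, pkg["name"], pkg["version"])
        for pkg in document["result"]
    ]
    db.executemany("INSERT INTO solved_package VALUES (?, ?, ?)", rows)
    db.commit()
    return len(rows)


key = store_document(
    "solver",
    "solver-example-1",
    {"result": [{"name": "flask", "version": "2.1.0"}]},
)
synced = sync_solver_document(key)
```

Because the document stays in the object store after the sync, the same `sync_solver_document` call can be repeated later to rebuild the rows, which is the reason given above for keeping the documents.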

The list of these kinds of results can be found here: https://github.com/thoth-station/storages/blob/master/thoth/storages/__init__.py#L20-L42 The notable point is that the results stored in Ceph are mostly the output of a component run.

In the case of the Python package index: we don't have a specific document in Ceph for each registered index, at least. We do have a table in the Postgres db.

Hopefully the above makes clear what kinds of results go into Ceph. Now let's try to understand what to remove from Ceph when purging something from the Postgres db.

As I mentioned, these docs are kept in Ceph so that they can be synced to the Postgres db if ever needed in the future. So when we remove a whole group of something, we try to remove it from Ceph as well, so that it doesn't accidentally get synced again. For example, if we delete solver runs for a specific Python/OS version, we also purge the docs related to them.
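A rough sketch of that purge pattern, under the same stand-in assumptions (a dict instead of the Ceph bucket, sqlite3 instead of Postgres). The `solver_run` table, the key naming and `purge_solver_runs` are hypothetical and only illustrate the idea of deleting the rows and their source documents together; the real purge_* methods may work differently.

```python
import sqlite3

# Stand-in for the Ceph bucket: object key -> document body.
object_store = {
    "solver/solver-fedora-35-py39-aaa": "{}",
    "solver/solver-fedora-35-py39-bbb": "{}",
    "solver/solver-rhel-8-py38-ccc": "{}",
}

# Stand-in for Postgres (hypothetical table layout).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE solver_run ("
    "document_id TEXT, os_name TEXT, os_version TEXT, python_version TEXT)"
)
db.executemany(
    "INSERT INTO solver_run VALUES (?, ?, ?, ?)",
    [
        ("solver-fedora-35-py39-aaa", "fedora", "35", "3.9"),
        ("solver-fedora-35-py39-bbb", "fedora", "35", "3.9"),
        ("solver-rhel-8-py38-ccc", "rhel", "8", "3.8"),
    ],
)


def purge_solver_runs(os_name: str, os_version: str, python_version: str) -> int:
    """Delete matching rows and the documents they were synced from."""
    where = "os_name = ? AND os_version = ? AND python_version = ?"
    params = (os_name, os_version, python_version)
    rows = db.execute(
        f"SELECT document_id FROM solver_run WHERE {where}", params
    ).fetchall()
    # Remove the source documents first, so a later sync cannot restore the rows.
    for (document_id,) in rows:
        object_store.pop(f"solver/{document_id}", None)
    db.execute(f"DELETE FROM solver_run WHERE {where}", params)
    db.commit()
    return len(rows)


purged = purge_solver_runs("fedora", "35", "3.9")
```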

As we won't have documents directly depending on the index, we should check for an indirect relation:

- one method could be to check through the Postgres relations, or
- as all our calls go through POST API calls, we can check for dependencies there.

We could find that the solver result doc keeps track of the index. So we should work towards checking whether the sync logic considers the index; if it does, we should purge solver docs along with the index deletion. This could be a little tricky, so it requires further discussion, if needed.
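If solver documents do record the index, finding the ones to purge could look roughly like the sketch below. The document layout (`result` / `tree` / `index_url`) and the function name are assumptions made for illustration and would need to be checked against real solver documents.

```python
import json


def documents_referencing_index(store: dict, index_url: str) -> list:
    """Return the keys of solver documents that mention the given index."""
    matching = []
    for key in sorted(store):
        document = json.loads(store[key])
        # Hypothetical layout: each resolved package entry records its index.
        packages = document.get("result", {}).get("tree", [])
        if any(pkg.get("index_url") == index_url for pkg in packages):
            matching.append(key)
    return matching


store = {
    "solver/doc-a": json.dumps(
        {"result": {"tree": [
            {"package_name": "flask", "index_url": "https://pypi.org/simple"}
        ]}}
    ),
    "solver/doc-b": json.dumps(
        {"result": {"tree": [
            {"package_name": "torch", "index_url": "https://example.com/simple"}
        ]}}
    ),
}
to_purge = documents_referencing_index(store, "https://example.com/simple")
```

Scanning every document is the simplest approach; if Postgres already links solver runs to an index, querying that relation for the document ids would avoid reading the whole store.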

Additional answers:

> Also, is it only used as an S3 store? In that case, why reference Ceph in particular?

Ceph is an open source storage service that exposes an S3-compatible API, so our documentation sometimes refers to S3 calls or an S3 store. However, Ceph is the service that is actually deployed and used; since its calls are also S3 calls, both names show up in various places.

VannTen commented 2 years ago

Let's see if I can summarize what I understand, to see if I really do understand. I'll see if I can work on a PR to add that in docs after that.

We store documents in Ceph. Those are the results of various kinds of operations. Those documents are original data (they can't be reconstructed from the Postgres db). Postgres references those documents.

Follow-up questions:

- I'm not sure I understand what's going on in the sync process. If something exists in Ceph, will it be referenced/created in Postgres?
- Can we map one document in Ceph to exactly one entity in Postgres? In other words, do we have a many-to-one relation between Ceph documents and Postgres entries (or another relation, or does it depend)?
- Do the documents in Ceph reference back to Postgres entries?
- Which one of them is the single source of truth, Postgres or Ceph? Or do they each participate in it?

On Fri, Jul 15, 2022 at 01:12:07AM -0700, Harshad Reddy Nalla wrote:

> Additional answers:
>
> > Also, is it only used as an S3 store? In that case, why reference Ceph in particular?
>
> Ceph is an open source storage service that exposes an S3-compatible API, so our documentation sometimes refers to S3 calls or an S3 store. However, Ceph is the service that is actually deployed and used; since its calls are also S3 calls, both names show up in various places.

Yeah, I see what Ceph is. My question is more: do we use it exclusively through the S3 API, or do we also use other features, like CephFS or block storage?

In the first case, we might drop the references to Ceph in the docs and in the code and simply work with an S3 API, which could be backed by any service providing that API (the fact that it's backed by Ceph would be an operational detail).
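One way to picture "the backend is an operational detail" is a small storage interface that the rest of the code depends on. The sketch below is hypothetical (`ObjectStore`, `InMemoryStore` and `count_documents` are not names from thoth-storages); an S3-backed class with the same four methods could be swapped in without touching the callers.

```python
from typing import Iterator, Protocol


class ObjectStore(Protocol):
    """Minimal operations the callers need, whatever service backs them."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def delete(self, key: str) -> None: ...
    def list_keys(self, prefix: str) -> Iterator[str]: ...


class InMemoryStore:
    """Test double; a class wrapping an S3 client would expose the same methods."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def delete(self, key: str) -> None:
        self._objects.pop(key, None)

    def list_keys(self, prefix: str) -> Iterator[str]:
        return (key for key in sorted(self._objects) if key.startswith(prefix))


def count_documents(store: ObjectStore, prefix: str) -> int:
    """Caller code only sees the interface, never Ceph or S3 specifics."""
    return sum(1 for _ in store.list_keys(prefix))


store = InMemoryStore()
store.put("solver/doc-1", b"{}")
store.put("adviser/doc-2", b"{}")
solver_count = count_documents(store, "solver/")
```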

VannTen commented 2 years ago

@mayaCostantini Any thoughts ?

harshad16 commented 2 years ago

> Let's see if I can summarize what I understand, to see if I really do understand. I'll see if I can work on a PR to add that in docs after that. We store documents in Ceph. Those are the results of various kinds of operations. Those documents are original data (they can't be reconstructed from the Postgres db). Postgres references those documents.

Yes, it's the original result data. It can't be reconstructed from the Postgres db; the only way to reconstruct it is to re-run the operations.

> Follow-up questions:
> - I'm not sure I understand what's going on in the sync process. If something exists in Ceph, will it be referenced/created in Postgres?

That is not true for every sync, but some of them are designed that way. For example, the graph-refresh component schedules solver runs for packages that are missing in the Postgres db, so it would try to re-sync the document from Ceph if it is not in the Postgres db.
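The "re-sync what is missing" idea boils down to a set difference between the document ids present in the object store and the ids already known to the database. A minimal sketch, with a hypothetical `find_unsynced` helper and made-up keys (this is not how graph-refresh is actually implemented):

```python
def find_unsynced(stored_keys, synced_ids, prefix="solver/"):
    """Return ids of stored documents that have no counterpart in the database."""
    stored_ids = {
        key[len(prefix):] for key in stored_keys if key.startswith(prefix)
    }
    return sorted(stored_ids - set(synced_ids))


stored_keys = ["solver/doc-1", "solver/doc-2", "adviser/doc-9"]
synced_ids = ["doc-1"]
unsynced = find_unsynced(stored_keys, synced_ids)
```

This also shows why purging has to touch both sides: a document left in the store after its rows are deleted would show up here as unsynced and be brought back.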

> Can we map one document in Ceph to exactly one entity in Postgres? In other words, do we have a many-to-one relation between Ceph documents and Postgres entries (or another relation, or does it depend)?

The mapping is closer to many-to-many. Postgres has tables and the Ceph store has multiple directories (they should have been different buckets; however, we use one bucket with a different directory in it for each kind of operation). For example, an adviser result and a solver result are two different documents, and they are used in different places in the Postgres db tables.
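As an illustration of the "one bucket, one directory per operation" layout, the sketch below groups object keys by their leading directory. The key names are made up and `group_by_directory` is a hypothetical helper, not part of the storages module.

```python
from collections import defaultdict


def group_by_directory(keys):
    """Group object keys of a single bucket by their top-level directory."""
    groups = defaultdict(list)
    for key in sorted(keys):
        directory, _, document_id = key.partition("/")
        groups[directory].append(document_id)
    return dict(groups)


keys = ["solver/solver-1", "adviser/adviser-1", "solver/solver-2"]
layout = group_by_directory(keys)
```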

> Do the documents in Ceph reference back to Postgres entries?

Ceph doesn't reference back to Postgres; it's the other way around: Postgres references Ceph data via document ids.

> Which one of them is the single source of truth, Postgres or Ceph? Or do they each participate in it?

The two serve different purposes, so which one is the single source of truth depends on the perspective. For the devs of the data service, the Postgres db is the source of truth, as the application is designed around its tables. For the devs of the data aggregation, the Ceph store is the source of truth. So we have to keep the two in sync for good results.

> Yeah, I see what Ceph is. My question is more: do we use it exclusively through the S3 API, or do we also use other features, like CephFS or block storage? In the first case, we might drop the references to Ceph in the docs and in the code and simply work with an S3 API, which could be backed by any service providing that API (the fact that it's backed by Ceph would be an operational detail).

We use it through the S3 API, or through packages that support S3; I don't know what these packages use underneath in their architecture. Please feel free to update the docs; we can discuss the specifics in the review.

VannTen commented 2 years ago

Ok, I think I have a relatively good overview, I'll get started :+1:

mayaCostantini commented 2 years ago

@VannTen I think Harshad provided a great explanation, I don't see any more details to add that could be useful. Thanks @harshad16 !

goern commented 2 years ago

Is this something we can extract and summarize into the docs?

VannTen commented 2 years ago

I'm not sure.

I think there are two audiences for this information, Thoth devs and Thoth ops, and they don't need exactly the same information (check out #2661).

Developer docs should stay in this repo I think, but I could see operator docs regarding the storage models being centralized with the rest of the operational documentation (which is the point of the thoth-application issue if I read it correctly).

VannTen commented 2 years ago

/remove-kind bug
/kind documentation
/priority important-longtem

Related (closely) : #2691

sesheta commented 2 years ago

@VannTen: The label(s) priority/important-longtem cannot be applied, because the repository doesn't have them.


VannTen commented 2 years ago

/priority important-longterm

VannTen commented 1 year ago

/sig stack-guidance
/remove-priority important-soon

Also, see #2767
