grafana / pyroscope

Continuous Profiling Platform. Debug performance issues down to a single line of code
https://grafana.com/oss/pyroscope/
GNU Affero General Public License v3.0

Object storage vs disk storage #2450

Open Krithika3 opened 12 months ago

Krithika3 commented 12 months ago

Describe the bug

Object storage is not working as anticipated: profile data previously written to S3 is not queryable after the deployment is re-created.

To Reproduce

Run pyroscope in microservices mode or monolith and configure object storage using S3. Run it for a few hours (3-4) so that profiles are written to S3. Delete your microservices deployment and then re-apply it, you should see profile data from the previous few hours but you do not.

Expected behavior

Environment

Additional Context

Krithika3 commented 12 months ago

Does object storage with S3 work as expected, i.e., can profile information still be pulled after pod deletion or pod recycling? I can use a PVC, which seems to work as expected, but the access mode needed on the PVC would be ReadWriteMany; I believe my PVC restrictions allow only ReadWriteOnce, which means only pods on the same node can read and write to the PVC.

kolesnikovae commented 12 months ago

Hello @Krithika3, thank you for filing the issue.

Run it for a few hours (3-4) so that profiles are written to S3.

Can you confirm the blocks were actually uploaded to S3 and you can browse them there? By default, ingesters produce blocks every 3 hours and some time is needed for a block to be uploaded.

you should see profile data from the previous few hours but you do not.

Could you please clarify whether you see any data after the re-deployment? Pyroscope serves queries for recently ingested data from the local disks of ingesters. There is a configuration parameter responsible for that:

# The time after which a metric should be queried from storage and not just
# ingesters. 0 means all queries are sent to store. If this option is enabled,
# the time range of the query sent to the store-gateway will be manipulated to
# ensure the query end is not more recent than 'now - query-store-after'.
# CLI flag: -querier.query-store-after
[query_store_after: <duration> | default = 4h]

Note that the time span must be larger than

  # Upper limit to the duration of a Pyroscope block.
  # CLI flag: -pyroscopedb.max-block-duration
  [max_block_duration: <duration> | default = 3h]

Thus, if your persistent volumes were removed, data younger than query_store_after will not be visible, even if the blocks were produced and uploaded to the object store.
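For reference, a minimal sketch of a config that keeps these two settings consistent; the placement of the keys under querier and pyroscopedb is assumed from the CLI flag prefixes, so double-check it against the config reference:

    # Sketch only: section names inferred from the CLI flag prefixes
    # -querier.query-store-after and -pyroscopedb.max-block-duration.
    querier:
      # Must be larger than pyroscopedb.max_block_duration (3h by default),
      # otherwise recent blocks may be served by neither tier.
      query_store_after: 4h
    pyroscopedb:
      # Upper limit to the duration of a Pyroscope block.
      max_block_duration: 3h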

I'm also wondering about the PV durability – could you please confirm that the same volumes were mounted and no data was deleted?

Does object storage with S3 work as expected, i.e., can profile information still be pulled after pod deletion or pod recycling?

Pyroscope has two storage tiers: one for hot data (ingesters) and another for cold data (object storage). The only reason for having the second tier is to reduce the TCO, as it is obviously cheaper than local disks. It is not meant to be used as a fallback storage for the first tier (ingesters), which already has replication that should prevent data loss caused by a disk/pod failure or deletion. However, if you delete the majority of the cluster members (e.g. 2 of 3 ingesters in total), data loss is inevitable.

I can use a PVC, which seems to work as expected, but the access mode needed on the PVC would be ReadWriteMany; I believe my PVC restrictions allow only ReadWriteOnce, which means only pods on the same node can read and write to the PVC.

Persistent volumes are used by ingesters. Each of them has its own local storage which is not shared with the others. The replication is done at the distribution stage and not at the storage level, therefore shared disk access is not needed.
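To illustrate why ReadWriteOnce is enough: each ingester replica gets its own volume, e.g. via a StatefulSet volume claim template. The snippet below is a generic Kubernetes sketch, not the chart's actual template; names and sizes are illustrative:

    # Generic StatefulSet fragment: one ReadWriteOnce PVC per ingester pod,
    # so no shared (ReadWriteMany) volume is required.
    volumeClaimTemplates:
      - metadata:
          name: data            # illustrative name
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi     # illustrative size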

Krithika3 commented 12 months ago

Thanks for the detailed response @kolesnikovae. I was under the impression that I could use either a PVC or S3 storage, but not a combination of the two, for data persistence. My requirement is fairly simple: even if I lose all the ingester, store-gateway, and distributor pods, I want that data to still be available when those pods are replaced, i.e., persisted in either durable storage like S3 or in a PVC. What would be the best way to handle this scenario? Thanks

Also, I can confirm the data is written to S3 at bucket/anonymous/phlaredb//profiles.parquet, bucket/anonymous/phlaredb//meta.json, bucket/anonymous/phlaredb//index.tsdb, and bucket/anonymous/phlaredb//symbols

Krithika3 commented 12 months ago

And when I talk about viewing this data, I mean viewing it in the UI, i.e., by port-forwarding the query-frontend svc.

Krithika3 commented 12 months ago

https://grafana.com/docs/pyroscope/latest/reference-pyroscope-architecture/ This architecture is awesome. Thank you

kolesnikovae commented 12 months ago

Pyroscope requires persistent volumes, while object storage is optional. In order to tolerate failures of local components such as ingesters, distributors, store-gateways, etc., you need to deploy multiple replicas of each component and make sure the replication factor is configured in accordance with your requirements (3 by default).
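For example, per-component replica counts can be set in the helm values; the layout below follows the chart's microservices values (pyroscope.components.<name>.replicaCount), but treat the exact keys as an assumption and check them against your chart version:

    # Sketch of helm values for the microservices deployment;
    # key names assumed from the chart's values-micro-services.yaml.
    pyroscope:
      components:
        distributor:
          kind: Deployment
          replicaCount: 2
        ingester:
          kind: StatefulSet
          replicaCount: 3     # keep >= the replication factor (3 by default)
        store-gateway:
          kind: StatefulSet
          replicaCount: 3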

Krithika3 commented 12 months ago

That makes sense. I have deployed all my pods in the microservices architecture with multiple replicas. I just have one question: we do have scenarios where clusters are upgraded and all the replicas go down and come back up (not together, but one after the other). How do I guarantee that the data persisted to disk by those replicas (prior to the upgrade) will be available to me?

kolesnikovae commented 12 months ago

we do have scenarios where clusters are upgraded and all the replicas go down and come back up (not together, but one after the other). How do I guarantee that the data persisted to disk by those replicas (prior to the upgrade) will be available to me?

All the components except the ingester and store-gateway are stateless, therefore they should survive rollouts (and node upgrades) fine in most cases. You may only need to make sure a pod disruption budget (PDB) is created for each of the deployments (enabled by default in the helm chart).

Failure/unavailability of a single ingester replica does not cause any losses, and multiple simultaneous failures can be tolerated with a higher replication factor. Depending on the cluster configuration, you may need to configure pod anti-affinity to ensure that multiple ingester pods are never scheduled on the same node (the same applies to store-gateways). If replicas restart gracefully, local data will be available after the restart (provided that the PV provider ensures durability of the volumes), but data ingested during the unavailability window is not recovered, as it is expected that there is at least one more replica holding the data.
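A generic Kubernetes pod anti-affinity snippet for the ingester pods could look like the sketch below; the label selector is an assumption and has to match your actual pod labels:

    # Generic pod spec fragment: keep ingester replicas on different nodes.
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/component: ingester   # assumed label
            topologyKey: kubernetes.io/hostname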

Store-gateways keep critical information about the blocks in the object storage in memory (e.g. block meta, the TSDB index, and so on). They are stateful but do not have any persistent data volumes. By default, store-gateways only maintain a single replica, therefore they can't tolerate the failure of even a single instance. However, as this information can be recovered at service start, this should not be a problem. Otherwise, you'd need to adjust this configuration parameter:

    # The replication factor to use when sharding blocks. This option needs be
    # set both on the store-gateway, querier and ruler when running in
    # microservices mode.
    # CLI flag: -store-gateway.sharding-ring.replication-factor
    [replication_factor: <int> | default = 1]

PDB and anti-affinity setup should also be applied to store-gateways to avoid interruption of the service during cluster maintenance.
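If you need to create such a PDB manually, a generic Kubernetes sketch follows; the name and labels are assumptions, adjust them to your deployment:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: pyroscope-store-gateway        # assumed name
    spec:
      maxUnavailable: 1                    # at most one replica down at a time
      selector:
        matchLabels:
          app.kubernetes.io/component: store-gateway   # assumed label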

Krithika3 commented 12 months ago

Thanks @kolesnikovae. This was pretty helpful. So this means I can leave the microservices setup as is, without needing to configure any specific PVCs for the ingesters or the store-gateways (as long as I run multiple replicas)? I am running multiple replicas of all components (distributor, ingester, querier, store-gateway).

kolesnikovae commented 12 months ago

I'm happy to help!

So this means I can leave the microservices setup as is, without needing to configure any specific PVCs for the ingesters or the store-gateways (as long as I run multiple replicas)?

I'd say PVCs are required for ingesters if you want your data to be stored in a durable fashion. Without a volume (or with an ephemeral volume), ingesters can't store data locally. So, for example, if they are restarted sequentially one after another, recently ingested data will not be available until it becomes "visible" in the object storage (see querier.query-store-after, 4 hours by default).

Krithika3 commented 12 months ago

I see. How can we set separate PVCs for each ingester? Is there a way to configure that?

kolesnikovae commented 12 months ago

There are two options:

  1. Let k8s manage it for you. Setting the pyroscope.persistence.enabled option to true in the helm chart should do the trick; pyroscope.persistence.size should also be adjusted as per your requirements (see the values sketch below). The StatefulSet controller handles persistent volumes in a special way, dynamically creating PVCs for replicas (however, depending on how your cluster is configured (PV provisioner), you may need to pre-create a PV: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#provisioning).
  2. Manually create all the PVs and specify the PVC template in the helm chart via pyroscope.persistence.existingClaim (more about the pre-created PVs).

Update: I doubt our helm chart allows for the second option actually.
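For the first option, a minimal helm values sketch; only pyroscope.persistence.enabled and pyroscope.persistence.size are taken from the comment above, anything beyond that would be chart-version specific:

    # Minimal values sketch for option 1: dynamically provisioned PVCs per replica.
    pyroscope:
      persistence:
        enabled: true
        size: 10Gi    # adjust to your retention requirements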

Krithika3 commented 11 months ago

Thanks @kolesnikovae

Krithika3 commented 11 months ago

I did exactly what is prescribed in option 1