opencost / opencost

Cost monitoring for Kubernetes workloads and cloud costs
http://opencost.io
Apache License 2.0

CSV export on S3 should not replace the whole file each time #2185

Open sdaberdaku opened 1 year ago

sdaberdaku commented 1 year ago

**Is your feature request related to a problem? Please describe.**
Currently the CSV export functionality on S3 replaces the whole file each time. S3, like other object storage services, does not support appending to objects, only replacing them. This means the entire CSV export is rewritten on every run.

**Describe the solution you'd like**
It would be better if the exports were split per reference day, with one new CSV file created each day and the corresponding timestamp/date in the filename. The CSV export on S3 would then scale much better.

**Describe alternatives you've considered**
N/A

**Additional context**
N/A

AjayTripathy commented 1 year ago

Seems like a great idea @sdaberdaku! It would be great to get a community contribution on this one.

lmello commented 12 months ago

This would be a nice-to-have feature. A possible implementation:

1. Generate the export data for the last full day (the CSV exporter sets the window to `YYYY-MM-${DD-1}-00:00:00Z,YYYY-MM-${DD-1}-23:59:59Z`).
2. No need to download the existing CSV file anymore in the CSV exporter; always generate the last full day on each run.
3. Upload the resulting export to a prefix, sharded by date: `prefix/year=YYYY/month=MM/day=DD/export.csv` (these are Hive-style partitions).

To accomplish this, we will need some changes to the csv_exporter and the file manager class to add support for prefixes instead of a single filename.
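The date-sharded layout from step 3 can be sketched as a small helper. This is only an illustration, not OpenCost code; the `partition_key` function name and the `exports` prefix are assumptions:

```python
from datetime import date, timedelta

def partition_key(day: date, prefix: str = "exports") -> str:
    """Build a Hive-style partition key for one day's CSV export.

    Example result: exports/year=2023/month=07/day=09/export.csv
    """
    return f"{prefix}/year={day:%Y}/month={day:%m}/day={day:%d}/export.csv"

# Export window target: the previous full day (here pinned for illustration).
yesterday = date(2023, 7, 10) - timedelta(days=1)
key = partition_key(yesterday)
# key == "exports/year=2023/month=07/day=09/export.csv"
```

The generated key would then be used as the S3 object name for that day's upload, so each run writes a new object instead of replacing one shared file.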

Extra feature: Handle backfills of CSV data

Possibly add a command-line parameter or an API endpoint that could be called to export a specific date range in case we need to backfill, e.g. `opencost:9003/csv_export/backfill?from=2023-06-22&to=2023-07-10` or `./app --csv_export --backfill --from 2023-06-20 --to 2023-07-10`.
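A backfill like the one suggested above boils down to iterating over every day in the inclusive `from`/`to` range and exporting each one. A minimal sketch (the `backfill_days` helper is hypothetical, not an existing OpenCost function):

```python
from datetime import date, timedelta
from typing import Iterator

def backfill_days(start: date, end: date) -> Iterator[date]:
    """Yield each day in the inclusive [start, end] backfill range."""
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)

# For each yielded day, the exporter would generate that day's window
# and upload it under its own date-sharded prefix.
days = list(backfill_days(date(2023, 6, 22), date(2023, 6, 24)))
# days == [date(2023, 6, 22), date(2023, 6, 23), date(2023, 6, 24)]
```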

sntxrr commented 11 months ago

My company could really use a feature as described above. We have many clusters we want to export data from at daily granularity, and we want the exports to land in an S3 bucket with a year/month/day/env/region type of bucket structure.

lmello commented 11 months ago

I am building a Python script to implement this feature. I might share it with OpenCost as an added feature that implements this without changing the current CSV export.

What I am doing is querying the API, converting the result to Parquet format (for its compression) and pushing it to S3.

If there is enough interest we could discuss sharing it in OpenCost as an implementation for this feature request.
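The query-then-convert step can be sketched as flattening the API response into rows before handing them to a Parquet writer. This is an illustrative assumption about the response shape (a `data` list of name-to-allocation maps) and the `allocation_rows` helper is hypothetical, not part of any published script:

```python
def allocation_rows(resp: dict) -> list[dict]:
    """Flatten an OpenCost /allocation-style response into flat rows.

    Assumes resp["data"] is a list of {name: allocation} maps, one entry
    per window step; only a couple of fields are kept for illustration.
    """
    rows = []
    for step in resp.get("data", []):
        for name, alloc in step.items():
            rows.append({"name": name, "totalCost": alloc.get("totalCost", 0.0)})
    return rows

# Sample payload standing in for a real API response.
sample = {"code": 200, "data": [{"default/web": {"totalCost": 1.25}}]}
rows = allocation_rows(sample)
# rows == [{"name": "default/web", "totalCost": 1.25}]
```

From here the rows could be loaded into a `pandas.DataFrame` and written out with `to_parquet`, which is where the compression benefit mentioned above comes in.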

sntxrr commented 11 months ago

> I am building a Python script to implement this feature. I might share it with OpenCost as an added feature that implements this without changing the current CSV export.
>
> What I am doing is querying the API, converting the result to Parquet format (for its compression) and pushing it to S3.
>
> If there is enough interest we could discuss sharing it in OpenCost as an implementation for this feature request.

I would most certainly be interested! You can find me on the CNCF slack in the #opencost channel with the same username sntxrr if you want to chat further.

mattray commented 9 months ago

Dropping a link to https://github.com/opencost/opencost-parquet-exporter as an option once the code lands there.