ietf-tools / bibxml-service

Django-based Web service implementing IETF BibXML APIs

https://bib.ietf.org

BSD 3-Clause "New" or "Revised" License

sources_sourceindexationoutcome table is growing #357

Closed kesara closed 1 year ago

kesara commented 1 year ago

Describe the issue

The BibXML service stores the outcome of each indexing task in the sources_sourceindexationoutcome table. The indexing tasks run multiple times a day, so the table has grown to over 8GB.

I think this is something that should be logged to the filesystem rather than stored in the database. That would give us the option to ignore or rotate the logs.
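As an illustration of the filesystem option, here is a minimal sketch using Python's standard `logging.handlers.RotatingFileHandler`. The logger name, log path, size limit, and message format are all assumptions made for the example, not anything taken from the bibxml-service code.

```python
import logging
import logging.handlers


def make_outcome_logger(log_path, max_bytes=1_000_000, backup_count=5):
    """Return a logger that writes indexing outcomes to a size-rotated file.

    At most backup_count rotated files are kept next to log_path, so disk
    usage stays bounded at roughly (backup_count + 1) * max_bytes.
    """
    logger = logging.getLogger("indexation_outcomes")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # Drop handlers left over from an earlier call so lines are not duplicated.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()
    handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=max_bytes, backupCount=backup_count
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

An indexing task could then call something like `logger.info("source=%s outcome=%s", source_id, "success")` after each run. `TimedRotatingFileHandler` from the same module would rotate by age instead of size, if that fits better.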

stefanomunarini commented 1 year ago

Hi @kesara, we had a brief internal discussion and thought that perhaps we could delete old records instead?

We could, for example, delete records older than some amount of time X whenever a new SourceIndexationOutcome entry is added to the database. This would avoid the need to implement a new scheduled job (e.g. a weekly job that deletes records older than a week).

What do you think of this approach?

We are, of course, open to moving the logs to the filesystem if there is a good reason to.
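For illustration, the prune-on-insert idea could look roughly like the sketch below. It uses the standard-library `sqlite3` module and a simplified, hypothetical table (`indexation_outcome` and its columns are made up for the example) so that it runs on its own. In the actual Django project the same logic would presumably sit wherever a SourceIndexationOutcome is created, and the one-week default is only an assumption.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical, simplified schema; the real table has different columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS indexation_outcome (
    id INTEGER PRIMARY KEY,
    source_id TEXT NOT NULL,
    recorded_at REAL NOT NULL,   -- Unix timestamp (UTC)
    successful INTEGER NOT NULL
)
"""


def record_outcome(conn, source_id, successful,
                   max_age=timedelta(days=7), now=None):
    """Insert one outcome, then delete outcomes older than max_age.

    Returns the number of old rows that were deleted.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - max_age
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT INTO indexation_outcome "
            "(source_id, recorded_at, successful) VALUES (?, ?, ?)",
            (source_id, now.timestamp(), int(successful)),
        )
        deleted = conn.execute(
            "DELETE FROM indexation_outcome WHERE recorded_at < ?",
            (cutoff.timestamp(),),
        ).rowcount
    return deleted
```

Because the delete runs in the same transaction as the insert, the table never holds more than about `max_age` worth of rows, and no separate scheduled job is needed. An index on the timestamp column would keep the delete cheap on a large table.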

kesara commented 1 year ago

@stefanomunarini I am happy with that approach.

strogonoff commented 1 year ago

@kesara To expand on why results are stored in the database by original design: the management GUI (and potentially the API) exposes those results. Not sure anyone’s using it, but I did use it for troubleshooting during some phases of development.

It’s feasible that we may want to simplify the management GUI to omit that feature, but that could create a problem: if someone is working on the indexation logic and wants to see production outcomes, they might need to nudge you every time or get direct access to the logs somehow. If the relevant logs were exposed in an easy-to-dig-up manner (e.g., via Sentry), then I guess we could stop logging the indexation process/outcomes to the database. But as it is, trimming the table would perhaps make more sense.

kesara commented 1 year ago

@strogonoff No one is using that user interface for troubleshooting. The only use case right now is monitoring the reindexing statuses. I think it's okay to have it logged without providing web access since the queries around these sorts of issues are rare.

strogonoff commented 1 year ago

My 2 cents: this was an intentional design decision, and “no errors occurred so far” looks like a flawed reason for removing functionality that facilitates error resolution[0].

Indexation is essential to this service’s functionality, and any related issues in production (of which there used to be a few at earlier stages) are both difficult to reproduce locally and need to be resolved quickly. Removing a way for developers to see immediately what’s wrong should therefore be done with the awareness that, should indexation issues occur (including data schema compatibility issues etc.), degradation of functionality may be prolonged.

[0] Unless that functionality is replaced by something equivalent, which, if I understand correctly, “logging to a production server’s filesystem that developers don’t have (or should not have) direct access to” does not seem to be…

stefanomunarini commented 1 year ago

That's a good point, @strogonoff. If @kesara can confirm that the above approach is still the desired one, I will proceed with its implementation. We can, of course, have a more thorough discussion about replacing it with something equivalent, as proposed by Anton above, if needed.

kesara commented 1 year ago

I'm okay with keeping it as it is since there are no errors. Something to consider here is that bib.ietf.org runs indexing continuously. This is because we want to make any new IDs and RFCs available as soon as possible.