Some doubts about the ease of development of the current architecture

jpmckinney commented 4 years ago

In short: we have a lot of our own code, for functionality that we'd get for free from Django, but our code is not as well architected as Django, making this repo harder to develop, and it is less familiar in layout and design than Django, making it harder for other developers to pick up.

database.py and models.py together do a lot of work to setup the ORM, seemingly without taking full advantage of SQLAlchemy's features; i.e. we use classical mappings in database.py, then create our own simple classes in models.py, instead of just using declarative mappings. We have 20 lines of code in database.py for a simple MyModel.objects.get(id=1) in Django (not the only example)…

As noted in #230 and #231, we do a lot of passing around of config and database variables, which makes the code harder to test, edit and maintain, because the logic is much harder to follow than if these were guaranteed to be only set once (as in Django apps).

As far as testing goes, Django offers TransactionTestCase, which is much faster than dropping and recreating the tables for every test. As we add more tests to get full coverage and to cover edge cases, it's going to take minutes to run the suite, which makes development slower and more frustrating. I tried to use the pytest-flask-sqlalchemy package to do this, but our passing around of database variables means it doesn't work, as is.

(Smaller points: We rig up our own basic CLI command, whereas in Django we could just use management commands. For what we're doing in the web API, there's not much difference between Flask and Django, but just noting that there'd be no disadvantage to Django here.)

I was planning on adding tests to get full coverage (like I've done with Toucan, Extension Explorer, etc., before doing any refactoring or development), and then incrementally making changes in focused PRs: a first change being to use a queue for one of the transforms to simplify its "early return" logic and see if there were any other improvements viz. speed or complexity. But, the architecture makes even adding tests difficult (and slow to run).

This repo isn't doing that much: storing data, checking it, performing two transforms, and allowing deletion of data. It's only 3k lines, excluding migrations – and 1k is taken up by database.py which is verbose, as mentioned.

So, I'm presently of a mind to see how far I can get re-building on Django (and, of course, using the same database schema as we have now), since it's looking like the effort to make incremental changes will exceed the effort to re-build (while keeping the parts that work). The goals are to (1) make the repo easier for others to contribute to, by (1a) not having to learn any eccentricities of this architecture and (1b) being able to carry over knowledge about Django (CoVE, Toucan, etc. use Django), (2) to implement some of the architectural changes that were planned anyway, and (3) to thereby make it easier to add new features in a responsive way in the future.

Anyhow, perhaps I'll discover that this architecture is already the better option, but I'm willing to spend the time to test my hypothesis…

robredpath commented 4 years ago

@jpmckinney Thanks for sharing your thoughts here after our call yesterday; it's great to hear what you're thinking about this!

What are you looking for from us in terms of engagement around this? We're very happy to support/talk things over/review code/look into things, if you'd like us to. I'm conscious that the current architecture was designed to meet the needs that we foresaw over the course of the next 12 months, 18 months ago - so it's definitely coming to the end of its design life. We've held off from any suggestions of re-architecting in view of the anticipated project that's on the roadmap to significantly re-engineer kingfisher-process, but it's something that we've naturally given thought to as we've been working on it. That said, a lot of careful attention has been given to some of the decisions that we've made, and I think that it would be helpful to identify those and assess their relevance to the future needs of kingfisher-process users.

We've recently build a (kind-of) similar system to Kingfisher for 360Giving recently, and we've made use of more of Django's features for that. I'd be happy to arrange a call with @michaelwood (who did that work) and yourself to look at what we've done there, and reflect on the pros and cons.

I think it's worth noting that our use of Django in CoVE is relatively lightweight and doesn't use many of Django's features, I'd be wary of overestimating the skills transferability to a Django-based Kingfisher.

jpmckinney commented 4 years ago

Thanks, @robredpath. #230 and #231 have some more specific design questions, where the current design might have been the result of careful thinking, so it'd be good to clarify those.

I agree that it'd be good to identify decisions that were made carefully and/or based on lessons learned, and to assess their relevance for future needs. How do you propose to do that?

Some initial topics might be:

Any strong preferences for SQLAlchemy? I know it's better for complex SELECT queries, but (1) we don't have many of those, (2) there are ways to do them in Django with a little effort, including simply using SQLAlchemy, and (3) for analysis, queries are run as plain SQL anyway, rather than through an ORM.
Any strong preferences for Redis? We're now using RabbitMQ for the Data Quality Tool, so it'd make our stack simpler to use RabbitMQ in Kingfisher. If we're going to use more queues as discussed elsewhere, we might want the queue to survive reboots (even if we can rebuild the queues based on the database). Redis has its AOF persistence, but it's not as durable as a message broker like RabbitMQ. We might also want a message broker's other features for monitoring and controlling the queues.
Any strong views on composing lower-level components together (Flask, SQLAlchemy, etc.) rather than using a higher-level framework? In my experience (and as it looks here), using lower-level components seems like a good idea at first, but you end up spending increasing amounts of time on the 'glue' between them, and if you're not exceedingly careful in how you write the 'glue', it will become the main source of errors and effort, rather than the features. In that respect, the lower degree of freedom when using frameworks like Django is a blessing.

I'm happy to learn from 360Giving's datastore. I don't know what specific questions to ask, but if you and @michaelwood have ideas or reflections to share, let's arrange a call.

In terms of software lifecycle, there was an original Kingfisher design 18 months ago, then a rewrite 12 months ago… and a potential rewrite now (of Process). It seems like a lot of rework.

We might not see as much skills transfer within the team, but given that Django tends to have one way of doing things (with ample documentation), I think it's still favorable to a system that each developer will have to learn in more detail, with less documentation to support.

jpmckinney commented 4 years ago

I've successfully replicated the models in Django, such that no changes to the database are necessary. I've also started to work out an all-queues architecture.

robredpath commented 4 years ago

@jpmckinney great!

On the subject on specific technology choices - I think that having discussions on GitHub (in a similar vein to #230 and #232) would be best. I think it would be helpful to ask two questions - both "what was the rationale behind the choice of this technology?" and "are there any specific ways in which this has been used which are of particular note?". I think that way people will be able to think about the choices that we've made in particular functional areas.

On the subject of wider technology choices (components vs frameworks), I think that a call could be good to discuss, as it's quite a high-level question. I think it's important to see it in terms of software maturity, as well - what was suitable for the previous iteration may not be suitable for the future.

On software lifecycle - it's been an iterative process, as you might expect for a new tool in a rapidly-developing field; the user needs that we see now were inconceivable a couple of years ago. Each iteration has been designed for approximately twice the life of the previous one - and so we should now be looking at a design, architecture, procedures and documentation that will last us 2-3 years, and support wider collaboration on the code. For example, in developing the current iteration, we deliberately went for lightweight decision-making processes and favoured features now over maintainability/develop-ability later. What was appropriate then, isn't now, and so the next version should look as different from the current code as the current code does from ocdsdata.

I've asked @michaelwood and @odscjames for availability for calls and I'll let you know when I hear back; hopefully we'll be able to get the first in this week.

robredpath commented 4 years ago

We briefly discussed two of the topics raised today:

SQLAlchemy was a historic decision; we preferred it over raw SQL early on and we've used it since - so no objection to using a different ORM.
Regarding stack simplification, was there a reason that the DQT didn't follow Kingfisher and use Redis? ie - were there RabbitMQ features that they needed that Redis didn't have an equivalent for? We chose Redis for fairly pragmatic reasons (ease of learning and inspection of the queue, mostly, along with some existing team expertise). A change of course incurs a learning cost, but we've got no strong preference either way.

I'll look at getting a call scheduled to talk about the frameworks vs component point shortly.

jpmckinney commented 4 years ago

We want our queuing backend to:

be durable, so that we don't need to implement additional tools to check/rebuild the queue
require workers to acknowledge messages, so that they can be redelivered if a worker dies
push messages to workers, rather than having workers (often uselessly) poll the broker

Regarding Redis vs RabbitMQ:

Redis can't be made as durable as RabbitMQ. Increasing Redis' durability requires additional configuration and incurs increasingly severe performance penalties.
To use Redis like RabbitMQ, you have to use Streams with consumer groups, which are about as foreign as RabbitMQ, if you're used to using Redis as a key-value store with list operations.

Datlab had already used RabbitMQ, so we went with that. Neither ODS nor Datlab (I assume) has used Redis Streams, so even in retrospect, RabbitMQ seems like the better option. Redis Streams are also only available in Redis 5, whereas it looks like only Redis 4 is in Ubuntu 18. As it's a relatively new feature, there's much less documentation and support for it than for RabbitMQ, which has nice tutorials. All that said, interactions with the queueing backend ought to be encapsulated such that most code changes wouldn't require any knowledge of the specific backend.

robredpath commented 4 years ago

That's really helpful - thanks @jpmckinney !

open-contracting / kingfisher-process

Some doubts about the ease of development of the current architecture #232