symphonycms / symphony-next

The next big thing.

Database Discussion #12

Open designermonkey opened 11 years ago

designermonkey commented 11 years ago

With the new project, and the fact we're using a new way of modelling the database (Eloquent), the question has come up about how we structure our tables.

Now, currently, we have a super efficient, super scary abstract structure which, without a lot of experience staring at it and trying to decode the sorcery that gave birth to its arcane form, is very, very hard to manage outside of Symphony. With Eloquent, we will be relinquishing some of that abstraction to the ORM, which from a programming and storage point of view is a very good idea.

This doesn't mean, though, that we need to fully leave our abstract structure behind us, and it's with that in mind that I've been doing some playing and thinking about what would be the easiest, safest and most efficient way of utilising our method of working (Sections, Fields, Entries etc.) with a less abstract data model.

With Eloquent (if you've not read about it, I suggest you do, it's really good) we would be building a Model for every table in the DB, and would also have our fields do the same.

Currently, we make an entry in the fields table for each Field, and then have the field_fieldtype table for meta data for that Field. Then, we add a data table for every field instance, using an ID for reference in the name. While very abstract, this is going to be a nightmare in the Next project.

Now. I'm thinking that with the new Models, we should continue to have a fields table, but instead of a single Field instance table, and many tables for data, we have a Field schema table and a Field data table. Like so:

tbl_fields
tbl_fields_schema_selectbox_link
tbl_fields_data_selectbox_link

In the schema and data tables, there would be an id column for uniqueness, and a field_id column to reference the relevant Field, which would allow multiple-value fields to store their multiple values in the same table. The data table would obviously also have an entry_id. I've added some SQL statements below to illustrate this in practice (although I hear we may not need SQL in Laravel?)...

    CREATE TABLE IF NOT EXISTS `sym_fields_schema_selectbox_link` (
        `id` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
        `field_id` INT(11) UNSIGNED NOT NULL,
        `allow_multiple` ENUM('yes','no') CHARACTER SET 'utf8' COLLATE 'utf8_unicode_ci' NOT NULL DEFAULT 'no',
        `show_association` ENUM('yes','no') CHARACTER SET 'utf8' COLLATE 'utf8_unicode_ci' NOT NULL DEFAULT 'yes',
        `related_field_id` INT(11) UNSIGNED NOT NULL,
        `limit` INT(4) UNSIGNED NOT NULL DEFAULT 20,
        PRIMARY KEY (`id`),
        INDEX `field_id` (`field_id`)
    )
    ENGINE = MyISAM
    DEFAULT CHARACTER SET = utf8
    COLLATE = utf8_unicode_ci;

    CREATE TABLE IF NOT EXISTS `sym_fields_data_selectbox_link` (
        `data_id` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
        `field_id` INT(11) UNSIGNED NOT NULL,
        `entry_id` INT(11) UNSIGNED NOT NULL,
        `related_entry_id` INT(11) UNSIGNED NULL DEFAULT NULL,
        PRIMARY KEY (`data_id`),
        KEY `entry_id` (`entry_id` ASC),
        KEY `relation_id` (`related_entry_id` ASC)
    )
    ENGINE = MyISAM
    DEFAULT CHARACTER SET = utf8
    COLLATE = utf8_unicode_ci;

bpierre commented 11 years ago

although I hear we may not need SQL in Laravel

Yes, you should use the schema builder for this task: http://four.laravel.com/docs/schema
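
For illustration, a minimal sketch of the data table above using the Laravel 4 schema builder (table and column names taken from the SQL; everything else is an assumption, not the final migration):

    Schema::create('fields_data_selectbox_link', function($table)
    {
        // Surrogate key plus references to the owning field and entry.
        $table->increments('id');
        $table->integer('field_id')->unsigned();
        $table->integer('entry_id')->unsigned();
        $table->integer('related_entry_id')->unsigned()->nullable();

        // Indexes mirroring the KEYs in the raw SQL.
        $table->index('entry_id');
        $table->index('related_entry_id');
    });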

designermonkey commented 11 years ago

I'm looking forward to something like that, just jotting it down here :o)

ijy commented 11 years ago

That sounds like a much better way of approaching field handling in the database. The current Symphony approach of creating a new table for each field instance seemed a little... strange...

Also don't forget about migrations for any DB changes. And possibly even generators for some really quick setups. :)

streaps commented 11 years ago

noSQL?

ijy commented 11 years ago

Good call. I was just thinking the same in terms of possible integration with other database types such as MongoDB. I'm not sure what Laravel provides in terms of support though so I guess that would be the benchmark limitation.

designermonkey commented 11 years ago

Currently Laravel supports four database systems: MySQL, Postgres, SQLite, and SQL Server.

I would hazard a guess that we can expect this to change. One of the benefits of using Eloquent, Schema and Migrations is the separation from SQL operations.

designermonkey commented 11 years ago

I've added some Laravel Schema definitions to a gist of my take on the new database structure. I'm not expecting all of this to be used by any means, and what is here will probably change as we define the project, but it's a starting point for us.

https://gist.github.com/designermonkey/5646377

iwyg commented 11 years ago

Forget extensions and delegates; they'll be obsolete once we have a real pub/sub system. Extensions shouldn't be registered in the database either.

designermonkey commented 11 years ago

That's what I thought.

I have lots more to push today.

bpierre commented 11 years ago

Thanks for doing this @designermonkey.

Also, I would love to see the Pages and Sections removed from the database, like the Data Sources. These are the two main problems I have to deal with when working with a team on a Symphony project: they make syncing the DB export mandatory, and conflicts are almost impossible to merge manually.

designermonkey commented 11 years ago

I imagine they will be, but for the beginning we could stick to what we know and scope it out to be in the database.

I am a big believer that it should be XML based, and will be pushing for that myself.

bpierre commented 11 years ago

Agree, wise approach.

designermonkey commented 11 years ago

Gist updated

https://gist.github.com/designermonkey/5646377

bernardodiasc commented 11 years ago

On entries (or everything), would it be possible to use SHA-1 hashes (e.g. of a timestamp) as primary keys instead of auto-incrementing integers? Such keys could open a lot of doors (versioning, sync, pluggable stuff)...

designermonkey commented 11 years ago

I did a little research about this a while back, when we were discussing IDs and UIDs for XML based schemas, and how to match them throughout the schemas and database, and the one point stuck out:

Using UIDs in a database goes against its natural method of relating data, and is therefore slower at indexing and returning data, although this speed decrease isn't noticeable until you get to tables with thousands of rows.

That being said, it would be a great thing to do to relate data inside and outside of the database. Any XML schemas will need a database record to ensure they remain the same file if any of their values are changed.

I'll try and find the resources again that I was reading.

designermonkey commented 11 years ago

I've renamed this thread to allow discussion on more DB topics

designermonkey commented 11 years ago

I've been researching again, and from what I see my comment above stands. There are advantages to using GUIDs in place of IDs, one of which is database replication, which becomes really easy to maintain when doing away with RDBMS generated IDs.

It all comes down to speed and performance, against portability of data across systems. As we want to make a system that allows data to be packaged and used across different installs (passing sections around for example) then GUIDs are the way to go.

One of the arguments against it, raised by @brendo a while back (although I forget where) was using IDs in URLs like http://sitename.com/post/21/ vs http://sitename.com/post/eFg56yjk/, both of which in my mind are bad application design.

Another issue we would encounter is that Eloquent and Laravel are designed to use RDBMS-generated IDs when inserting data, joining tables, ORM etc., so we would have a headache there. I'm all for GUIDs personally, and have wanted to do this for a while, but we have to weigh up the pros and cons, and whether we could coerce the Laravel framework to use them fluently instead of IDs.

designermonkey commented 11 years ago

Actually, thinking about my comment about Eloquent and Laravel, I don't think it would be an issue: they would just rely on an auto-increment value, whereas we would be passing the value into a field without auto-increment.

It should be fine then. We just have to weigh up the pros and cons, and what we actually want to achieve here.

+1 from me though :o)

ijy commented 11 years ago

Some useful opinion and some pros and cons:

Seems like it has more benefits than disadvantages.

It looks like Eloquent can also be extended when getting and setting IDs:

And PHP has a ready-made function for this in uniqid.

designermonkey commented 11 years ago

Funny, those are two of the articles I just read :o)

Something simple like

$seed = 'JvKnrQWPsThuJteNQAuH'; // using an application salt would be better
$hash = sha1(uniqid($seed . mt_rand(), true));

takes no time at all to produce a practically unique hash. And then, even if we only use the first 6-10 characters, collisions are extremely unlikely.
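
For illustration (the 8-character length here is just a hypothetical pick from that 6-10 range):

    // Hypothetical: keep only a short prefix of the full 40-character hash.
    $guid = substr($hash, 0, 8);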

ijy commented 11 years ago

Ha, being top of Google gets results. ;)

I've never really thought about GUIDs before (never had the need) but now they've been mentioned I can see their benefits. As you say, I can't really think of many use cases where the ID is called in the URL, and the benefits of having truly unique entry IDs do seem to outweigh a few extra bytes to store them.

Laravel uses an Application key which could be used for the seed. Although that may just be in the full-stack version... Mcrypt would probably do the job anyway.

designermonkey commented 11 years ago

The only reason this has ever come up with Symphony in the past is when we were preparing to migrate the structure (Sections, Fields, Pages) to XML. Doing that while allowing portability between systems would need truly unique identifiers, and a system that could manage what to do when they differed: a database table recording the unique IDs that were loaded, and a check of the filesystem folders in which they resided for changes/additions.

In my opinion, if we are moving to this structure (which we are) it makes sense to change all of the things to GUIDs for ease of implementation. Files should have a GUID, and a 'change' GUID too to check revisions, but we have a thread all about that on the Symphony 2 repo.

Edit: here's the link for reference: https://github.com/symphonycms/symphony-2/issues/1002

creativedutchmen commented 11 years ago

There are advantages to using GUIDs in place of IDs, one of which is database replication, which becomes really easy to maintain when doing away with RDBMS generated IDs.

Replication isn't that hard with IDs either. In MySQL, for example, you can use the auto_increment_increment config option, which will make sure there are no more conflicts between IDs.

Using GUIDs will be a solution to a few problems, but merging database records will always be a can of worms.
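
A hypothetical sketch of that setup (connection details invented): give each writer a distinct offset so generated IDs interleave instead of colliding.

    // Server A generates 1, 3, 5, ...; server B (offset 2) generates 2, 4, 6, ...
    $pdo = new PDO('mysql:host=localhost;dbname=symphony_local', 'user', 'pass');
    $pdo->exec('SET @@auto_increment_increment = 2');
    $pdo->exec('SET @@auto_increment_offset = 1'); // use 2 on server B

In practice these would usually be set once in each server's MySQL configuration rather than per connection.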

jonmifsud commented 11 years ago

The auto-increment offsets can indeed work magic. I've been set up like this, and I don't have issues developing in a multi-developer environment. We have separate offsets, so merging data is not that complex; I've done it with the current setup. The thing is that you need to specifically set up all MySQL instances before you start working.

creativedutchmen commented 11 years ago

This discussion has been on my mind for a while now, and the more I think about it, the more I think we are headed in the wrong direction.

The biggest problem with the database design as it stands now is that it is very hard to understand and query (multiple joins, having to know exactly which id is what field, etc). We could do away with one problem by naming the tables better as @designermonkey pointed out, but why wouldn't we merge all these tables into one?

Imagine selecting an entry from the articles section to be as easy as:

SELECT id, title.*, description.* from `articles` where `published` = 'yes'

Doing it like this would also mean that we can use Eloquent as it is meant. Thoughts?

designermonkey commented 11 years ago

The concept of abstract sections and fields is one that bothered me for a long time when first looking at Symphony, but over the time I've used it, I understand why it is a necessity in a way.

Normalising it as much as possible is required to move forward, but doing what we know and love in Symphony means using a method like the one I outline above.

We wouldn't need too many joins: each table is a Model in Eloquent, meaning each field would need a Model for each of its tables, and the PHP would join the data together in an overarching Model class (not extending Eloquent), containing functions to get the relevant data from each Model.

michael-e commented 11 years ago

I understand why it is a necessity in a way

I don't. I work a lot in the database directly, so I can find my way. But I don't really understand why this level of normalization is better than having all fields of a section in a single table (as proposed by @creativedutchmen).

However, I remember that (when Symphony 2 started) there were discussions about reducing the normalization level, but in the end it was considered the better solution. Maybe @allen can tell us the reasons. Or maybe you can, @designermonkey?

designermonkey commented 11 years ago

We have two concepts at play here (for those reading this that don't get it), abstraction and normalisation. Abstraction generally makes a database or system more manageable by software like PHP, whereas normalisation makes it more manageable by users. In the case of Symphony, the database is entirely managed by the software, and in itself has some abstract concepts at play.

Symphony builds Sections and Fields as though they were tables in a database, albeit with a lovely GUI for users to interact with them. This lets users build systems as abstract or as normalised as the end user needs them to be. To have this work efficiently from a code perspective, abstraction at the database level is a must; otherwise it gets complicated, with more points of possible failure in the code interacting with the database.

Normalised database

A database where the underlying schema is normalised to a user's level of understanding is very hard to manage from a code perspective, as the schema can be changed at a whim to whatever the Section desires, by adding columns for Fields whenever the developer wants. This is a headache, almost a migraine, as custom SQL would be required for every alteration to the schema. There could be no vanilla approach to the SQL: we would need to join SQL snippets much like we currently do with JOIN statements, but not as JOINs, rather as snippets to build a standard SELECT. This is inherently a bad idea, and is what JOINs are for.

We would also need to incorporate a lot of ALTER TABLE statements to manage the changing schema, and they are not the same across SQL dialects. Indexing data whose schema changes so often is also complicated, as we would have to add and remove indexes at the same rate as columns. This all puts a massive overhead on the database interaction for a system like Symphony.

For a system that will have a fixed schema, like Wordpress et al, this isn't a problem, but for a schema creation system like Symphony? Bad idea.

Abstracted database

It's much easier to manage a database with a set schema that uses rows rather than columns to configure the system's abstract concept of Sections and Fields, as this data can be transported more easily between systems (especially if using GUIDs), whereas transporting data from one altered schema to another becomes a massive headache. A fixed schema means that any extensions, plugins etc. will inherently know what they are dealing with from a code perspective, and can adjust data in rows within that fixed schema. These rows can also be indexed efficiently, as they are rows, not schemata.

Symphony is the database as far as the user is concerned; the database in reality becomes just an abstract storage engine. We have to remind ourselves that our level of abstraction and customisation comes at a price, and that price is normalisation at the RDBMS level. It's just not possible to manage this efficiently.

We can of course normalise it to some degree, making the understanding of the abstract concept easier (as I propose above) but complete normalisation isn't possible for a concept like Symphony.


God knows I've argued with a Symphony database a lot and sometimes hated the concept, but using it over and over again has taught me why it is needed. It's a toss-up between system-level abstraction and customisation versus database normalisation and understanding. We can't have both, and that is one of the reasons a system like Symphony is so powerful.

What I propose removes that initial WTF??? moment when looking at a Symphony database by merging all the entry data tables into one per field. We can't, however, reduce them to a single schema for each section, as we would lose that abstract ability we have in the admin interface.

creativedutchmen commented 11 years ago

normalisation makes it more manageable by users

No, it isn't. Normalisation is about having every bit of data in a single location, and linking to the data rather than copying it (a gross simplification, I know!). I would say that the current database design is highly normalised, but it is far from easy for a user to understand.

In fact, normalised databases are much easier to work with from within code, as the database itself can handle many of the additional tasks (deleting children when a parent is deleted, for example), and the risk of corrupt/incomplete/missing data is reduced.

To be honest I do not understand your point about the JOINs. Can you elaborate a bit please?

I do appreciate your concern about index recreation whenever a section is modified, but keep in mind these operations are only very time-consuming when the tables you're dealing with are huge. Mine certainly aren't, especially not in the period when schema changes occur most often: early development. To be honest, I would prefer a much simpler layout to work with, even if that means I have to do a huge migration on weekends or at night.

creativedutchmen commented 11 years ago

As a sidenote: we do not have to write ALTER TABLEs ourselves. That's exactly what migrations are for. Since Eloquent is meant to be used with these kinds of tables, all the nightmares are handled for us.
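
For instance, a minimal sketch of a Laravel 4 migration (the `articles` table and `published` column are hypothetical) that alters a section table without any hand-written SQL:

    use Illuminate\Database\Migrations\Migration;

    class AddPublishedToArticles extends Migration {

        public function up()
        {
            Schema::table('articles', function($table)
            {
                // The schema builder emits the ALTER TABLE for us.
                $table->enum('published', array('yes', 'no'))->default('no');
            });
        }

        public function down()
        {
            Schema::table('articles', function($table)
            {
                $table->dropColumn('published');
            });
        }
    }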

michael-e commented 11 years ago

I'm trying to understand this better.

@designermonkey:

This is a headache, almost a migraine, as custom SQL would be required for every alteration to the schema.

Do I understand this right? If every simple field was allowed to insert columns in a "section table", that would be potentially complicated or even dangerous? (Unless "migrations" would handle this, as @creativedutchmen says.) Correct?

For a system that will have a fixed schema, like Wordpress et al, this isn't a problem, but for a schema creation system like Symphony? Bad idea.

Same point as above, right?

Generally, it might be a good idea to do some research on how experienced Laravel users organize data in the database.

iwyg commented 11 years ago

I'd take a look at the DB structure of craftcms.

designermonkey commented 11 years ago

I would say that the current database design is highly normalised

I would disagree, as all of the data isn't in one single place; it's in many multiple tables that are joined together to create an entry. Normalised would mean that a single entry is in a single table, as you're proposing. Normalisation, adding data in one location, makes it manageable by users. We're arguing the same point really.

Can you elaborate a bit please?

Yep. We currently JOIN lots of tables together to build a representation of a single entry; using a changeable schema would mean stringing together lots of bits of SQL to allow us to reference all of the possible alterations we make on the schema. What if we get a string wrong, especially when creating the schema?

Anyway, I have had this explained to me by @allen and @brendo before, as to why it was done this way; I can't find the conversations in my emails though. Also, if I remember rightly, they explained that it was done the way @creativedutchmen is proposing while trialling, but it didn't work right, or something like that. I won't speak for others, as I can't fully remember the conversations.

jonmifsud commented 11 years ago

Normalised would mean that a single entry is in a single table

@designermonkey are you sure? For me, normalization means reduction of data replication. That is, as long as the data is available in a single place (one table), it doesn't matter how many others it is joined with. So having the first name in one table and the last name in another is completely acceptable, as long as there is not another table containing the same user's first/last name. I'd say Symphony as a 'core' is very normalized.

The way Symphony is built, this step falls far more on the developer building systems with Symphony than on Symphony itself.

adding data in one location, makes it manageable by users

Symphony already allows me to do this. Whether I've got 1 table or 100, what users will ever interact with is the Symphony backend; no matter how many tables/links we generate in the background, for the user it's always a single entry. If, on the other hand, you're looking at developers, that's a totally different matter.

I've worked on extensions and data sources which pulled entry data from the backend, and in all honesty it's not that complicated. Maybe what would make things simpler is a better API wrapper, to make interaction with the database easier and more manageable for developers who do not want to delve into the Symphony database structure.

designermonkey commented 11 years ago

Well I'm obviously getting my terminology wrong, but that's not the point I'm trying to make. It's just semantics.

designermonkey commented 11 years ago

Swap normalised for simplified and it makes more sense then.

creativedutchmen commented 11 years ago

I would disagree, as all of the data isn't in one single place; it's in many multiple tables that are joined together to create an entry. Normalised would mean that a single entry is in a single table, as you're proposing.

No, that's not what it's about. An excerpt from wikipedia:

Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database using the defined relationships.

Normalisation is about storing every bit of data in one place. For instance, if you are building a customer/purchases database this means that you don't store your user information with each order, but you store your customer and link the order to it. So when a customer later updates his email address, you update it in a single place, not on every order he ever placed.

Edit: oops, I am too late to the party with this...

using a changeable schema would mean stringing together lots of bits of SQL to allow us to reference all of the possible alterations we make on the schema.

Not really. Every change in the schema would be catered for by a migration, which is completely separate from the way the entries are fetched. Because the schema is very simple and clean between migrations, filtering is quite easy. You could do something like Posts::where('title.handle', '=', 'my-title')->get(); or even Posts::where('created_at', '>=', date('Y-m-d'))->where('title.handle', 'LIKE', '%something%')->get(). As you can see, there is no need to mess with SQL ourselves; this is why we want an ORM like Eloquent in the first place. They are built for this.

Because the schema is reflected in the Models we will be using, the ORM will know how to build the queries, even if they get more complex and involve relations.
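
To make that concrete, here is a hypothetical Eloquent model for the one-table-per-section layout (the Posts class, articles table and published/created_at columns are all invented for the example):

    class Posts extends Eloquent {

        // One section = one table; one entry = one row.
        protected $table = 'articles';

        // A query scope so common filters read naturally.
        public function scopePublished($query)
        {
            return $query->where('published', '=', 'yes');
        }
    }

    // Usage: today's published posts, with no hand-written JOINs.
    $posts = Posts::published()
        ->where('created_at', '>=', date('Y-m-d'))
        ->get();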

The biggest reason I can think of why Symphony hasn't used this kind of schema yet is that the migration tools and the ORM are very time-consuming projects. But since they have been developed for us, I think it would be very wise to just build upon them and not try to reinvent the wheel!

michael-e commented 11 years ago

not try to reinvent the wheel

Wise words. If we trust in the framework (and we must do this), we should use as many parts of it as possible. This means "delegating work to someone/something we trust", as opposed to "use a bit of it, but try and do the rest better anyway". In my experience the latter generally means a lot of work with little outcome.

designermonkey commented 11 years ago

I'm in complete agreement that we should use an ORM, and therefore Eloquent, and therefore not re-invent the wheel, but you're both missing the point, which is: We are not building a simplified data structure, we are allowing developers to build simple and complex data structures, and these need a more complex data structure at the RDBMS level, which is why Symphony is currently built the way it is.

The problem we currently have with an ORM is that our PHP code has none built in; it has nothing to do with the database structure. The database structure is the way it is because our developers use the Symphony admin to build what is best described as a pseudo-RDBMS on top of a real RDBMS.

Also, @creativedutchmen I don't understand where, in your example, you are getting title.handle from? What is the dot in aid of and what does it represent?

designermonkey commented 11 years ago

An ORM in this context takes objects (rows) from a single database table and relates them, in object terms, to another single object (row) from another single database table, using Models. A Model is a representation of a single database table. And, if the relationship requires it, a pivot table is used (think section_associations) and the ORM just manages that link. It's code based, not schema based, which is what we're discussing here.

This has nothing to do with table layout in the context of what we're trying to achieve, and what we currently achieve with Symphony. We have to keep thinking about the bigger picture here: what Symphony allows developers to do.
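
In Eloquent terms, the pivot-managed link described above looks something like this (a sketch only; the Section model and the pivot column names are assumptions):

    class Section extends Eloquent {

        protected $table = 'sections';

        public function associations()
        {
            // Eloquent manages the section_associations pivot table itself;
            // we never have to query it by hand.
            return $this->belongsToMany('Section', 'section_associations',
                                        'parent_section_id', 'child_section_id');
        }
    }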

designermonkey commented 11 years ago

(Sorry for going on and on about this)

A Field would be a model (FieldCheckbox), and would have its table set to fields_schema_checkbox, for example. It would extend the main Field class, which in turn extends Model (Eloquent).

The main Field class would manage data using defined methods to add meta data to the fields table.

The FieldCheckbox class would have defined methods to manage insertions etc. into the fields_schema_checkbox table, and also methods to get and set data in the fields_data_checkbox table. The parent class's methods for managing insertion into the fields table would need to use the parent::$table way of working to get the right table when used via the FieldCheckbox class.

Now, granted, I haven't figured out the details of how this would work, but you can start to see how a Model works in our context.
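
A rough sketch of that shape (purely illustrative; the class names, table names and relation are all unresolved assumptions):

    class Field extends Eloquent {

        // Shared handling of meta data in the fields table lives here.
        protected $table = 'fields';
    }

    class FieldCheckbox extends Field {

        // Settings for this field type.
        protected $table = 'fields_schema_checkbox';

        public function data()
        {
            // Per-entry values live in a separate data table.
            return $this->hasMany('FieldCheckboxData', 'field_id');
        }
    }

    class FieldCheckboxData extends Eloquent {

        protected $table = 'fields_data_checkbox';
    }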

creativedutchmen commented 11 years ago

We are not building a simplified data structure, we are allowing developers to build simple and complex data structures, and these need a more complex data structure at the RDBMS level, which is why Symphony is currently built the way it is.

I disagree here. Even though we are building complex structures, these structures can easily be built using the schema I proposed. There is absolutely no need to abstract the data as far as you are proposing, because the tools that make a simple schema work have already been written.

Just imagine having a single table for each section, and that each entry is a row in that table, with each field being a column. Doesn't this scream Symphony to you? To me it does, because it's exactly the way the backend itself is laid out. Now, to say we are building structures that are too complex for this sounds strange to me, since that would mean they no longer fit within Symphony.

I don't understand where, in your example, you are getting title.handle from?

It is an example. Most fields will have a few bits of data they need to store; for a text field this is value and handle. It made sense to me to prefix the attribute with the field name. But, of course, this is just an example: title_value could work just as well, or titleValue, or whatever. The title.handle would simply be the name of the column in the database.

designermonkey commented 11 years ago

I see what you mean with the dot now.

I think we need @brendo's input here.

jonmifsud commented 11 years ago

Seems to make sense to me; with some slight concerns.

  1. What will happen when you have very large sections? I've read on the forum that some people have over 70 fields in a section at times. Could this impact performance, especially when a data source does not require all fields? Will it affect indexing when an entry is created?
  2. How would it handle adding new fields later on? The current database structure in Symphony just creates the links, without affecting tables/entries in their original state. Using this approach you would either have to set a default value; or else set the value(s) as null. Not sure if this could/would change behaviour after the update(s). I'd be especially concerned, as the number of changes I/my team get to do post-launch is quite substantial.

designermonkey commented 11 years ago

However I look at this, what @creativedutchmen is proposing just won't work. My Address Location Field, for instance: the field itself has 14 columns alone, not to mention the meta data in the fields table.

creativedutchmen commented 11 years ago

Sorry for going on and on about this

I think it's an important discussion, and it's much better to have it here than in a chatroom or email conversation where we will lose it in a while :)

Let me add though that every schema we pick now will have its benefits and downsides. I am a big fan of putting the priority on read performance and simplicity, but I might be taking it too far when it comes to writes and schema changes, I don't know.

I've read on the forum that some people have over 70 fields in a section at times.

Having that many fields is a sign that the section itself can be normalised quite a bit. Simply because it is possible shouldn't mean you should also do it :)

Using this approach you would either have to set a default value; or else set the value(s) as null.

Yes. This could definitely be a performance killer when you're adding a field with a default value to a complex section with hundreds of thousands of records, as the default value will have to be inserted for each of those entries. Now, I know this is not pleasant, but keep in mind the migration that deals with it can be executed on weekends or in the evening.

The thing I really like about this approach is that it closely resembles a database itself, so everybody with experience in database design will know how data can be structured better when performance problems occur, simply because the same rules apply (the database maps almost 1:1 to what you see in Symphony).

creativedutchmen commented 11 years ago

My Address Location Field, for instance: the field itself has 14 columns alone, not to mention the meta data in the fields table.

Sure, other things won't fit either: many-to-many relations, for example. But the good thing is: it doesn't matter.

Simply put, every field should make sure its own data is properly normalised. When your extension needs 14 columns, this is a sign that you could group things, and put them in a separate table. There is absolutely nothing wrong with this.

A many-to-many relation needs a pivot table for the entries table to remain sane, so the field should create one. But much simpler fields like a checkbox, file upload or textbox do not need an extra table, and I think nobody creating a website with Laravel from scratch would put them in one.

allen commented 11 years ago

What John has said is accurate and reflects the design decisions we've made with Symphony to this point.

So to put things succinctly:

A way to think about a normalised DB is to think of each table as a Lego block. At any point, Symphony can join Lego blocks to form a cohesive data chunk. It can also drop blocks that are no longer needed. A less normalised structure is akin to chopping up a block that is meant to be whole.

The analogy can be extended to performance too. It's easier to fetch a single block than it is to fetch pieces of associated blocks and then have to combine them on the go.

So the main reason we chose the higher normal form for Symphony has to do with our thinking about large numbers.

First, we tested a large number of columns versus a large number of tables. Our tests showed no significant difference in performance between the two. However, there were differences when it came to fetch and alter requests.

Not surprisingly, it's faster for the DB to select from a single, albeit wide, table, but when it came to altering that table, performance took a significant hit. At one point I think we even managed to corrupt the table while altering it.

With the highly normalised table structure, select queries performed several notches below the alternative. However, as we reduced the number of joins required (simulating fewer included elements in the XML), performance started to equalise. Since the tables in the higher normalised form are all much leaner, there were situations where it was faster than a wide table. That said, this kind of performance gain is not something you can really bank on, since we just don't know how the system is going to be used. When it came to altering tables, it was quick and easy to do, given the highly normalised nature of the structure.

At one point, we did consider a hybrid approach: core fields were combined, while extension fields were normalised and joined. This idea went by the wayside when we decided all fields in Symphony were equal citizens.

The nail in the coffin for us really came down to extensions. Knowing that the system did not have to worry about extensions messing up a single shared table is something we found very comforting.

It's again important to point out that our decisions were made when an ORM was not under consideration.

allen commented 11 years ago

I've missed an important paragraph above. When it came down to thinking about large numbers, the complexity of the "monolithic table" started to approach that of the "small but many tables" structure. Select statements on the monolithic table became more and more sluggish, with fewer optimisation routes possible. With the "small but many" table structure, we were able to isolate problematic areas and selectively optimise them.

Lastly, to solve the performance problem with the highly normalised table structure, I proposed the index table solution back during Symposium 2010. A data source can optionally generate an index table, which combines all the required fields into a single table, thus eliminating all joins. As these are just index tables, they can be safely altered or purged without affecting the original table structure. They would be super fast and maintain the flexibility offered by a database, as opposed to static cache-file solutions. These index tables are essentially a less normalised form of the originating tables, offering the performance benefit while avoiding the issues of altering large, monolithic tables.
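
As a hedged illustration of that index-table idea (every table and column name below is invented), a data source might flatten the fields it needs into a disposable table:

    // Rebuild a denormalised index table for one data source.
    // Dropping and recreating it never touches the source tables.
    DB::statement('DROP TABLE IF EXISTS `ds_articles_index`');
    DB::statement('
        CREATE TABLE `ds_articles_index` AS
        SELECT e.id AS entry_id,
               t.value AS title,
               d.value AS description
        FROM `entries` e
        JOIN `fields_data_text` t ON t.entry_id = e.id AND t.field_id = 1
        JOIN `fields_data_text` d ON d.entry_id = e.id AND d.field_id = 2
    ');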

brendo commented 11 years ago

Allen said a lot of what was in my head. The thing that is yet to be determined is how extensions are going to play a part in Next. Having one table that extensions modify is a risk (I know this is not done directly; migrations/ORM would handle it most of the time), but the possibility that installing an extension may corrupt the entire table is still one that I imagine most users would like to avoid.

At the moment, because extensions are responsible for managing their own tables/structures, the risk is mitigated. Sure, your site might come down with an exception because of a failed MySQL query, but generally speaking your data is safe.

An ORM isn't a magic bullet. It shifts the responsibility from individuals to those who maintain the ORM code, so while it's fine to say "the ORM handles it", at the end of the day this will still invoke an SQL query that cannot be optimised. I'm actually anticipating a performance drop when switching to an ORM, but without numbers it's just a gut feeling at the moment :)