golemfactory / concent

Repository for Concent Service sources

Storing contents of Golem messages in dedicated tables #141

Open cameel opened 6 years ago

cameel commented 6 years ago

Create the following tables for storing the contents of the corresponding Golem messages:

Update the subtask table (#140) to replace foreign keys to StoredMessage with foreign keys to specific tables.

Create a base model class to keep the common data like timestamp and signature. Decide if it's better to have an abstract or a concrete model as a base.

All data should be stored in a relational manner. Nested messages should be replaced with foreign keys to other tables (create more tables if needed). Nested lists and dictionaries should be separate tables.

Create helpers for converting Golem messages to model instances and vice versa.
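A minimal sketch of what such a conversion helper pair could look like. All class, field, and function names here are hypothetical stand-ins, not the actual Golem message or Concent model classes:

```python
# Sketch of message <-> model conversion helpers. Every name below is
# hypothetical; the real Golem message classes and Django models differ.

class FakeReportComputedTask:
    """Stand-in for a Golem message carrying the common header fields."""
    def __init__(self, timestamp, sig, subtask_id):
        self.timestamp = timestamp
        self.sig = sig
        self.subtask_id = subtask_id


def message_to_model_kwargs(message):
    """Extract the common fields (timestamp, signature) plus the
    message-specific payload into a dict of column values, suitable
    for something like Model.objects.create(**kwargs)."""
    return {
        'timestamp': message.timestamp,
        'signature': message.sig,
        'subtask_id': message.subtask_id,
    }


def model_kwargs_to_message(kwargs):
    """Rebuild a message instance from stored column values."""
    return FakeReportComputedTask(
        timestamp=kwargs['timestamp'],
        sig=kwargs['signature'],
        subtask_id=kwargs['subtask_id'],
    )


msg = FakeReportComputedTask(timestamp=1500000000, sig=b'sig', subtask_id='s1')
row = message_to_model_kwargs(msg)
restored = model_kwargs_to_message(row)
assert restored.subtask_id == msg.subtask_id
```

The common header extraction is what the base model class mentioned above would centralize; each message-specific helper would only add its own columns.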

kbeker commented 6 years ago

I think we should add a few more tables to store more messages. At the moment we store messages in the database such as ForceGetTaskResultUpload, which contains FileTransferToken and ForceGetTaskResults. Also, RejectReportComputedTask has 3 nested messages: TaskToCompute, which you mentioned above, but also CannotComputeTask and TaskFailure. I have listed below all the tables which I currently think we should implement:

StoredForceGetTaskResultUpload, StoredForceGetTaskResult, StoredFileTransferToken, StoredCannotComputeTask, StoredTaskFailure, StoredVerdictReportComputedTask, StoredForceReportComputedTask

If I find another one to add, I will just add a new comment.

kbeker commented 6 years ago

@cameel What about the tables I mentioned above? I see that the Subtask table has no place for messages such as TaskFailure or CannotComputeTask. Do we want to store the FileTransferToken message or ComputeTaskDef in a separate table?

cameel commented 6 years ago

The issue lists only the tables needed to store the messages that the Subtask table links to, because those are the only messages Concent is supposed to store.

For other messages, e.g. ones kept for audit purposes, it's optional. I also have nothing against creating a table for every message, but we don't have to create them if we're not storing them.

As for nested messages, like TaskFailure, of course these should have tables too. That's what I meant by "All data should be stored in a relational manner. Nested messages should be replaced with foreign keys to other tables (create more tables if needed). Nested lists and dictionaries should be separate tables".

As for ComputeTaskDef, I said in the description that you should have separate tables for dicts, but honestly it's just fine to flatten the structure. Generally, if you have something like this in a message:

{
    'a': {
        'c': {
            'f': 1,
        },
    },
    'b': {
        'd': 2,
        'e': 3,
    }
}

it's fine to flatten it and store it in a single table like this:

{
    'd': 2,
    'e': 3,
    'f': 1,
}

because you can easily recreate the structure when deserializing it.
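A short sketch of what that flattening and recreation could look like, assuming the nesting structure is fixed and known in advance (which is true for a message schema) so leaf keys never collide. The helper names and the path map are invented for the illustration:

```python
# Flattening a nested dict into one level, and rebuilding it afterwards.
# This only works because the schema is fixed: we know where each leaf
# key belongs, so no information is lost by dropping the nesting.

def flatten(nested):
    """Collect all leaf values of a nested dict into a single flat dict."""
    flat = {}
    for key, value in nested.items():
        if isinstance(value, dict):
            flat.update(flatten(value))
        else:
            flat[key] = value
    return flat


# The known schema tells us the path of each leaf key when deserializing.
PATHS = {'f': ('a', 'c'), 'd': ('b',), 'e': ('b',)}

def unflatten(flat, paths):
    """Recreate the nested structure from the flat dict and the path map."""
    nested = {}
    for key, value in flat.items():
        node = nested
        for part in paths[key]:
            node = node.setdefault(part, {})
        node[key] = value
    return nested


original = {'a': {'c': {'f': 1}}, 'b': {'d': 2, 'e': 3}}
flat = flatten(original)
assert flat == {'d': 2, 'e': 3, 'f': 1}
assert unflatten(flat, PATHS) == original
```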

Lists are a different story though. To store a list of something you need a separate table with a foreign key to the original table. And you need a field to indicate the position in the list. That's how you do lists in SQL.
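The pattern above can be shown with a minimal stdlib sqlite3 example (the table and column names are made up; the real models would be Django models backed by PostgreSQL):

```python
# Storing a list relationally: a child table with a foreign key to the
# parent row and a position column to preserve ordering.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE message (id INTEGER PRIMARY KEY);
    CREATE TABLE message_item (
        id INTEGER PRIMARY KEY,
        message_id INTEGER NOT NULL REFERENCES message(id),
        position INTEGER NOT NULL,
        value TEXT NOT NULL,
        UNIQUE (message_id, position)
    );
""")
conn.execute("INSERT INTO message (id) VALUES (1)")
conn.executemany(
    "INSERT INTO message_item (message_id, position, value) VALUES (?, ?, ?)",
    [(1, 0, 'first'), (1, 1, 'second'), (1, 2, 'third')],
)

# Reading the list back in its original order:
rows = conn.execute(
    "SELECT value FROM message_item WHERE message_id = ? ORDER BY position",
    (1,),
).fetchall()
assert [value for (value,) in rows] == ['first', 'second', 'third']
```

The `UNIQUE (message_id, position)` constraint also guarantees no two items can claim the same slot in the same list.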

kbeker commented 6 years ago

@cameel I was thinking about creating a built-in method in the model to convert lists and dictionaries to a plain string and store them in a TextField(), and, the other way around, to convert the string back to lists or dictionaries and pass them to a view or wherever we need them. It might be helpful for storing big, deeply nested dictionaries or lists in one field.

cameel commented 6 years ago

No. The whole point of this task is to avoid storing structured data in one field :) What you propose does not give us anything over the current approach of storing the whole serialized message in a binary field.

There are two big problems with storing data like this: 1) It's a black box from the point of view of the database. If you have e.g. {'a': 1, 'b': 2} stored in a string, you can do only string operations on it. To do anything more you have to fetch the data from the database server into your memory and decode it.

This means that you cannot do queries on your data. You can't for example do a `SELECT` query that finds all records with `a > 5`. Or you cannot tell the database to give you records sorted by the value of `a+b`. You have to load them all into your program, decode them one by one and do the checking and sorting yourself. This is slow and eats up a lot of memory. First, our code is written in Python and, while adequate for web stuff, it's too slow at crunching large amounts of data. The usual solution is to offload the processing to something faster - like a database :). Second, databases can grow very large. What if you can't fit all that data in memory? You have to load it bit by bit. Imagine trying to sort the list of all records while not having enough memory to just go over it all in a loop. There are ways to do it, but I don't think we want to spend time reinventing the wheel when SQL already handles it perfectly.
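The contrast can be demonstrated with a small stdlib sqlite3 illustration (the table, columns, and data are invented for the example):

```python
# Point (1): with real columns the database filters and sorts for us;
# with a serialized blob we must fetch and decode every row in Python.
import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, blob TEXT)")
for a, b in [(1, 10), (7, 2), (9, 1)]:
    conn.execute(
        "INSERT INTO t (a, b, blob) VALUES (?, ?, ?)",
        (a, b, json.dumps({'a': a, 'b': b})),
    )

# The database evaluates these directly on the columns:
big = conn.execute("SELECT a FROM t WHERE a > 5").fetchall()
assert big == [(7,), (9,)]
ordered = conn.execute("SELECT a, b FROM t ORDER BY a + b").fetchall()
assert ordered == [(7, 2), (9, 1), (1, 10)]

# With only the blob column, every row has to come into our memory and
# be decoded before we can filter or sort at all:
decoded = [json.loads(row[0]) for row in conn.execute("SELECT blob FROM t")]
assert sorted(d['a'] for d in decoded if d['a'] > 5) == [7, 9]
```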

2) It has no schema. Every field in the database has its type and constraints (like non-NULL, unique, maximum value, or even custom conditions). The number of fields is also defined beforehand. If you try to insert a row that does not satisfy these constraints, you get an error. And an error is much preferable to accepting broken data. If you store it as a string, you lose all this.
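A quick stdlib sqlite3 illustration of constraints rejecting broken data at insert time (the table, columns, and constraints are invented for the example):

```python
# Point (2): a schema rejects broken data when you try to insert it,
# instead of silently storing it.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE report (
        subtask_id TEXT NOT NULL UNIQUE,
        deadline INTEGER NOT NULL CHECK (deadline > 0)
    )
""")
conn.execute("INSERT INTO report VALUES ('subtask1', 1500000000)")

# Each of these violates a constraint and raises an error instead of
# being accepted:
for bad_row in [
    (None, 1500000000),        # NULL subtask_id
    ('subtask1', 1600000000),  # duplicate subtask_id
    ('subtask2', -5),          # deadline fails the CHECK
]:
    try:
        conn.execute("INSERT INTO report VALUES (?, ?)", bad_row)
    except sqlite3.IntegrityError:
        pass  # exactly what we want: broken data never gets in
    else:
        raise AssertionError('constraint not enforced')
```

A serialized string field would have accepted all three bad rows without complaint.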

Migrating data across schema changes is really hard. Sometimes impossible to do without malforming it. E.g. how do you perfectly migrate a database with a `name` field to one that has `first_name` and `second_name` without breaking data in any corner case? Still, it's much easier when you have a proper schema and can make assumptions about what can't be there. You have a constraint that `name` always contains only a single space? Great, now you can easily split it into two fields. You also don't have to worry that version 1.2.3 had a bug and sometimes called the field `NAme` or that it was missing half the fields. A proper schema lets you catch these problems earlier, when you try to insert data, rather than later when it's too late to fix it.

There is a solution that would let you store the data in a single field without the problems listed in (1). Django supports the PostgreSQL-specific ArrayField and JSONField. But it still does not solve (2). The point of this task is to make our data fully relational so that we can use the features of our relational database.