Bug: Text PK can have collisions

etiology commented 8 years ago

I'll fix this but let me describe the problem and get some info about how the PK is used.

There are two primary key strategies happening in the code base: 1) An auto-incrementing Integer 2) A Text based key that is generated by a random string of length 8

During the tests that I've ran I have found that there is a random UNIQUE constraint failed error that occurs when creating models that use the text based primary key. This makes sense and it would cause issues.

At first I thought it would be simple to normalize the primary key strategy and change everything over an integer since this is part of best practices for keys. However I found that the primary key is actually used in flashing messages to the user.

I would suggest that the models move to using an integer base primary key and we add a uuid to handle the user facing identifier. The is very little chance of getting a collision and it could be created when the model is created. We could then move all of the models over to being consistent with their primary keys. UUIDs look like this:

e9505ffb-9496-4fe6-b558-1407a7386f29

Let me know what your thoughts are on this. Like I said I'll take care of implementing it when I push the tests I'm writing for the models since I can see where they are failing.

kizniche commented 8 years ago

That sounds good. Would you also know how to upgrade the databases with alembic?

kizniche commented 8 years ago

Probably in the wrong issue here (ref to #114), but would a database merger (into one DB) necessitate the users reclone the repo, or could this be cleanly done with the current upgrade system?

etiology commented 8 years ago

I think alembic can make that smoother. I'm a bit out of practice. I was holding off on the db branch until I get that figured out. You've had more practice with alembic lately than I have so I'll have to pick your brains when things get closer.

Sent from my iPhone

On Oct 4, 2016, at 8:12 AM, Kyle Gabriel notifications@github.com wrote:

Probably in the wrong issue here (ref to #114), but would a database merger (into one DB) necessitate the users reclone the repo, or could this be cleanly done with the current upgrade system?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

kizniche commented 7 years ago

@etiology,

I've been going through and making a lot of small changes (which add up to thousands of changes overall) related to PEP8, logging, etc., and closing old issues to clear the way to work on some of these you started a while ago. I came to this issue and have been trying to wrap my head around how to go about upgrading the databases with alembic, and I'm so far stumped as to how to accomplish this.

Essentially, will we replace the 8-character ID with a uuid, then go through and replace all occurrences of any 8-char string with the new uuid, so all references remain intact? I'm no alembic wizard, so I'm having difficulty understanding how to do this.

Also, what do you mean by using an integer for a primary key? Just a very long random integer to prevent potential collisions?

etiology commented 7 years ago

Getting better at alembic is still on my list. So I can't speak much about that part.

So taking an example from the Method model:


class Method(Base):
    __tablename__ = "method"

    id = Column(TEXT, unique=True, primary_key=True)
    name = Column(TEXT)
    method_id = Column(TEXT)
    # ... etc

If I recall the method_id is where the random string was assigned right? If not just correct me. The UUID is just a string so you don't need to convert it, but it would be a great chance to change the type to String and set the nullable to False. I'll also show the primary key as an int:

from uuid import uuid4
# ... other imports

class Method(Base):
    __tablename__ = "method"

    id = Column(Integer, primary_key=True)
    name = Column(TEXT)
    method_id = Column(String, nullable=False,  default=str(uuid4()))
    # ... etc

etiology commented 7 years ago

The next step after that is to move some of the common model routines into the models. The model knows the most about what it holds and how to use it so it's the best place for certain responsibilities.

The easiest example I can think of right now is your User model:


class Users(Base):
    __tablename__ = "users"

    user_id = Column(INTEGER, primary_key=True)
    user_name = Column(VARCHAR(64), unique=True, index=True)
    user_password_hash = Column(VARCHAR(255))
    user_email = Column(VARCHAR(64), unique=True, index=True)
    user_restriction = Column(VARCHAR(64))
    user_theme = Column(VARCHAR(64))

    def __repr__(self):
        output = "<User: <name='{name}', email='{email}' is_admin='{isadmin}'>"
        return output.format(name=self.user_name,
                             email=self.user_email,
                             isadmin=bool(self.user_restriction == 'admin'))

    def set_password(self, new_password):
        self.user_password_hash = bcrypt.hashpw(new_password.encode('utf-8'), bcrypt.gensalt())

    @staticmethod
    def check_password(password, hashed_password):
        hashes_match = bcrypt.hashpw(password.encode('utf-8'), hashed_password.encode('utf-8'))
        return hashes_match

You gave it those set_password and check_password methods which is great because it is exactly something that a user object should know how to do. Maybe there is something like this for the PID controller model or something.

kizniche commented 7 years ago

Methods are handled a little differently than the sensors, PIDs, relays, timers, graphs, and LCDs.

For all but methods, the 'id' is the where the unique id is stored to distinguish different controllers. For methods, the 'id' is still used to distinguish between, modify, and delete entries, but the 'method_id' is used to associate entries. So, if a method is created, an entry will be made with a random 'id' and random 'method_id' and will contain basic info about the method (name, type, etc.). Then, when a time period/point is added to the method, an entry is created with a random 'id' but the 'method_id' will match that of the previous parent entry's 'method_id'. So, 'id' should be unique, but 'method_id' should not be.

etiology commented 7 years ago

Is method_id a foreign key then?

kizniche commented 7 years ago

Yes, it will reference the 'method_id' of the only entry with the method_order of 0. Here's some data to illustrate.

mycodo sql example methods

kizniche commented 7 years ago

I've been thinking about what would be necessary for a successful database upgrade. The current IDs from each table will need to be searched for and replaced throughout the database to preserve references. Then the column type needs to be changed, then assigned the new unique integer.

However, I found sqlite3 column types can't be altered. So, unless there's a new method I haven't discovered, I'm coming to the conclusion that having users create a new database will probably be necessary.

I believe this should be a major version number change, to prevent users from automatically upgrading to it from the lower major version number. I envision the last version release of the lower major version (e.g. 4.1.20) would present a new option on the upgrade screen, allowing a major version upgrade (5.0.0) but warning them that their database will be created anew and no settings will be transferred over.

I'm trying to think of any other database structural changes that would be needed or would be useful, but I'm just not that familiar with database design. Are there any other changes you can think of that should be included in this update?

etiology commented 7 years ago

I would take the time to rethink how your database objects are related to each other and add foreign key relationships. If you are considering having users setup new settings it gives you the freedom to design the database that you want. Look for areas in the code where relationships between database models is being resolved in software instead of in the database.

I may be mistaken but it looks like there is a relationship between the Relay model and the RelayConditional models. I see in controller_relay.py the following:

    def check_conditionals(self, relay_id, on_duration):
        conditionals = db_retrieve_table(MYCODO_DB_PATH, RelayConditional)

        conditionals = conditionals.filter(RelayConditional.if_relay_id == relay_id)
        conditionals = conditionals.filter(RelayConditional.activated == True)

        if self.is_on(relay_id):
            conditionals = conditionals.filter(RelayConditional.if_action == 'on')
            conditionals = conditionals.filter(RelayConditional.if_duration == on_duration)
        else:
            conditionals = conditionals.filter(RelayConditional.if_action == 'off')

        for each_conditional in conditionals.all():
            message = None
            ...

I might be wrong about this specific object because I don't quite understand what the relay conditional class is doing, but it looks like it could be a one-to-many relationship between Relay and Conditionals. This relationship can be created in the database with a foreign key in the Conditionals table:

class RelayConditional(Base):
    __tablename__ = "relayconditional"

    id = Column(Integer, primary_key=True)
    relay_id = Column(Integer, ForeignKey(relays.id), nullable=False)
    name = Column(TEXT)
    activated = Column(BOOLEAN)
    #... more conditions

Now you can do this instead:

def check_conditionals(self, relay_id, on_duration):
    my_relay = Relay.query.filter_by(id=relay_id).first()
    for condition in my_relay.conditionals:
       # ... do stuff

It would be a large change but it could streamline the code, which makes it easier to manage and work with. It's worth daydreaming about. Does it sound right that Relays could have assigned PIDControllers that have their own set of conditional values? Well that's a relationship that you can make in the database. It makes the objects more useful and it prevents things like database rows containing conditional values that are not referenced or attached to anything.

I'm sure there are tons of relationships in there that could be useful to implement.

kizniche commented 7 years ago

The conditionals are a bit complex. The relay conditionals, in particular, have an "if relay" (if this relay acts) and a "do relay" (then act on this relay). The conditional can also either be active or inactive. The current code is finding only activated conditionals for the particular "if relay" that just changed its state (either turning on or off), then further filtering conditionals with the "if action" (whether the "if relay" turned on or off). This will ultimately yield only relay conditionals that match the state (such as "relay X turned off", or "relay Y turned on for Z seconds"). There may be multiple conditionals that match these criteria, so it's essential to filter for all of them instead of returning the first. The action of these (which can be a number of things that happen) are then executed. I thought it would be easiest to perform the filter on the database query instead of in a loop.

I'm intrigued by database relationships, primary/foreign keys, and how to effectively use them. I'm still trying to wrap my head around how they can be best used in this scenario. I don't believe your example is using the relay conditional table correctly, but I'm also not experienced enough to know what should be done. I'll have to spend a while experimenting with it to understand.

kizniche commented 7 years ago

Also, when the databases are eventually merged into one, and flask is using flask-sqlalchemy, how should the daemon access the database? With the sqlalchemy module that's currently being used? Would using the flask-sqlalchemy models cause issues with that? Again, excuse my limited understanding of databases.

kizniche commented 7 years ago

I just pushed a database model I've been working on. It includes the settings and user tables. This is what I'd like to release with Mycodo 5.0, since a new database will need to be created to use this.

This model should permit moving over to flask-sqlalchemy, to resolve the threading errors that plague the HTTP log, which I think will bring performance improvements.

Also, if I'm understanding foreign keys correctly, we can also refactor a lot of the code to be more efficient at referencing IDs and filtering data in the database to obtain the correct entries/rows.

kizniche commented 7 years ago

I have a new branch I've been working on, using flask_sqlalchemy. I've finally figured it out and am going through the entire software, refactoring with sql_alchemy. I'll push it a little later once I have it all working.

I just got it working up to the login page (after creating database, populating default tables with default entries, and crating an admin user), and am currently on a roll with a new cup of coffee.

etiology commented 7 years ago

Nice work I cant wait to see it. I think flask sqlalchemy is the way to go. It's so much easier to manage and it just plays well.

kizniche commented 7 years ago

Wow, literally 8 seconds after you said that I pushed the new branch ;)

Not all pages are working (I noticed Methods errors for example), but the frontend and backend are working and using the same model, with Flask using flask_sqlalchemy.

etiology commented 7 years ago

I really like the Relay class. It's got everything you think a relay should do and it has state because it's a database object

etiology commented 7 years ago

I could see the Cameras being a single 'Camera' class with a 'camera_type' attribute.

That could even be a foreign key that points to a 'CameraType' table that has a 'name', maybe a recommendation like brand or product number. When you are selecting your camera type it tells you which things are supported.

Just thinking out loud here so you don't really need the extra table but the idea struck me.

kizniche commented 7 years ago

I really like the Relay class.

@Cabalist is the one who actually made the Relay class functions, if I recall correctly.

I'm sure I'm doing some things improperly, but it's at least a start. And it's actually working!

I had a difficult time at first getting both the flask app and the daemon to use the same database model. I attempted to use this method to be able to use the same query syntax (e.g. 'Class.query.first()') in the daemon (outside the flask app), but I wasn't understanding it enough to get it working.

I could see the Cameras being a single 'Camera' with a 'camera_type' attribute.

Ah, yes! That's a great idea. I'll play around with it when I'm more awake.

etiology commented 7 years ago

That is a great article. I'll have to play with it but that's great.

for the camera class combination I was thinking:


class Camera(CRUDMixin, db.Model):
    __tablename__ = "camera"

    id = db.Column(db.Integer, unique=True, primary_key=True)
    relay_id = db.Column(db.Integer, db.ForeignKey('relay.id'), default=None)  # Relay to turn on while capturing
    hflip = db.Column(db.Boolean, default=False)  # Horizontal flip image
    vflip = db.Column(db.Boolean, default=False)  # Vertical flip image
    rotation = db.Column(db.Integer, default=0)  # Rotation degree (0-360)
    cmd_pre_camera = db.Column(db.Text, default='')  # Command to execute before capture
    cmd_post_camera = db.Column(db.Text, default='') # Command to execute after capture
    camera_type = db.Column(db.Integer, db.ForeignKey('CameraType.id'), nullable=False)

class CameraType(CRUDMixin, db.Model):
   """ Holds camera description or kind: 'still', 'timelapse', 'stream', etc """
    __tablename__ = 'camera_type'

    id = db.Column(db.Integer, primary_key=True)
    type_name = db.Column(db.String, nullable=False, unique=True)  # 'Still', 'Timelapse', 'Infrared', etc

# for extra ease you might want to add a CRUDMixin:
class CRUDMixin(object):
    """
    Basic Create, Read, Update and Delete methods
    Models that inherit from this class automatically get these CRUD methods
    """

    def save(self, session):
        """ creates the model in the database """

        try:
            session.add(self)
            session.commit()

            return self
        except Exception as error:
            session.rollback()
            logging.error("Unable to save {model} due to error: {err}".format(model=self, err=error))
            raise error

    def delete(self, session):
        """ deletes the record from the database """
        try:
            session.delete(self)
            session.commit()
        except Exception as error:
            # many things can go wrong during the commit() so we have a broad except clause
            logging.error("Failed to delete '{record}' due to error: '{err}'".format(record=self, err=error))

The CRUDMixin allows your database models to have save() methods along with any others you put in there. It makes it really easy to work with these models:

""" To make things easy lets assume that the camera_type 
      comes from a constants file that just has a dictionary of types """

# namedtuple would be good here too
CAMTYPE = dict(still='Still Camera', time_lapse='Time-lapse', stream='Stream')

# ----- Normal SQLAlchemy Style -VS- CRUDMixin ------ 
from flask import current_app

# without a CRUDMixin:
new_camera = Camera(camera_type=CAMTYPE['time_lapse'])
try:
    db.session.add(new_camera)
    db.session.commit()
except Exception as err:
    msg = "Error occurred when saving {cls} model: {err}"
    current_app.logger.error(msg.format(err=err, cls="Camera")
    session.roll_back()

# do that for each model you want to save.  What if you wanted to make sure you logged this?
# How many copy & paste logging statements would you need?  

# -- VS --

new_camera = Camera(camera_type=CAMTYPE['time_lapse']).save()

kizniche commented 7 years ago

That's nice! It will save a lot of time and code. I'll work on restructuring the database a little more once I have the majority of things working again.

Interestingly, when testing the use of the uuid4 in some of the default values of the database tables, I found something odd:

class PID(db.Model):
    __tablename__ = "pid"

    id = db.Column(db.Integer, unique=True, primary_key=True)
    method_id = Column(String, nullable=False,  default=str(uuid.uuid4()))
    ...

If table entries were created (committed) inside a loop (when adding multiple sensors, relays, etc.), all entries would have the same uuid. It appeared that uuid4() was only being called once and that value was being applied to every new entry.

I found this solution that works nicely to create a unique uuid:

class UUID(types.TypeDecorator):
    impl = MSBinary

    def __init__(self):
        self.impl.length = 16
        types.TypeDecorator.__init__(self,length=self.impl.length)

    def process_bind_param(self,value,dialect=None):
        if value and isinstance(value,uuid.UUID):
            return value.bytes
        elif value and not isinstance(value,uuid.UUID):
            raise ValueError('value {val} is not a valid uuid.UUID'.format(val=value))
        else:
            return None

    def process_result_value(self, value, dialect=None):
        if value:
            return uuid.UUID(bytes=value)
        else:
            return None

    @staticmethod
    def is_mutable():
        return False

class PID(db.Model):
    __tablename__ = "pid"

    id = db.Column(db.Integer, unique=True, primary_key=True)
    unique_id = db.Column(UUID, nullable=False, unique=True, default=uuid.uuid4)
    ...

The main reason a unique UUID is necessary is for influxdb. Entries (sensor measurements, relay durations, pid setpoints) are associated an ID tag when added to the influxdb database. If the id was used instead of the unique_id, then there would be the possibility of using the same ID used in the past.

For instance:

Create & Activate sensor, id=1, unique_id=1883bfee-9418-4f60-9cb9-27bdeb6f9c56
Data begins storing data in influxdb, tagged with an id
Deactivate sensor, delete and recreate influxdb database
Create & Activate sensor, id=1, unique_id=75d224bb-b29b-4530-9af7-a0330f0900cf

If id was used, then the data in influxdb would have no way of distinguishing the data from each other. Worse, if the first sensor values were float and the second sensor values were int, then influxdb would refuse to store the int values when there were initially float values saved. Using unique_id ensures when data is queried/written to influxdb, it is able to be distinguished if a situation were to occur where the id increment is reset by a database wipe.

etiology commented 7 years ago

I should have remembered that for a default field to work in a sqlalchemy column you need to assign it to a function or else the value doesn't always change.

So for the default=str(uuid4()) to work you need to wrap it in a function and pass that into the default instead:

def set_uuid():
    return str(uuid4())

# ... then in your model...

some_unique_text =db.Column(db.String, default=set_uuid, unique=True)

etiology commented 7 years ago

It's just like setting a datetime.now() call

kizniche commented 7 years ago

So this is what I'm now unsure where to go... Should unique_id take the place of all ids for identifying database entries? Or should this only be use for sensors, relays, and pids, since they are the only things needing a truly unique id?

And I suppose I should cut out the UUID class type in favor of just using a function to call uuid4?

etiology commented 7 years ago

There is definitely a debate between people who feel that the primary key of a table should be an int and left alone vs those who feel that a field that is known to be unique could be the primary key.

I argue for the former. The primary key field should be an an int and if you need a serial number like identifier then create a field that holds that random string. Here are some random thoughts to why integer based primary keys feel better to me:

If all primary keys are integers then there is a common behavior between all models. (think back to the zen of python: Special cases aren't special enough to break the rules.)
It carries an implied order to which records were created before others
I have several parent classes that I recycle that make use of this.

The whole function call being passed to default might also work for your original code if you wrap your random.choice() call in a function and assign the function (Not the call) to the default.

That suggestion you found looks like the person didn't understand what was happening and built way more code than what was needed to generate random uuids. I would still test what I suggested to make sure they are unique. Toss it in a for loop and make a million records. If it does't throw and error I'd say that is good proof that you are safe using it.

kizniche commented 7 years ago

That makes sense. Since I only need a uuid for 3 tables, there's also no need for unnecessary overhead in the other tables.

One thing I can see potentially causing problems is if all rows are deleted in a table, the id increment index resets to 1. If a specific entry is set up to reference another entry's id, and if all entries are then deleted and recreated, the reference may still be valid, but undesired.

Is there a way to prevent this?

Also, interesting is the time difference from entering entries different ways. Here's something I made to time and test making entries:

try:
    num_entries = 1000
    first_run = True
    for x in xrange(2):
        logger.error("Starting SQL uuid generation test {num}: "
                     "{n} entries...".format(num=x,
                                             n=num_entries))
        before_count = PID.query.count()
        a = datetime.datetime.now()
        if first_run:
            for _ in xrange(num_entries):
                db.session.add(PID())
                db.session.commit()
        else:
            for _ in xrange(num_entries):
                db.session.add(PID())
            db.session.commit()
        after_count = PID.query.count()
        b = datetime.datetime.now()
        delta = b - a
        logger.error("Completed: {time}, " \
                     "New entries: {new}, " \
                     "Total entries: {tot}".format(
                        time=delta,
                        new=after_count - before_count,
                        tot=after_count))
        first_run = False
except Exception as msg:
    logger.error("Error creating entries: {err}".format(err=msg))

Here's the relavant part:

if first_run:
    for _ in xrange(num_entries):
        db.session.add(PID())
        db.session.commit()
else:
    for _ in xrange(num_entries):
        db.session.add(PID())
    db.session.commit()

And the output:

2017-02-08 21:57:28,415 Starting SQL uuid generation test 1: 100 entries...
2017-02-08 21:58:14,188 Completed: 0:00:45.772291, New entries: 100, Total entries: 1407
2017-02-08 21:58:14,247 Starting SQL uuid generation test 2: 100 entries...
2017-02-08 21:58:15,786 Completed: 0:00:01.539295, New entries: 100, Total entries: 1507

2017-02-08 21:59:36,009 Starting SQL uuid generation test 1: 1000 entries...
2017-02-08 22:06:23,003 Completed: 0:06:46.993685, New entries: 1000, Total entries: 2507
2017-02-08 22:06:23,067 Starting SQL uuid generation test 2: 1000 entries...
2017-02-08 22:06:29,252 Completed: 0:00:06.184640, New entries: 1000, Total entries: 3507

One is clearly more efficient.

etiology commented 7 years ago

yeah closing the session all of the time with the commit is Soooo costly. It's much faster to add a ton of stuff to the session and then commit at the end.

That's a great example and a good observation. It's the kind of test that will change the way you code. The difference is huge.

kizniche commented 7 years ago

Hmm. I seem to be hitting a wall with the test:

2017-02-08 23:45:05,150 Starting SQL uuid generation test: 1000000 entries...
2017-02-08 23:46:50,403 Run Time: 105.11 sec, New entries: 25000, Total entries: 25000
2017-02-08 23:48:41,932 Run Time: 111.38 sec, New entries: 25000, Total entries: 50000
2017-02-08 23:50:44,813 Run Time: 122.68 sec, New entries: 25000, Total entries: 75000
2017-02-08 23:52:53,157 Run Time: 128.12 sec, New entries: 25000, Total entries: 100000
2017-02-08 23:55:12,568 Run Time: 139.16 sec, New entries: 25000, Total entries: 125000
2017-02-08 23:57:32,203 Run Time: 139.36 sec, New entries: 25000, Total entries: 150000
2017-02-09 00:00:41,505 Run Time: 189.00 sec, New entries: 25000, Total entries: 175000
2017-02-09 00:05:11,919 Run Time: 270.08 sec, New entries: 25000, Total entries: 200000
2017-02-09 00:17:36,167 Run Time: 742.30 sec, New entries: 25000, Total entries: 225000
2017-02-09 00:42:47,685 Run Time: 1499.14 sec, New entries: 25000, Total entries: 250000

At this point I went to sleep and woke to find the flask app had crashed. The database had reached ~40 MB. Here's the code that was running:

def test_sql():
    try:
        num_entries = 1000000
        factor_info = 25000
        PID.query.delete()
        db.session.commit()
        logger.error("Starting SQL uuid generation test: "
                     "{n} entries...".format(n=num_entries))
        before_count = PID.query.count()
        run_times = []
        a = datetime.now()
        for x in xrange(1, num_entries + 1):
            db.session.add(PID())
            if x % factor_info == 0:
                db.session.commit()
                after_count = PID.query.count()
                b = datetime.now()
                run_times.append(float((b - a).total_seconds()))
                logger.error("Run Time: {time:.2f} sec, "
                             "New entries: {new}, "
                             "Total entries: {tot}".format(
                                time=run_times[-1],
                                new=after_count - before_count,
                                tot=PID.query.count()))
                before_count = PID.query.count()
                a = datetime.now()
        avg_run_time = sum(run_times) / float(len(run_times))
        logger.error("Finished. Total: {tot} entries. "
                     "Averages: {avg:.2f} sec, "
                     "{epm:.2f} entries/min".format(
                        tot=PID.query.count(),
                        avg=avg_run_time,
                        epm=(factor_info / avg_run_time) * 60.0))
    except Exception as msg:
        logger.error("Error creating entries: {err}".format(err=msg))

I'm not sure what's wrong, why it began taking longer to create entries. python ram use was between 40 and 60 MB last I saw it. Here's an earlier screenshot with top running in the middle.

sql test 02

It's not something I'm going to invest a lot of time into figuring out, but thought it was interesting enough to share.

kizniche / Mycodo

Bug: Text PK can have collisions #115