Questions: August 7, models and relationships

hey @Koeng101!

I hope that you are having a good week! This is the first update for FreeGenes, and it includes a lot of questions - please take the time that you need to look over. These questions pertain to models and relationships, and are important to establish the base for the application. I've ported most of the models into Django. and this first round of questions pertains to that! I'm using GitHub issues so we can keep the history of our discussion, and better include inline code. The items that we can discuss for the meeting Monday (and after in detail over email) are:

The questions included below
I'll need your and Marc's help to write user stories to help design the interface and interaction with it. I gave a shot at a simple one to create an order to give you an example.
An export from the current database so that I can write migration scripts, and also start developing views with actual data.

Without further adieu, here are questions about models! If Marc has a GitHub, please feel free to ping him on this issue so he is notified.

Notes and README

A few of your models had fields for notes, e.g.,

notes = models.CharField(max_length=500)

And then others description:

And then collection has README

readme = models.CharField(max_length=500, required=True)

Are these things fundamentally different, and what is the purpose? For description it makes most sense - we'd want to store that to tell the user what the thing is, and ideally we can populate it from somewhere automatically. For example, Singularity Hub grabs descriptions for collections directly from the GitHub repos they represent. For notes, it's generally not great practice to have a random "notes" field. Is README is along the same lines? I'd like to clarify the purpose of notes, and if appropriate, change some to description fields, or something more specific that we want to capture.

Definitions

Across models, I'd like to have a basic definition in the docstring. Some are obvious, but others not! If you want to write them out in this email, or pull request to change the models file, we can do whatever works best / is easiest for you. This is important for future developers to read the code, and have an understanding of things. Here is an example of my best guess for an Author.

Another kind of definition is one associated with a multiple choice field. Across models, many of the schemas include variables that include choices for things. For example, the user can choose a part type from these:

    PART_TYPE = [
        ('full_promoter', 'full_promoter'),
        ('promoter', 'promoter'),  
        ('rbs', 'rbs'),  
        ('cds', 'cds'),
        ('vector', 'vector'),
        ('partial_seq', 'partial_seq'),
        ('linear_dna', 'linear_dna'),
        ('plasmid', 'plasmid'),
        ('terminator', 'terminator'),    
    ]

Where the first entry is the key for Django, and the second is a verbose description. Currently, across models I've just duplicated the key, but ideally (since people will be reading the second of each) we would want to have a verbose description. I don't have this knowledge, so I'm hoping you can look over the file and fill them in. For example (and this is a guess):

    PART_TYPE = [
        ('full_promoter', 'The full DNA sequence that defines where transcription of a gene by RNA polymerase begins. '),
...

Model Questions:

This is a random listing of questions about models:

What is the maximum length for a sequence? (I currently have set to 250, with the dna_string validation)
What is the genbank data structure for, under Part?
Also under Part, for fields vector, primer_for, primer_rev, translation, ip_check, and vdb, it would be good to have descriptions in comments for future folks. I could guess, but I'm not sure what an ip_check is, or what vdb stands for. The code address is here.
Can a Part only belong to one collection?
What are the x,y,z coordinates of a container for?
what exactly is a container module?
modules have schemas - could you explain this?
what is a breadcrumb under a plate (I'm only familiar with breadcrumbs on web pages, for navigation)
what is a Plate thaw_count?
what is "index_for" for a sample and "index_rev"? I assume forward and reverse, so I'd want to add comment to those fields about what that means.

Well Location Validation

I noticed that you have a schema to validate the address of a well based on a plate length and height (do I have this correct?) This would work if the entire application, forever into the future, will always use consistent dimension - but what if this changes? What I'd like to suggest to do is to have a post save signal that, when a plate is saved, generates Well objects for it,. This assumes that we should have wells represented in the database for every plate, if this isn't the case we will need a different idea. This also means we should have a length and height added as parameters to the Plate model, with defaults being the ones you originally set in your function (16 and 24). Then we don't need to validate the Well address - because it's generated when the plate is first saved and then not changed. Do you think this is a reasonable thing to do?

Model Relationships

When we define relationships between models in Django, it's customary to also define what happens when the parent is deleted. For example, if I have the following

collection = models.ForeignKey('Collection', on_delete=models.CASCADE)

The "CASCADE" says that "when the Collection is deleted, also delete this object. if I have "DO_NOTHING":

collection = models.ForeignKey('Collection', on_delete=models.DO_NOTHING)

It means exactly that :) So basically what I need to confirm are all of these relationships in the models. I'll list them below. You can say true/false or yes/no for all.

when a collection is deleted, so are the associated parts
when an author is deleted, his parts are deleted
when a parent collection is deleted, so are the children collections
when a parent container is deleted, so are the children
when a container image (file?) is deleted, do nothing to the container (i.e., a container can exist without the image file)
when a module's container is deleted, so is the module
when a module is deleted that has other modules defined as right/left mounts, the right/left mounts are also deleted (what does this mean, in a physical sense?) What is a mount?
when an agreement file for an MTA is deleted, so is the MTA
when a sample that has other samples derived from it is deleted, the derivations are also deleted
when a sample's part is deleted, the sample is also deleted.
when a plate is deleted, wells belonging to it are also deleted
when an organism is deleted, the associated well is not (the well can be emptied)
when a schema is deleted, associated protocols are deleted
when a plan parent is deleted, the child is also deleted
when a plan operation is deleted, the plan is not touched
when a user is deleted, his/her associated operations are too
when an MTA is deleted, we don't delete the operation (it could be that it needs to be changed, etc.)

Samples:

I'm not totally clear on what a sample is, but from the other models I can deduce that it's something that goes into a well? So my question is - can the same sample exist between parts? (e.g., are part samples ManytoMany relationships or Foreign keys?)

There is a field called "evidence" - what is sample evidence? I noticed that the labels defined in the schema:

    SAMPLE_EVIDENCE = [
        ('Twist_Confirmed', 'Twist_Confirmed'),
        ('NGS', 'NGS'),
        ('Sanger', 'Sanger'),
        ('Nanopore', 'Nanopore'),
        ('Derived', 'Derived')
    ]

Don't match a comment that you wrote (before or after)? that makes a distinction between capitalization and "outside folks" eg:

    # ngs, sanger, TWIST - capitals denote outside folks

Does this mean that all options are valid, either lowercase or uppercase? So instead of "Sanger", there should be "SANGER" (outside folks) and "sanger?"

If this is the case, I'd like to suggest we have a consistent, single set of labels (capitalization set or not important) and then a boolean to indicate outside_collaborator. Just curious, why is this information important? Is there different protocol for within / outside of Stanford?

MTAs

It looks like the original setup had MTAs hosted alongside other remote files (e.g., in some cloud storage). If these come down to PDF documents (and similar) what would be the rationale to store them in cloud storage vs. just alongside the server? Either could work, and I'm interested to hear your thoughts if one is preferable.

Speaking of MTAs - the current setup has the "agreement_file" referencing a foreign key to a File object, which (based on the previous model) would be uploaded to some cloud storage. In the case that we can store the file on the server (if it's just PDF) and in the case that we don't need to share file's between MTAs (it's a 1 to 1 relationship) my suggestion here would be to remove the ForeignKey:

agreement_file = models.ForeignKey('Files', on_delete=models.CASCADE)

And model the file as a standard Django file object.

PlateSet vs. Distribution

I noticed that we have two layers of "things to send out to people":

a plateset is a group of plates
a distribution is one or more platesets.

I'm wondering why there is the need to have this extra level of grouping, for example, why not just have platesets represented in orders, and remove the distribution type? Or better, in the case that we want to distribute more than sets of plates, have a general distribution model that uses ContentTypes to allow for different kinds of objects to be represented. If it's the case that the only thing being sent out is groups of plates, we might not need this extra layer.

Organism Control

You had a comment about "organism control" that I wanted to ask about -

    # organism = db.Column(db.String) # IMPLEMENT ORGANISM CONTROL

Is this something we can do now, or a future TODO (in which case I'll create an issue on the repository for it).

Wells

The original model had samples required for wells, but why is a sample required (i.e. couldn't a well be empty)? I modeled the Sample to point to a well instead, and if we delete the sample we don't delete the well. In this case, it shouldn't be required to allow for an empty well.

# Volume is listed as required - where does it come from? If it's added
# when we fill a well, it shouldn't be required. If it's required,
# it would need to be generated by plate post_save, so the plate would
# need to include it's volume per well. The same is true for media.

Schemas

I noticed that you modeled a Schema as it's own model in database, and then for Protocols you linked to one. It would be logical to do this if schemas are shared between protocols (like templates) but it wouldn't make sense to have this if a schema is like a snowflake - every one different. We could just use a JSONField (if json) or text or file for something like that.

Protocols

Is there any reason to not have a boolean for the status of a protocol (e.g., something like is_executed=True/False, and then if False, we assume it's planned). Also, for protocol fields required, one of the fields was "protocol" but I don't see it:

    # protocol_required = ['protocol','schema_uuid']
    # QUESTION: for protocol required you had protocol but I don't see a field. Did you mean plates? or the data?

Should it be something else?

Plans

I noticed that Plans have similar statuses to Protocols - Executed, Planned, etc. Should there be a common set of statuses between these (and other models) that are shared?

    # These are similar to plate statuses - are these global statuses that should
    # be present across protocols, plates, samples, wells?
    PLAN_STATUS = [
        ('Executed','Executed'), 
        ('Trashed','Trashed')
        ('Planned','Planned')
    ]

And what's the rationale for "Trashed" - wouldn't we just delete the object? I have a cool setup to allow different objects to be included in plans, I hope it works! :)

Next Steps

I have quite a bit to work on before needing any of this feedback, user stories, or data export, so please no rush! I am going to be thinking about different authentication backends, and what will make sense for development, vs. admin login, vs. user login. Since orcid-id and institution are present in some fields, it might make sense to try something like Incommon or just login with orcid id. After this round of work and testing authentication, I'll have a sense of the user stories and can start on the front end views. Then after that we can discuss your current shipping / order methods and how we can make something work that doesn't store PI in the database. But anyway - I'm getting ahead of myself! For now - don't worry about the next steps, just think about the questions in this issue. Please take all the time that you need, and ask for any clarification that you need! If you want to tackle one chunk at a time, when you have time, in the comments below, that works for me :) Also, if you'd prefer many smaller issues (and then organization with a Milestone) I can do that for next time.

Notes and README

Notes are on physical objects in lab, and basically function to replace sticky notes (if you look at modules it has some examples). Readme is much more like a description, so doesn't really change too much over time.

Definitions

I'll do a pull request later for docstrings.

Also will for the verbose descriptions on the multiple choice fields.

Models

Sequence

Sequence at MINIMIUM needs to be ~10,000. I prefer this to be much larger, however, and would want at least ~100,000. Nearly none of our genes will fit in 250. In the most ideal world, sequence field on an organism would be ~4,000,000, but I know that isn't very practical.
Genbank is a bit odd, it's just some JSON I generated from previous genbank files about each part. The goal is to go back later and implement JSON <-> genbank format, but I haven't gotten there yet. You can consider it historical data.
On my list for pull. vbd for one can be removed, but all others actually mean something, and ip_check has some odd wording that is required for legal reasons
Originally I had a Part only belong to one collection for simplicity, but I think maybe we should remove that requirement. I would like for a part to be able to belong to many collections.

Containers and modules

The x,y,z coordinates are the meter coordinates for the locations of things in lab. You can imagine that on an overlay of the lab, so a new person can figure out where things live. I figured out a way to do this, but is completely non-essential right now.
A container module basically represents a tree view of lab. Inside of a lab is a room, inside of that room is a freezer, inside of that freezer are shelves, inside of those shelves are racks, and inside those racks are plates. For example, see here https://api.freegenes.org/containers/tree_view_full/ . The idea is to be able to tell an end user "please grab this plate from this specific location and move it to this other specific location"
Modules are physical lab capabilities. From a protocol-generation perspective side (protocol in terms of a lab protocol, or lab procedure), you can query for a lab's capabilities and then generate a protocol that fits their capabilities. If I have an OpenTrons available with a pipette, it will make a protocol for an opentrons, if I have a human available, it will make a protocol for a human. This, I believe, is important further down the line to have for general-use protocols, and so have some simple implementations here. Right now, there are magdeck modules, pipette modules, and tempdeck modules, each of which have a different JSON schema.

Plates and samples

Breadcrumb is the legacy implementation of the second point where I just stored the information as a string. It is used as navigation just like in a web page, except in the real world.
It is very important to track the number of times a plate has been frozen and thawed. Each freeze thaw damages the cells, until eventually the plate is unusable.
"index_for" and "index_rev" have a pretty specific meaning that I can write in or can talk to you about on the phone. They are used for samples we want to sequence, and although we don't need them often, we do need them to automate sequencing. They do stand for "Index forward" and "Index reverse".

Well Locations

Completely reasonable, though we would be generating a lot of extra wells. I had the idea of doing that originally but never got around to it.

Model Relationships

FALSE: if a collection is deleted, all parts remain
FALSE: authors CANNOT be deleted unless they have no parts
FALSE: if a parent collection is deleted, all children remain
TRUE: containers usually exist without images
FALSE: If a module's container is deleted, it gets moved to it's parent. The root may not be deleted
TRUE: If a module gets deleted, it is removed from the mount. The mount is the physical place the module is on on the robot (it is necessary for generating protocols)
TRUEish: We never want to delete an agreement file, but if we absolutely have to, the MTA is deleted.
FALSE: A sample with derivations cannot be deleted
FALSE: A part with samples cannot be deleted
FALSE: A plate with wells cannot be deleted. HOWEVER, you can change the container of the plate to be in the trash, and so it won't be searched on anymore, but simply exist for historical tracking.
FALSE: Organism with wells cannot be deleted
FALSE: Protocols exist without schemas. Schema deletion is technically allowed, but not unless an admin does it or something
TRUE: if a parent plan is deleted, the children plans are meaningless. However, this can only be done with un-executed plans. Once a plan has been executed, it cannot be deleted.
FALSE: An operation with plans cannot be deleted
FALSE: Their associated operations are not touched
TRUE: Could be that is just needs to be changed

As you can tell, there is a large bias towards not deleting things, mostly because physical things continue to exist in the world.

Samples

In the biological world, we can have a piece of DNA that is hypothetically matches what we have on a computer, but most of the time we don't really know. A sample is simply a part that we have some level of confidence exists in the real world.

For example, if we have some frozen bacteria, and scrape a little off and put it into a new tube to grow overnight, the new growth is not the same sample as the frozen bacteria (maybe there was some contamination, or maybe we bottlenecked the population so there are more mutations). We think it's status is confirmed, but the only evidence we have is that it was "Derived" from a confirmed sample.

However, if now we take that new growth and split it between 10 wells and freeze them, all those new frozen bacteria are the same sample (they theoretically are all the same).

Evidence

At very best, we have sequencing data for a sample in our lab. This is the "evidence" we have for the state of a sample. The next best is someone else sequencing the sample (in which we are often not privy to the data), and the worst is if the evidence of the state of a sample is just that it is Derived from something else.

Yes, a boolean for outside_collaborator works (that is sort of what I was thinking with vendor).

Importance: Each sequencing method is different (Sanger kind of sucks, and so I barely trust it, while NGS is very strong, and so I completely trust it) which is important. Sequencing providers we also have more or less confidence in: In my professional experience (of both ordering and watching others order) sometimes we get samples that just aren't that we asked for. If we decide to resequence some samples from X and see a 50% mutation rate, we know that we should probably check out the rest of the samples from X.

MTAs

Files are stored in cloud storage right now because originally, I stored sequencing files there, and decided to store the rest of the files in the same way because it worked. However, I moved sequencing files out of the main app / database into a different service.

I think now storing files within the database makes the most sense, because we basically just have to store small text files and pdfs.

PlateSet vs Distribution

A plateset is actually just a collection of identical plates. A distribution is a collection of platesets.

When you order a distribution, you are technically ordering one plate from each one of the platesets in the distribution. Only once a shipment comes around does it become concrete which plates are physically getting sent out.

A the minimium, a distribution can be a single plateset. At most, a distribution can be all platesets (and there may be overlap)

The only thing that is being sent out is 1 or more plates, but I think the abstraction is useful.

Organism Control

That comment and the "organism" column may be deleted now that organism_uuid is there. Just hadn't gotten around to that either.

Wells

Yes, it should not be that samples are required for wells, I'm not sure why that is in there.

Schemas

Schemas are shared between types of protocols, and so aren't supposed to be special snowflakes.

Protocols

I messed up on that requirement, my bad. Should require data, not protocol.

The idea for the statuses is that there could be other statuses like "Failed", "Interrupted", "Trashed" or the like. For now, a boolean works.

Plans

Plans and Protocols are the 2 I think share that, and it actually should be possible to just not store status in protocol and instead store it in the plan.

My hope is that even if a plan is trashed it can still be tracked! What if we executed a few plans in an operation but decided to stop? The children are trashed, but the others are executed, and we can see exactly where we stopped.

Next steps

Next call let me know of our options for login / authentication. I'd like it to be open source if possible, but understand if that's not possible. Cheers!

Thanks @Koeng101 this looks fantastic! I'm closing up shop for today (need to make dinner) but I'll start on this first thing tomorrow. For authentication, I thought it would be easy enough to allow for the deployment to choose from any number of oauth2 (e.g., think Orcid, GitHub, Google, Twitter, etc.) or an institutional login (with SAML, requires more setup). Orcid would be a good fit for OAuth I think because we want to keep researcher orcid ids anyway. I'm also going to look into some of the federations that manage SAML configurations, haven't done that yet! For testing, likely I'll just plug in something easy like GitHub or Twitter.

Updates August 13th

@Koeng101 I've addressed your comments above (thank you!) and changes are reflected below. If you have further comments for changes let me know, and we can close the issue when all points have been addressed.

confirmed that tags are shared between models, so we will represent them as their own model with ManytoManyRelationships to other objects. I added a unique constraint to the name of the tag, because we wouldn't want it to be duplicated in the database. Question: what do you think about enforcing lowercase only? Or any other strict format? We wouldn't want resistance:ampicillin and resistance:Ampicillin to be treated as different things. I added all lowercase - and if there is any other regular expression checking we should do let me know.
since "notes" represents a physical thing in the lab, I left it there, although I want to confirm that it's just one long text field for an instance (meaning it's just a single note, or a bunch of random text compiled to one). For README, since there are several instances of "description" I changed it to description to be consistent (since it basically means the same thing).
confirmed you will do a PR for docstrings - I assigned you to the issue.
I've never created fields that large in postgres - it looks like 100k should be reasonable but I don't know what best practice is. I checked with a bunch of colleagues, and according to them it's very commonly done to store large TextFields (that use something called TOAST for compression) as long as we don't search over them. See https://github.com/vsoch/freegenes/issues/9.
vdp is removed
Parts can belong to more than one collection
model relationships have been fixed! Wow, I did worse than if I guessed by chance. :)
With respect to "If a module's container is deleted, it gets moved to it's parent. The root may not be deleted" I assume here you mean the root container - meaning if a container doesn't have a parent, we prevent deleting it, and if any container is deleted, the modules associated are assigned to the parent. Here is how I handled that - a pre_delete signal on Container to do both:

@receiver(pre_delete, sender=Container, dispatch_uid='container_pre_delete_signal')
def protect_containers(sender, instance, using, **kwargs):
    '''The root (container) may not be deleted. If a container is deleted, it's
       modules are moved to the parent.
    '''
    # Don't allow delete of the root container
    if not instance.parent: 
        raise ProtectedError('A root container may not be deleted')

    # Move any modules belonging to the container to parent container.
    for module in instance.module_set.all():
        module.container = instance.parent
        module.save()

And this assumes that a module is assigned to just one container.

you mentioned "If a module gets deleted, it is removed from the mount. The mount is the physical place the module is on on the robot (it is necessary for generating protocols)" I assume this means that left_mount or right_mount can be None, but it doesn't mean that we delete the robot itself.
just to clarify schemas - a Schema (object) is a shared model to describe a protocol. But a schema for a module is just a json validation based on the type.
should Schema hashes (schema_hash field) be calculated on save? What kind of hash function?
I misspoke when I said "when a user is deleted, his/her associated operations are too" along with the MTA associated. I meant to say "when a user is deleted, his/her associated orders are too." I've adjusted this so that a user's orders aren't deleted when he/she is (and this would be logical to keep a record of them). And the Order isn't deleted if the MTA is (as we said, it might just need to be changed).
MTA Agreement files are going to be stored on the server, and named based on the uuid of the agreement. I've created a /var/www/data folder to hold them in an MTA folder, meaning the web server will serve them at [hostname]/data/MTA/[uuid]. There is (imperfect) protection in that the uuid needs to be known to download the file. I'm also taking the extension to be whatever the user provides, assuming it's not a compressed file (e.g., ..pdf or .docx is ok, tar.gz wouldn't be expected).
okey doke, I added Distribution back into the models - if you think it's a useful abstraction we should keep it.
I removed status from protocol, and added Failed, Interrupted, and Trashed to protocol statuses. A status for a protocol will be represented by its plan.

Just pushed updates to the models, along with the docs to include some of your comments here (also added to code comments) all changes here. I'm going to look at the other issues you addressed now!

Questions and answers

Yes, we should enforce lowercase and no spaces, no quotes (both double and single).
Notes are just one long text field. Sounds good with readme
On it after this
Text seems like it would be most appropriate for sequences. It looks like it can store up to 1 GB, which is more than enough. Put another note in #11
Great
Great
Looks good, very readable code
Yes correct
It is calculated on save with sha256, just because that is what I felt most comfortable using.
Seems like that would be harder to maintain than like a bucket or directly storing in the database (since containers running on servers should be ethereal). Can you teach me why you think hosting on a server is the best way to go?
Great
Great

Okay just added a function to generate the hash from the schema, which is called only if it's not None (which would throw an error by the Super class anyway):

import hashlib
import json

def generate_sha256(content):
    '''Generate a sha256 hex digest for a string or dictionary. If it's a 
       dictionary, we dump as a string (with sorted keys) and encode for utf-8.
       The intended use is for a Schema Hash

       Parameters
       ==========
       content: a string or dict to be hashed.
    '''
    if isinstance(content, dict):
        content = json.dumps(content, sort_keys=True).encode('utf-8')
    return "sha256:%s" % hashlib.sha256().hexdigest()

Just curious, what is it used for? Would it be a quick check (using the same function) to check if a schema is changed?

File Objects stored in Django are stored on the file system - I misunderstood you when I read this:

I think now storing files within the database makes the most sense, because we basically just have to store small text files and pdfs.

to think you meant using the Django FileField object (that points to a file). Django does provide a BinaryField but if you read the text in the green box:

Although you might think about storing files in the database, consider that it is bad design in 99% of the cases. This field is not a replacement for proper static files handling.

And I agree - if we want an ephemeral thing, we can't rely on any given filesystem. But I think I found a nice solution! For local development, we wouldn't want to require the developer to set up cloud storage - that's quite a burden to just test / develop something locally. But for production, we'd want the files to go there. I found a module, django-storage that would let us define file fields, but then based on custom logic in settings, for example, have it route to the local file system or Google Storage (or any other object storage!) What I'll do is develop with the FileObject for now, and when we move to the cloud somewhere add in the storage parts. :) It's really popular on GitHub and has frequent updates, so it's okay to use I think.

The hash was mostly there to make sure every schema is unique since I couldn't add a "unique" constraint to JSON raw data. It's sort of a hack, but worked well enough.

Got it with django-storage, sounds good to me!

Ah I'm glad I asked! We can just set unique=True and it will set the constraint for us :)

vsoch / freegenes