vsoch / freegenes

FreeGenes BioNode with Django
https://vsoch.github.io/freegenes/
Mozilla Public License 2.0
2 stars 4 forks source link

Questions: August 7, models and relationships #7

Closed vsoch closed 5 years ago

vsoch commented 5 years ago

hey @Koeng101!

I hope that you are having a good week! This is the first update for FreeGenes, and it includes a lot of questions - please take the time that you need to look over. These questions pertain to models and relationships, and are important to establish the base for the application. I've ported most of the models into Django. and this first round of questions pertains to that! I'm using GitHub issues so we can keep the history of our discussion, and better include inline code. The items that we can discuss for the meeting Monday (and after in detail over email) are:

Without further adieu, here are questions about models! If Marc has a GitHub, please feel free to ping him on this issue so he is notified.

Tags

I noticed that you model a tag in the database, and I'm curious how these are used. Django has a few easy ways to represent tags associated with models, and it comes down to if we want tags shared between models, or not. My questions are:

I want to make sure we aren't adding a model to the database that doesn't serve a strong purpose. Another risk is having a bunch of random tags that aren't really meaningful.

Notes and README

A few of your models had fields for notes, e.g.,

notes = models.CharField(max_length=500)

And then others description:

And then collection has README

readme = models.CharField(max_length=500, required=True)

Are these things fundamentally different, and what is the purpose? For description it makes most sense - we'd want to store that to tell the user what the thing is, and ideally we can populate it from somewhere automatically. For example, Singularity Hub grabs descriptions for collections directly from the GitHub repos they represent. For notes, it's generally not great practice to have a random "notes" field. Is README is along the same lines? I'd like to clarify the purpose of notes, and if appropriate, change some to description fields, or something more specific that we want to capture.

Definitions

Across models, I'd like to have a basic definition in the docstring. Some are obvious, but others not! If you want to write them out in this email, or pull request to change the models file, we can do whatever works best / is easiest for you. This is important for future developers to read the code, and have an understanding of things. Here is an example of my best guess for an Author.

Another kind of definition is one associated with a multiple choice field. Across models, many of the schemas include variables that include choices for things. For example, the user can choose a part type from these:

    PART_TYPE = [
        ('full_promoter', 'full_promoter'),
        ('promoter', 'promoter'),  
        ('rbs', 'rbs'),  
        ('cds', 'cds'),
        ('vector', 'vector'),
        ('partial_seq', 'partial_seq'),
        ('linear_dna', 'linear_dna'),
        ('plasmid', 'plasmid'),
        ('terminator', 'terminator'),    
    ]

Where the first entry is the key for Django, and the second is a verbose description. Currently, across models I've just duplicated the key, but ideally (since people will be reading the second of each) we would want to have a verbose description. I don't have this knowledge, so I'm hoping you can look over the file and fill them in. For example (and this is a guess):

    PART_TYPE = [
        ('full_promoter', 'The full DNA sequence that defines where transcription of a gene by RNA polymerase begins. '),
...

Model Questions:

This is a random listing of questions about models:

Well Location Validation

I noticed that you have a schema to validate the address of a well based on a plate length and height (do I have this correct?) This would work if the entire application, forever into the future, will always use consistent dimension - but what if this changes? What I'd like to suggest to do is to have a post save signal that, when a plate is saved, generates Well objects for it,. This assumes that we should have wells represented in the database for every plate, if this isn't the case we will need a different idea. This also means we should have a length and height added as parameters to the Plate model, with defaults being the ones you originally set in your function (16 and 24). Then we don't need to validate the Well address - because it's generated when the plate is first saved and then not changed. Do you think this is a reasonable thing to do?

Model Relationships

When we define relationships between models in Django, it's customary to also define what happens when the parent is deleted. For example, if I have the following

collection = models.ForeignKey('Collection', on_delete=models.CASCADE)

The "CASCADE" says that "when the Collection is deleted, also delete this object. if I have "DO_NOTHING":

collection = models.ForeignKey('Collection', on_delete=models.DO_NOTHING)

It means exactly that :) So basically what I need to confirm are all of these relationships in the models. I'll list them below. You can say true/false or yes/no for all.

Samples:

I'm not totally clear on what a sample is, but from the other models I can deduce that it's something that goes into a well? So my question is - can the same sample exist between parts? (e.g., are part samples ManytoMany relationships or Foreign keys?)

There is a field called "evidence" - what is sample evidence? I noticed that the labels defined in the schema:

    SAMPLE_EVIDENCE = [
        ('Twist_Confirmed', 'Twist_Confirmed'),
        ('NGS', 'NGS'),
        ('Sanger', 'Sanger'),
        ('Nanopore', 'Nanopore'),
        ('Derived', 'Derived')
    ]

Don't match a comment that you wrote (before or after)? that makes a distinction between capitalization and "outside folks" eg:

    # ngs, sanger, TWIST - capitals denote outside folks

Does this mean that all options are valid, either lowercase or uppercase? So instead of "Sanger", there should be "SANGER" (outside folks) and "sanger?"

If this is the case, I'd like to suggest we have a consistent, single set of labels (capitalization set or not important) and then a boolean to indicate outside_collaborator. Just curious, why is this information important? Is there different protocol for within / outside of Stanford?

MTAs

It looks like the original setup had MTAs hosted alongside other remote files (e.g., in some cloud storage). If these come down to PDF documents (and similar) what would be the rationale to store them in cloud storage vs. just alongside the server? Either could work, and I'm interested to hear your thoughts if one is preferable.

Speaking of MTAs - the current setup has the "agreement_file" referencing a foreign key to a File object, which (based on the previous model) would be uploaded to some cloud storage. In the case that we can store the file on the server (if it's just PDF) and in the case that we don't need to share file's between MTAs (it's a 1 to 1 relationship) my suggestion here would be to remove the ForeignKey:

agreement_file = models.ForeignKey('Files', on_delete=models.CASCADE)

And model the file as a standard Django file object.

PlateSet vs. Distribution

I noticed that we have two layers of "things to send out to people":

I'm wondering why there is the need to have this extra level of grouping, for example, why not just have platesets represented in orders, and remove the distribution type? Or better, in the case that we want to distribute more than sets of plates, have a general distribution model that uses ContentTypes to allow for different kinds of objects to be represented. If it's the case that the only thing being sent out is groups of plates, we might not need this extra layer.

Organism Control

You had a comment about "organism control" that I wanted to ask about -

    # organism = db.Column(db.String) # IMPLEMENT ORGANISM CONTROL

Is this something we can do now, or a future TODO (in which case I'll create an issue on the repository for it).

Wells

The original model had samples required for wells, but why is a sample required (i.e. couldn't a well be empty)? I modeled the Sample to point to a well instead, and if we delete the sample we don't delete the well. In this case, it shouldn't be required to allow for an empty well.

# Volume is listed as required - where does it come from? If it's added
# when we fill a well, it shouldn't be required. If it's required,
# it would need to be generated by plate post_save, so the plate would
# need to include it's volume per well. The same is true for media.

Schemas

I noticed that you modeled a Schema as it's own model in database, and then for Protocols you linked to one. It would be logical to do this if schemas are shared between protocols (like templates) but it wouldn't make sense to have this if a schema is like a snowflake - every one different. We could just use a JSONField (if json) or text or file for something like that.

Protocols

Is there any reason to not have a boolean for the status of a protocol (e.g., something like is_executed=True/False, and then if False, we assume it's planned). Also, for protocol fields required, one of the fields was "protocol" but I don't see it:

    # protocol_required = ['protocol','schema_uuid']
    # QUESTION: for protocol required you had protocol but I don't see a field. Did you mean plates? or the data?

Should it be something else?

Plans

I noticed that Plans have similar statuses to Protocols - Executed, Planned, etc. Should there be a common set of statuses between these (and other models) that are shared?

    # These are similar to plate statuses - are these global statuses that should
    # be present across protocols, plates, samples, wells?
    PLAN_STATUS = [
        ('Executed','Executed'), 
        ('Trashed','Trashed')
        ('Planned','Planned')
    ]

And what's the rationale for "Trashed" - wouldn't we just delete the object? I have a cool setup to allow different objects to be included in plans, I hope it works! :)

Next Steps

I have quite a bit to work on before needing any of this feedback, user stories, or data export, so please no rush! I am going to be thinking about different authentication backends, and what will make sense for development, vs. admin login, vs. user login. Since orcid-id and institution are present in some fields, it might make sense to try something like Incommon or just login with orcid id. After this round of work and testing authentication, I'll have a sense of the user stories and can start on the front end views. Then after that we can discuss your current shipping / order methods and how we can make something work that doesn't store PI in the database. But anyway - I'm getting ahead of myself! For now - don't worry about the next steps, just think about the questions in this issue. Please take all the time that you need, and ask for any clarification that you need! If you want to tackle one chunk at a time, when you have time, in the comments below, that works for me :) Also, if you'd prefer many smaller issues (and then organization with a Milestone) I can do that for next time.

Koeng101 commented 5 years ago

Tags

Tags are little bits of information that come from experience / science. For example, we may tag a plasmid part as conferring ampicillin resistance, so it would include the tag resistance:ampicillin. That same part could also be tagged as an essential_gene - and so basically the tags are ways to organize the different categories they are associated with.

I do actually think tags should be stored between models. We may want to find all things associated with mesoplasma florum. That may include the collections that are associated with Mesoplasma florum, the parts associated with Mesoplasma florum, or even perhaps the authors associated with that category.

I think proper tagging of genes is perhaps one of the most important pieces of information we can create.

Notes and README

Notes are on physical objects in lab, and basically function to replace sticky notes (if you look at modules it has some examples). Readme is much more like a description, so doesn't really change too much over time.

Definitions

I'll do a pull request later for docstrings.

Also will for the verbose descriptions on the multiple choice fields.

Models

Sequence

Containers and modules

Plates and samples

Well Locations

Completely reasonable, though we would be generating a lot of extra wells. I had the idea of doing that originally but never got around to it.

Model Relationships

As you can tell, there is a large bias towards not deleting things, mostly because physical things continue to exist in the world.

Samples

In the biological world, we can have a piece of DNA that is hypothetically matches what we have on a computer, but most of the time we don't really know. A sample is simply a part that we have some level of confidence exists in the real world.

For example, if we have some frozen bacteria, and scrape a little off and put it into a new tube to grow overnight, the new growth is not the same sample as the frozen bacteria (maybe there was some contamination, or maybe we bottlenecked the population so there are more mutations). We think it's status is confirmed, but the only evidence we have is that it was "Derived" from a confirmed sample.

However, if now we take that new growth and split it between 10 wells and freeze them, all those new frozen bacteria are the same sample (they theoretically are all the same).

Evidence

At very best, we have sequencing data for a sample in our lab. This is the "evidence" we have for the state of a sample. The next best is someone else sequencing the sample (in which we are often not privy to the data), and the worst is if the evidence of the state of a sample is just that it is Derived from something else.

Yes, a boolean for outside_collaborator works (that is sort of what I was thinking with vendor).

Importance: Each sequencing method is different (Sanger kind of sucks, and so I barely trust it, while NGS is very strong, and so I completely trust it) which is important. Sequencing providers we also have more or less confidence in: In my professional experience (of both ordering and watching others order) sometimes we get samples that just aren't that we asked for. If we decide to resequence some samples from X and see a 50% mutation rate, we know that we should probably check out the rest of the samples from X.

MTAs

Files are stored in cloud storage right now because originally, I stored sequencing files there, and decided to store the rest of the files in the same way because it worked. However, I moved sequencing files out of the main app / database into a different service.

I think now storing files within the database makes the most sense, because we basically just have to store small text files and pdfs.

PlateSet vs Distribution

A plateset is actually just a collection of identical plates. A distribution is a collection of platesets.

When you order a distribution, you are technically ordering one plate from each one of the platesets in the distribution. Only once a shipment comes around does it become concrete which plates are physically getting sent out.

A the minimium, a distribution can be a single plateset. At most, a distribution can be all platesets (and there may be overlap)

The only thing that is being sent out is 1 or more plates, but I think the abstraction is useful.

Organism Control

That comment and the "organism" column may be deleted now that organism_uuid is there. Just hadn't gotten around to that either.

Wells

Yes, it should not be that samples are required for wells, I'm not sure why that is in there.

Schemas

Schemas are shared between types of protocols, and so aren't supposed to be special snowflakes.

Protocols

I messed up on that requirement, my bad. Should require data, not protocol.

The idea for the statuses is that there could be other statuses like "Failed", "Interrupted", "Trashed" or the like. For now, a boolean works.

Plans

Plans and Protocols are the 2 I think share that, and it actually should be possible to just not store status in protocol and instead store it in the plan.

My hope is that even if a plan is trashed it can still be tracked! What if we executed a few plans in an operation but decided to stop? The children are trashed, but the others are executed, and we can see exactly where we stopped.

Next steps

Next call let me know of our options for login / authentication. I'd like it to be open source if possible, but understand if that's not possible. Cheers!

vsoch commented 5 years ago

Thanks @Koeng101 this looks fantastic! I'm closing up shop for today (need to make dinner) but I'll start on this first thing tomorrow. For authentication, I thought it would be easy enough to allow for the deployment to choose from any number of oauth2 (e.g., think Orcid, GitHub, Google, Twitter, etc.) or an institutional login (with SAML, requires more setup). Orcid would be a good fit for OAuth I think because we want to keep researcher orcid ids anyway. I'm also going to look into some of the federations that manage SAML configurations, haven't done that yet! For testing, likely I'll just plug in something easy like GitHub or Twitter.

vsoch commented 5 years ago

Updates August 13th

@Koeng101 I've addressed your comments above (thank you!) and changes are reflected below. If you have further comments for changes let me know, and we can close the issue when all points have been addressed.

@receiver(pre_delete, sender=Container, dispatch_uid='container_pre_delete_signal')
def protect_containers(sender, instance, using, **kwargs):
    '''The root (container) may not be deleted. If a container is deleted, it's
       modules are moved to the parent.
    '''
    # Don't allow delete of the root container
    if not instance.parent: 
        raise ProtectedError('A root container may not be deleted')

    # Move any modules belonging to the container to parent container.
    for module in instance.module_set.all():
        module.container = instance.parent
        module.save()

And this assumes that a module is assigned to just one container.

Just pushed updates to the models, along with the docs to include some of your comments here (also added to code comments) all changes here. I'm going to look at the other issues you addressed now!

Koeng101 commented 5 years ago

Questions and answers

  1. Yes, we should enforce lowercase and no spaces, no quotes (both double and single).
  2. Notes are just one long text field. Sounds good with readme
  3. On it after this
  4. Text seems like it would be most appropriate for sequences. It looks like it can store up to 1 GB, which is more than enough. Put another note in #11
  5. Great
  6. Great
  7. Looks good, very readable code
  8. Yes correct
  9. It is calculated on save with sha256, just because that is what I felt most comfortable using.
  10. Seems like that would be harder to maintain than like a bucket or directly storing in the database (since containers running on servers should be ethereal). Can you teach me why you think hosting on a server is the best way to go?
  11. Great
  12. Great
vsoch commented 5 years ago

Okay just added a function to generate the hash from the schema, which is called only if it's not None (which would throw an error by the Super class anyway):

import hashlib
import json

def generate_sha256(content):
    '''Generate a sha256 hex digest for a string or dictionary. If it's a 
       dictionary, we dump as a string (with sorted keys) and encode for utf-8.
       The intended use is for a Schema Hash

       Parameters
       ==========
       content: a string or dict to be hashed.
    '''
    if isinstance(content, dict):
        content = json.dumps(content, sort_keys=True).encode('utf-8')
    return "sha256:%s" % hashlib.sha256().hexdigest()

Just curious, what is it used for? Would it be a quick check (using the same function) to check if a schema is changed?

File Objects stored in Django are stored on the file system - I misunderstood you when I read this:

I think now storing files within the database makes the most sense, because we basically just have to store small text files and pdfs.

to think you meant using the Django FileField object (that points to a file). Django does provide a BinaryField but if you read the text in the green box:

Although you might think about storing files in the database, consider that it is bad design in 99% of the cases. This field is not a replacement for proper static files handling.

And I agree - if we want an ephemeral thing, we can't rely on any given filesystem. But I think I found a nice solution! For local development, we wouldn't want to require the developer to set up cloud storage - that's quite a burden to just test / develop something locally. But for production, we'd want the files to go there. I found a module, django-storage that would let us define file fields, but then based on custom logic in settings, for example, have it route to the local file system or Google Storage (or any other object storage!) What I'll do is develop with the FileObject for now, and when we move to the cloud somewhere add in the storage parts. :) It's really popular on GitHub and has frequent updates, so it's okay to use I think.

Koeng101 commented 5 years ago

The hash was mostly there to make sure every schema is unique since I couldn't add a "unique" constraint to JSON raw data. It's sort of a hack, but worked well enough.

Got it with django-storage, sounds good to me!

vsoch commented 5 years ago

Ah I'm glad I asked! We can just set unique=True and it will set the constraint for us :)