Figure out DB schema - GitHub Issues

nathanielrindlaub commented 4 years ago

We need to come up with data models for:

users
permission groups
projects
cameras
deployments
models
predictions
labels
...

Here's a good MongoDB resource with examples of various data models - https://docs.mongodb.com/manual/applications/data-models/

It would also be worth reviewing Wildlife Insights' schema and the COCO Camera Traps format used by MegaDetector.

As well as the table of available metadata we can extract - https://docs.google.com/spreadsheets/d/1A7psosw8mZW_vsB7gVjC-e7KVSoRFlYEyFhDvGFCM-M/edit#gid=0

nathanielrindlaub commented 4 years ago

Also, I have in my notes that @postfalk mentioned that we'll need to implement a "validation layer" for a "schema-less" MongoDB structure, but I can't remember what that means...

postfalk commented 4 years ago

Let's talk about this in person. We should think about it a little, because one thing you shouldn't do with a schema-free database is emulate a relational one. What that means in practice is something we should talk about.

nathanielrindlaub commented 4 years ago

Ok great. Let's talk tomorrow or early next week.

postfalk commented 4 years ago

3pm?

nathanielrindlaub commented 4 years ago

MongoDB has a fantastic series of online courses called MongoDB University. I've started the Data Modeling course and completed the section on relationships. The takeaways were:

One to many relationships e.g., one person can own many credit cards, each credit card belongs to just one person.

TRY TO EMBED AS MUCH AS POSSIBLE FOR SIMPLICITY - INFORMATION THAT IS NEEDED TOGETHER, STAYS TOGETHER
Usually you want to embed on the side that is most queried
Referencing is OK in situations where you don't need to access the info in the many entities too much. If you are referencing from the one side, an array of reference IDs can be good, especially if you have IDs that are descriptive enough to fulfill most basic needs (so there's typically no need to actually do the join). More commonly, you'd reference from the many side, so you don't have to go update the one side when you delete a many entity.

Many to many relationships e.g., movies and actors, carts and items

In general, try to simplify many to many relationships into one-to-many relationships, even if it involves some duplication
Like one to many, in a many to many embed, you also embed on the most queried side (cart), and you’d embed an item as a subdocument. However, it's common to keep a separate “source” collection of just the items, independent of (no reference to) carts, because you might have operations where you want to just see what the available items are. Also, items/actors can exist before being added to a cart/movie.
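
A minimal sketch of the patterns above, using plain JavaScript objects as stand-ins for MongoDB documents. Every collection name, field, and helper here (`personEmbedded`, `creditCards`, `cardsFor`, `items`, `addToCart`) is made up for illustration and does not come from the course or from our codebase.

```javascript
// One-to-many, embedded on the most-queried side:
// the person document carries its cards, so one read returns everything.
const personEmbedded = {
  _id: 'person-1',
  name: 'Ada',
  credit_cards: [
    { number: '1111', issuer: 'A' },
    { number: '2222', issuer: 'B' },
  ],
};

// One-to-many, referenced from the "many" side:
// each card points at its owner, so deleting a card never touches the person.
const people = [{ _id: 'person-1', name: 'Ada' }];
const creditCards = [
  { _id: 'card-1', owner_id: 'person-1', issuer: 'A' },
  { _id: 'card-2', owner_id: 'person-1', issuer: 'B' },
];

// The application-side "join" that referencing makes necessary.
function cardsFor(personId) {
  return creditCards.filter((card) => card.owner_id === personId);
}

// Many-to-many simplified by duplication: a separate "source" collection of
// items exists independently of carts, and carts embed copies of item fields.
const items = [
  { _id: 'item-1', title: 'Trail camera', price: 120 },
  { _id: 'item-2', title: 'SD card', price: 15 },
];

function addToCart(cart, itemId, quantity = 1) {
  const item = items.find((candidate) => candidate._id === itemId);
  if (!item) throw new Error(`Unknown item: ${itemId}`);
  // Copy the fields the cart needs; later catalog edits do not change this cart.
  cart.items.push({ item_id: item._id, title: item.title, price: item.price, quantity });
  return cart;
}

const cart = addToCart({ _id: 'cart-1', items: [] }, 'item-2', 2);
```

The trade-off is the one described above: the embedded shapes need no join to read, while the referenced shape keeps deletes cheap at the cost of a second lookup.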

It all pretty much matches up with exactly what you were telling me, @postfalk - thank you! Technically, though, MongoDB doesn't describe itself as schema-less: it does have and encourage schemas, they're just far more flexible than those of traditional relational databases. They also don't like calling relational databases "relational", because MongoDB often represents relationships too; they call them "tabular databases" instead. IDK, it's mostly just semantics, but they do kind of position themselves as a hybrid/best-of-both-worlds solution.

nathanielrindlaub commented 4 years ago

@postfalk, this is what I have so far for a first pass at an image document schema. Mongoose will compile all of these schemas into one document, so the separately defined schemas you see here, like LocationSchema, are not referenced; they get nested into the ImageSchema as sub-documents. Let me know what you think.

const { Schema } = require('mongoose');

let LocationSchema = new Schema(
  {
    location_id:    { type: String, required: true },
    description:    { type: String },
    coordinates:    { type: [Number] },
    altitude:    { type: Number },
    azimuth:    { type: Number },
  }
);

let CameraSchema = new Schema(
  {
    make:    { type: String },
    model:    { type: String },
  }
);

let DetectionSchema = new Schema(
  {
    category:    { type: String },
    conf:    { type: Number },
    bbox:    { type: [Number], required: true },
    detection_date:    { type: Date, default: Date.now, required: true },
    validated:    { type: Boolean, default: false, required: true },
    // TODO: figure out how to ID models. Might be a good candidate to reference a separate entity
    // model_id:         {},
  }
);

let LabelSchema = new Schema(
  {
    type:    { type: String, required: true },  // manual vs. ml
    conf:    { type: Number },
    bbox:    { type: [Number] },
    labeled_date:    { type: Date, default: Date.now, required: true },
    validated:    { type: Boolean, required: true },
    // TODO: figure out how to ID models. Might be a good candidate to reference a separate entity
    // model_id:         {},
  }
);

let ImageSchema = new Schema(
  {
    image_id:    { type: String, required: true },
    serial_number:    { type: String, required: true },
    file_name:     { type: String, required: true },
    file_path:    { type: String, required: true },
    date_added:    { type: Date, default: Date.now, required: true },
    date_time_original:    { type: Date, required: true },
    image_width:   { type: Number },
    image_height:    { type: Number },
    megapixels:    { type: Number },
    mime_type:    { type: String },
    user_label_1:    { type: String },
    user_label_2:    { type: String },
    camera:    { type: CameraSchema },
    location:    { type: LocationSchema },
    detections:    { type: [DetectionSchema] },
    labels:    { type: [LabelSchema] },
  }
);

postfalk commented 4 years ago

Thanks, that looks great. Thanks also for finding out that Mongoose implements application-level model validation.

Here are a few thoughts.

1) I am not sure whether we need separate detection and classification schemas, because the distinction is artificial and depends on the question. Whether we call something an "animal" or a "fox" depends on the quality of the model and the interest of the researcher (e.g. one might call an algorithm that finds foxes a detector and another one that can differentiate between individual foxes a classifier).

In terms of storing the detection vs. classification data, you could either store {'label': None, 'bbox': ...} or {'label': 'animal', 'bbox': ...}. Btw, your label model is missing a category field, while your detection model has one. I would also add a label source field, which could be either a model or a person's name (or store these under two different keys). A challenge, of course, will be keeping track of model versions, but maybe that takes it too far.

2) Not sure what the conf field is for.
3) Use "geometry": {"type": "Point", "coordinates": []} instead of coordinates to adhere to convention and allow for geometries other than points. That might come in handy later for various reasons, e.g. to obscure an exact camera position or to use Mongo's spatial query functionality (check Mongo for spatial functionality).
4) Image schema: megapixels is redundant since it is height * width. I think user_label_1 and user_label_2 come from EXIF, or is that already a third-party classification from the camera provider? In that case it should go into the classification model. I would create a field user_data (object, dict, or document) where we could put anything we would like to keep.
5) file_name and file_path should be a single string. Ideally, it would be a URL that points to a persistent location on the web. Or, if you have a situation where you might move some images around, e.g. for dev vs. production, you might allow for both relative and absolute paths, e.g. /images/foxes, where relative paths get autocompleted by a value in your settings.
6) bbox: I am really interested in looking into what the convention is. Are we storing bounding boxes as absolute values (depending on the actual image resolution) or relative ones, i.e. applicable to any resolution of that image? (I DON'T have the answer, just wondering.)
7) Hashing: I would store an image hash. There are cryptographic hashes like md5, which would not survive any change in the image file (like resolution). There are also perceptual hashes (pHash, https://www.phash.org/), which could identify images even after manipulation, but that might be a) too compute-intensive and b) unreliable, because our images might be too similar to each other. I think md5 is good for now.
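
To make the geometry, path, and hashing suggestions concrete, here is a rough sketch using only Node's built-in `crypto` module and the global `URL` class. The helper names (`md5Hex`, `toPoint`, `resolveImagePath`) and the `IMAGE_BASE_URL` setting are hypothetical placeholders, not existing project code.

```javascript
const crypto = require('crypto');

// Hypothetical setting used to complete relative paths; not from the project.
const IMAGE_BASE_URL = 'https://example.org/images/';

// Content hash of the raw image bytes. MD5 changes if any byte of the file
// changes, so a resized or re-encoded copy gets a different hash.
function md5Hex(bytes) {
  return crypto.createHash('md5').update(bytes).digest('hex');
}

// GeoJSON Point. Note that GeoJSON orders coordinates as [longitude, latitude].
function toPoint(longitude, latitude) {
  return { type: 'Point', coordinates: [longitude, latitude] };
}

// Absolute URLs pass through unchanged; relative paths are completed from the
// base setting, so dev and production can point at different locations.
function resolveImagePath(path, baseUrl = IMAGE_BASE_URL) {
  return new URL(path, baseUrl).href;
}

// Example usage with made-up values.
const record = {
  hash: md5Hex(Buffer.from('hello')),
  geometry: toPoint(-120.47, 34.45),
  path: resolveImagePath('foxes/a.jpg'),
  user_data: { user_label_1: 'north fence' },
};
```

A field shaped like `geometry` is what MongoDB's `2dsphere` index expects, which is what would later enable the spatial queries mentioned above.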

nathanielrindlaub commented 4 years ago

Hey @postfalk thanks so much for all of this excellent feedback. All super helpful. My responses to your thoughts below:

1) Good points. I totally agree and will scrap the "Detection" model in favor of the generic "Label" model. MegaDetector actually does do some high-level classification (empty, animal, vehicle, person, group of animals), so we'll have labels for images run through the detector too.

2) conf is the confidence of the prediction; it's a value returned by MegaDetector. For every image we actually get 100 predictions, but most are very, very low confidence. Right now I'm only keeping predictions that have > 80% confidence.

3) Ok great, will do.

4) Awesome, I'll get rid of megapixels. user_label_1 and user_label_2 are generic placeholders for string values users can set on cameras when they deploy them. On Reconyx and Buckeyes the custom text fields get stored in the EXIF data. But I like the user_data field idea; I'll do that.

5) Ok cool. Right now, I store the image in S3 with its hash (md5) as its object key, so the S3 bucket is completely flat, and the paths look like 028eb55d7f963521601ad51b28975bcd.jpg. I left the file_name key in the model because I thought maybe I would grab the original file name before it got changed to a hash and store that too, but I'm not sure that's useful at all. I'll get rid of it.

6) The default for bbox is an array with [x, y, box_width, box_height], and the coordinates are normalized between 0 and 1 (0,0 being the upper left corner, 1,1 bottom right). I think there's a setting that allows you to get absolute values in pixels though.

7) Should I store the hash separately even if that's what I'm using as the persistent path in S3?
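
For reference, a small sketch of the confidence cutoff and the bbox convention described above. It assumes each prediction is an object with `conf` and `bbox` fields and that `bbox` is the normalized `[x, y, box_width, box_height]` array; the function names are hypothetical, and rounding to whole pixels is my own choice, not something the detector does.

```javascript
// Keep only predictions above a confidence threshold.
// The 0.8 default mirrors the "> 80% confidence" cutoff mentioned above.
function confidentDetections(detections, minConf = 0.8) {
  return detections.filter((detection) => detection.conf > minConf);
}

// Convert a normalized [x, y, width, height] box (origin at the upper-left
// corner) into absolute pixel values for a given image size.
function toPixelBbox([x, y, width, height], imageWidth, imageHeight) {
  return [
    x * imageWidth,
    y * imageHeight,
    width * imageWidth,
    height * imageHeight,
  ].map(Math.round);
}

// Example usage with made-up predictions for a 2048 x 1536 image.
const kept = confidentDetections([
  { category: 'animal', conf: 0.95, bbox: [0.25, 0.5, 0.5, 0.25] },
  { category: 'animal', conf: 0.1, bbox: [0, 0, 0.1, 0.1] },
]);
const pixelBoxes = kept.map((d) => toPixelBbox(d.bbox, 2048, 1536));
```

Storing the normalized values and converting on read keeps the stored boxes valid for any resolution of the same image.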

Thanks again for taking such a thorough look and for all of the thoughtful feedback.

postfalk commented 4 years ago

Great
Thx
Thx
Thx
5 and 7: Consider (and maybe accommodate) the case where images might be stored outside our system.
It is a really good idea to maintain the original file name so that people can reference their pictures that might sit on an SD card. Folder names etc. are a slightly trickier question. Let's chat about that.
Not quite sure; see also above re: external storage locations.

tnc-ca-geo / animl

Figure out DB schema #2