Restructuring the model

tdwg / camtrap-dp

Camera Trap Data Package (Camtrap DP)

https://camtrap-dp.tdwg.org

MIT License


Restructuring the model #203

Closed peterdesmet closed 1 year ago

peterdesmet commented 2 years ago

In a discussion with @tucotuco on how to better align Camtrap DP with a common model for biodiversity data, a proposal came up on how to better structure sequences in Camtrap DP.

Preamble

For the purpose of this discussion, I want to clarify what we mean by a sequence here:
1. It is a group of media files
2. It can be used as the basis of an observation (i.e. the group of image files was assessed as a whole, not unlike the frames of a video). The alternative is image-based observations, which come with some benefits (see point 4).
3. It is created after the media files were captured, based on a pre-defined sequence interval: "Maximum number of seconds between timestamps of successive media files to be considered part of a single sequence". As a result, a sequence can contain multiple triggers/bursts. The sequence interval is not a camera setting, but a setting of the programme used to manage the images afterwards.
4. It can be used as an "event" for biological analysis. The downside of sequence-based observations is that you are stuck with the sequence interval settings that were chosen. With image-based observations you can choose yourself how to group images together in logical events based on their timestamp.
5. Sequences typically don't result in a physical file. If they did, they would be like a gif/video looping through the originating files.
This proposal is not about whether image-based observations are better than sequence-based observations. The current situation is that both approaches exist (and likely will for a while) and Camtrap DP wants to support both.
The examples show how the data would look for 3 images, using image-based vs sequence-based observations. In the first 2 images a wild boar (Sus scrofa) can be seen.

Current situation 0

Sequences only exist as identifiers (sequenceID), in both media and observations.
Observations have a sequenceID and mediaID, which are both foreign keys to the media table. Image-based observations need to populate both, sequence-based observations only sequenceID. As a result, joins between observations and media are conditional: you kinda need to know what key to use to make a join that will yield results. That is not great.
Because the join over media is conditional, we added the convenience terms deploymentID and timestamp to observations, so that they can be easily joined with deployments - without having to go over media - to get useful biological data (location, time, species).

Image-based observations

media.csv
mediaID | sequenceID | deploymentID | timestamp           | filePath
------- | ---------- | ------------ | ------------------- | --------
med1    | void_seq1  | dep1         | 2020-01-01T00:00:00 | med1.jpg
med2    | void_seq1  | dep1         | 2020-01-01T00:00:01 | med2.jpg
med3    | void_seq1  | dep1         | 2020-01-01T00:00:02 | med3.jpg

observations.csv
observationID | sequenceID | mediaID | deploymentID | timestamp           | observationType | scientificName | count | countNew
------------- | ---------- | ------- | ------------ | ------------------- | --------------- | -------------- | ----- | --------
obs1          | void_seq1  | med1    | dep1         | 2020-01-01T00:00:00 | animal          | Sus scrofa     | 1     | 1
obs2          | void_seq1  | med2    | dep1         | 2020-01-01T00:00:01 | animal          | Sus scrofa     | 1     | 0
obs3          | void_seq1  | med3    | dep1         | 2020-01-01T00:00:02 | blank           | NULL           | NULL  | NULL

Sequence-based observations

media.csv
mediaID | sequenceID | deploymentID | timestamp           | filePath
------- | ---------- | ------------ | ------------------- | --------
med1    | seq1       | dep1         | 2020-01-01T00:00:00 | med1.jpg
med2    | seq1       | dep1         | 2020-01-01T00:00:01 | med2.jpg
med3    | seq1       | dep1         | 2020-01-01T00:00:02 | med3.jpg

observations.csv
observationID | sequenceID | mediaID | deploymentID | timestamp           | observationType | scientificName | count | countNew
------------- | ---------- | ------- | ------------ | ------------------- | --------------- | -------------- | ----- | --------
obs1          | seq1       | NULL    | dep1         | 2020-01-01T00:00:00 | animal          | Sus scrofa     | 1     | NULL
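To make the conditional join concrete, here is a minimal Python sketch using the rows from the tables above. The function name media_for_observation and the dict layout are illustrative assumptions, not part of Camtrap DP or any existing library.

```python
# Media rows from the "Current situation" tables (only the columns needed here).
IMAGE_BASED_MEDIA = [
    {"mediaID": "med1", "sequenceID": "void_seq1", "filePath": "med1.jpg"},
    {"mediaID": "med2", "sequenceID": "void_seq1", "filePath": "med2.jpg"},
    {"mediaID": "med3", "sequenceID": "void_seq1", "filePath": "med3.jpg"},
]
# Same files, but grouped under the sequence identifier used in the second example.
SEQUENCE_BASED_MEDIA = [{**row, "sequenceID": "seq1"} for row in IMAGE_BASED_MEDIA]


def media_for_observation(observation, media):
    """Return the file paths an observation is based on (hypothetical helper).

    The key to join on depends on how the observation was made, which is
    exactly the conditional join described above.
    """
    if observation["mediaID"] is not None:
        # Image-based observation: mediaID points at a single media row.
        return [m["filePath"] for m in media if m["mediaID"] == observation["mediaID"]]
    # Sequence-based observation: only sequenceID is populated.
    return [m["filePath"] for m in media if m["sequenceID"] == observation["sequenceID"]]


image_obs = {"observationID": "obs1", "sequenceID": "void_seq1", "mediaID": "med1"}
sequence_obs = {"observationID": "obs1", "sequenceID": "seq1", "mediaID": None}

print(media_for_observation(image_obs, IMAGE_BASED_MEDIA))
print(media_for_observation(sequence_obs, SEQUENCE_BASED_MEDIA))
```

Joining the image-based observation on sequenceID instead would return all three files, which is why the consumer needs to know which key applies.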

Suggested change 1

In media.csv

Sequences are considered media (not unlike videos), they get their own rows in media.csv. The definition of that table becomes something along the lines of:

Table with media files associated with a deployment (deploymentID). Media files can be captured by the camera trap (images/videos) or created afterwards by grouping files.
Image and video files are still listed in media.csv. They have an optional parentMediaID to associate them with sequences. That allows joins to find the images that belong to a sequence.
Including sequence rows is entirely optional: there is no need to include them if you only have image-based observations. You could, if you want to convey somehow what grouping the system "used", but since they are not used as a basis of observation, you can leave the grouping into "events" entirely up to the user. All the information is there to do it.
filePath and fileMediaType become optional fields. They are typically not populated for sequence rows.

In observations.csv

Observations are only linked via mediaID. That media row can be a single image (image-based observations) or a sequence. This is a huge benefit, as it no longer requires conditional joins.
Observations are no longer directly linked to deployments, to make it clear that they are derived from media objects. Since joins with media.csv are no longer conditional, it's quite easy to join observations -> media -> deployments. You do have to download the media.csv to do the join though.
Observations no longer require a timestamp field. That information can be found in media.

Most importantly, we think this model better represents the actual situation with camera traps: deployments → generate media → generate observations

Image-based observations

media.csv
mediaID | parentMediaID | deploymentID | timestamp           | filePath
------- | ------------- | ------------ | ------------------- | --------
med1    | NULL          | dep1         | 2020-01-01T00:00:00 | med1.jpg
med2    | NULL          | dep1         | 2020-01-01T00:00:01 | med2.jpg
med3    | NULL          | dep1         | 2020-01-01T00:00:02 | med3.jpg

observations.csv
observationID | mediaID | observationType | scientificName | count | countNew
------------- | ------- | --------------- | -------------- | ----- | --------
obs1          | med1    | animal          | Sus scrofa     | 1     | 1
obs2          | med2    | animal          | Sus scrofa     | 1     | 0
obs3          | med3    | blank           | NULL           | NULL  | NULL

Sequence-based observations

media.csv
mediaID | parentMediaID | deploymentID | timestamp           | filePath
------- | ------------- | ------------ | ------------------- | --------
seq1    | NULL          | dep1         | 2020-01-01T00:00:00 | NULL      <---- NEW ROW
med1    | seq1          | dep1         | 2020-01-01T00:00:00 | med1.jpg
med2    | seq1          | dep1         | 2020-01-01T00:00:01 | med2.jpg
med3    | seq1          | dep1         | 2020-01-01T00:00:02 | med3.jpg

observations.csv
observationID | mediaID | observationType | scientificName | count | countNew
------------- | ------- | --------------- | -------------- | ----- | --------
obs1          | seq1    | animal          | Sus scrofa     | 1     | NULL
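As a rough illustration of suggested change 1, the Python sketch below joins observations to media on mediaID only, then resolves a sequence row to its image files via parentMediaID. The helper names are hypothetical; the rows are taken from the sequence-based tables above.

```python
# Rows from the sequence-based example of suggested change 1.
MEDIA = [
    {"mediaID": "seq1", "parentMediaID": None, "filePath": None},
    {"mediaID": "med1", "parentMediaID": "seq1", "filePath": "med1.jpg"},
    {"mediaID": "med2", "parentMediaID": "seq1", "filePath": "med2.jpg"},
    {"mediaID": "med3", "parentMediaID": "seq1", "filePath": "med3.jpg"},
]
OBSERVATIONS = [
    {"observationID": "obs1", "mediaID": "seq1", "scientificName": "Sus scrofa"},
]


def join_observations_to_media(observations, media):
    """Unconditional join: every observation matches exactly one media row."""
    by_id = {m["mediaID"]: m for m in media}
    return [(o["observationID"], by_id[o["mediaID"]]) for o in observations]


def files_for_media(media_id, media):
    """Return the physical files behind a media row (hypothetical helper).

    A sequence row has no filePath of its own, so its files are the rows
    that name it as parentMediaID; an image row is its own file.
    """
    children = [m["filePath"] for m in media if m["parentMediaID"] == media_id]
    if children:
        return children
    own = next(m for m in media if m["mediaID"] == media_id)
    return [own["filePath"]]


for observation_id, media_row in join_observations_to_media(OBSERVATIONS, MEDIA):
    print(observation_id, files_for_media(media_row["mediaID"], MEDIA))
```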

Suggested change 2 (a less drastic update to the current situation)

This was suggested in https://github.com/tdwg/camtrap-dp/issues/203#issuecomment-1046656754. Comments above that are about suggested change 1 only.

ben-norton commented 2 years ago

I agree.

peterdesmet commented 2 years ago

Implemented at #204. To be discussed.

Open questions and my preference:

[x] should parentMediaID be empty when there is no parent? _yes, even when queries would be easier if it were populated, see: https://github.com/inbo/movepub/blob/71cd323b3b5af0c287c60b22ff1b34f38160054b/inst/sql/camtrap-dp/dwc_multimedia.sql#L58-L68_
[ ] rename "media" -> "evidence"? no
[ ] rename "media file" -> "media file or sequence"? no
[ ] rename captureMethod -> creationTechnique? yes
[ ] rename project level captureMethod perhaps
[x] rename parentMediaID -> sequenceID? no
[ ] use a start and end timestamp. no, can be derived from the media files (except video) and would needlessly inflate data.
[ ] remove project level classificationLevel? no, useful to know. Allow 2 values?
[ ] make project level sequenceInterval optional? yes, it is no longer a required field for image-based observations
[ ] ~rename start -> startTimestamp yes, and also in deployments~
[ ] ~rename end -> endTimestamp yes, and also in deployments~
[ ] skos for obs:mediaID remains http://purl.org/dc/terms/identifier? yes
[ ] skos for media:parentMediaID? don't know
[ ] skos for media:deploymentID = eventID? yes
[ ] ~skos for media:start is http://rs.tdwg.org/ac/terms/startTimestamp? tempting, but not really a ROI~
[ ] ~skos for media:end is http://rs.tdwg.org/ac/terms/endTimestamp? tempting, but not really a ROI~

yliefting commented 2 years ago

I'm in favor of this change. It simplifies things. Personally I need to get used to term mediaID but if you think of sequences as frames that could just as well have been a video it's easy to understand.

timrobertson100 commented 2 years ago

I also find this intuitive in terms of the class layout and relationships.

Sequences are considered media (not unlike videos), they get their own rows in media.csv

~A video is a piece of media, with a binary serialization in a format, while here it's really just a field to group individual media files. Looking at the possible terms, the only ones you'd anticipate to be relevant are the timestamp and possibly the comments and captureMethod. Would they ever exist on the sequence row or in different ways to the image media rows?~

~What I'm wondering is if having a row for the sequence brings any benefit, say to e.g. keeping the sequenceID column which is very intuitive.~ (answering my own question. It's needed to simplify the observation join)

Out of curiosity - are images ever manipulated, e.g. cropping out a section and creating a new image? If so, the parentMediaID seems very appropriate and intuitive.

I think it might be useful to add a type to Media (Image, SequenceOfImages, Video) to remove any assumptions. At the moment, you need to infer that because parentMediaID=null then it's a sequence, but if people create sub-images (e.g. cropping, adjusting brightness etc) that may not hold true.

peterdesmet commented 2 years ago

are images ever manipulated, e.g. cropping out a section and creating a new image? If so, the parentMediaID seems very appropriate and intuitive.

@timrobertson100 not necessarily to create a new physical medium. But subsections (bounding boxes) of images are quite commonly produced, e.g. by AI, to indicate where in the image it noticed an animal. That info can currently not be captured in Camtrap DP v1 (needs more thought). One option is to represent those as sub-images (with a parentMediaID), but more likely is adding a bounding box field to the observation.

I think it might be useful to add a type to Media (Image, SequenceOfImages, Video) to remove any assumptions.

I agree. parentMediaID=null is not a good filter, because a dataset with image-based observations (only) would only contain images with parentMediaID=null too. Options:

capture/creationMethod (existing field): time lapse, motion triggered, add sequence. Change definition from "capture" to "how was it created". Also solves the issue of having to assign time lapse/motion triggered to the sequence level (what if it contains mixed children?)
fileMediaType (existing field): image/jpeg, video/mp4, could add something like application/sequence. A bit of a hack, can one create mediaTypes? Solves same issue as captureMethod for mixed children.
type (new field): clearer, with control over vocabulary, but yet another field with similar information. Does not solve the problem with mixed children, but one could argue that those fields could be set to NULL or the most occurring one...

timrobertson100 commented 2 years ago

Mainly for reasons of keeping things intuitive, and to avoid mixing concepts I'd favor a type (or similarly named) field.

By mixing concepts, I mean that capture is related to what happened in the field to "trigger" the media existing, fileMediaType is about the encoding of the binary stream and sequence is really just a grouping of items largely for data management purposes (i.e. allow you to refer to a grouping of items in an annotation). Those seem like separate concerns to me which warrant their own field.

Aside: this model implies media would only ever exist in a single sequence unless you duplicate media records with e.g. the same filename (meaning observations are based on an image in a particular sequence and not on the image itself). I don't know enough to comment if that is appropriate.

jimcasaer commented 2 years ago

Regarding the second remark: that looks right to me. An image only exists in a single sequence. However, the same image can be the source for two different observations.

For me it is still a little bit confusing that, if I get it right, in the new data model media.csv contains some records referring to single images and other records referring to sequences that contain images listed in the same media.csv table. It looks to me like two different levels of information are contained within the same table. Not being a data scientist, this is the first time I encounter this kind of mixed-levels table in a data model :-)

tucotuco commented 2 years ago

It is a common modeling pattern to include multiple subtypes of an entity within a single table and to distinguish them with a type field to avoid having to create additional tables or hierarchical structures. Here that pattern seems well justified. Another part of that pattern is to name the type field based on the table it is in and the concept it represents so that it can stand alone without context in a data dictionary (a glossary of terms). Based on these practices, I would recommend the term be adopted and that it be called "mediaType".

ben-norton commented 2 years ago

This may be a bit overly cautious, but I'd opt for acquisitionType instead of mediaType to avoid confusion/overlap with the common use of mediaType as a reference to the MIME Media Types.

tucotuco commented 2 years ago

@ben-norton I think this probably arises from the media table serving multiple roles for the sake of simplification. I agree that the mediaType should be limited to media types - digital results. I think that still needs to be there. To me the acquisitionType is a statement about the event (something not explicitly modeled by the Camtrap DP structure) that generated the result. In a model that expresses this activity explicitly, I would indeed include something to specify that. In the GBIF publishing model we're doing in parallel, that would be an eventType.

peterdesmet commented 2 years ago

@tucotuco: It is a common modeling pattern to include multiple subtypes of an entity within a single table and to distinguish them with a type field to avoid having to create additional tables or hierarchical structures.

I'm not sure mixing (sub)types is that common. To me it is the biggest icky factor in an otherwise elegant proposal (cf. comments by @jimcasaer @timrobertson100). I'd therefore like to suggest an approach that deviates less from the current situation. For clarity, I'm also naming the proposals:

0. Current situation
1. Suggested change (with parentMediaID)
2. Suggested change below

Suggested change 2 (a less drastic update to the current situation)

Use the linear model (cf. suggested change 1): deployments -> media -> observations. No deploymentID or timestamp shortcuts in observations.
Keep sequenceID in observations.csv and media.csv (cf current situation).
Link observations.csv to media.csv either via mediaID or sequenceID. The fields should not be populated together. This is a change from the current situation, where sequenceID (even when not used) has to be populated, and it gives equal weight to both approaches. It does make joins conditional, but it is a pretty clear WHERE obs.mediaID IS NOT NULL vs WHERE obs.sequenceID IS NOT NULL
media.csv does not mix concepts: it only contains physical files. Sequences are a grouping identifier sequenceID (cf. current situation).
Image-based observations should not convey any sequence grouping in the dataset: the advantage of such a dataset is that the user can define such event lengths.

Image-based observations

media.csv
mediaID | sequenceID | deploymentID | timestamp           | filePath
------- | ---------- | ------------ | ------------------- | --------
med1    | NULL       | dep1         | 2020-01-01T00:00:00 | med1.jpg
med2    | NULL       | dep1         | 2020-01-01T00:00:01 | med2.jpg
med3    | NULL       | dep1         | 2020-01-01T00:00:02 | med3.jpg

observations.csv
observationID | mediaID | sequenceID | observationType | scientificName | count | countNew
------------- | ------- | ---------- | --------------- | -------------- | ----- | --------
obs1          | med1    | NULL       | animal          | Sus scrofa     | 1     | 1
obs2          | med2    | NULL       | animal          | Sus scrofa     | 1     | 0
obs3          | med3    | NULL       | blank           | NULL           | NULL  | NULL

Sequence-based observations

media.csv
mediaID | sequenceID | deploymentID | timestamp           | filePath
------- | ---------- | ------------ | ------------------- | --------
med1    | seq1       | dep1         | 2020-01-01T00:00:00 | med1.jpg
med2    | seq1       | dep1         | 2020-01-01T00:00:01 | med2.jpg
med3    | seq1       | dep1         | 2020-01-01T00:00:02 | med3.jpg

observations.csv
observationID | mediaID | sequenceID | observationType | scientificName | count | countNew
------------- | ------- | ---------- | --------------- | -------------- | ----- | --------
obs1          | NULL    | seq1       | animal          | Sus scrofa     | 1     | NULL
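The two WHERE clauses mentioned above can be combined into a single query. The sketch below is one possible way to do that with Python's built-in sqlite3 module and an in-memory database; the table layout mirrors the example tables, while the function name link_observations and the UNION ALL approach are assumptions for illustration only.

```python
import sqlite3

# One branch per linking key; each observation populates only one of them.
LINK_QUERY = """
SELECT o.observationID, m.mediaID
FROM observations AS o JOIN media AS m ON o.mediaID = m.mediaID
WHERE o.mediaID IS NOT NULL
UNION ALL
SELECT o.observationID, m.mediaID
FROM observations AS o JOIN media AS m ON o.sequenceID = m.sequenceID
WHERE o.sequenceID IS NOT NULL
ORDER BY 1, 2
"""


def link_observations(media_rows, observation_rows):
    """Return (observationID, mediaID) pairs for either kind of observation."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE media (mediaID TEXT PRIMARY KEY, sequenceID TEXT)")
    con.execute(
        "CREATE TABLE observations "
        "(observationID TEXT PRIMARY KEY, mediaID TEXT, sequenceID TEXT)"
    )
    con.executemany("INSERT INTO media VALUES (?, ?)", media_rows)
    con.executemany("INSERT INTO observations VALUES (?, ?, ?)", observation_rows)
    rows = con.execute(LINK_QUERY).fetchall()
    con.close()
    return rows


# Image-based example: sequenceID is NULL (None) everywhere.
image_media = [("med1", None), ("med2", None), ("med3", None)]
image_obs = [("obs1", "med1", None), ("obs2", "med2", None), ("obs3", "med3", None)]

# Sequence-based example: observations only carry sequenceID.
sequence_media = [("med1", "seq1"), ("med2", "seq1"), ("med3", "seq1")]
sequence_obs = [("obs1", None, "seq1")]

print(link_observations(image_media, image_obs))
print(link_observations(sequence_media, sequence_obs))
```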

tucotuco commented 2 years ago

@peterdesmet I understand what you are trying to do, and even why. It only makes me cringe from a database modeling perspective where in SQL databases one tries to achieve the highest reasonable Normal Form (https://en.wikipedia.org/wiki/Database_normalization#Normal_forms) to protect against redesign problems with changes that might come in the future.

In Suggested Change 2 you are treating sequences as properties (albeit properties of two distinct entities), not as identifiers of an entity to use in the role of a key. The reason you can "get away with that" is that sequences have no non-identifying properties. So the thing that worries me (the "cringe factor") is that you are painting yourself into a corner. If you ever do add non-identifying properties to sequences in the future, you will have to repeat that information in media.csv or observations.csv or both, or add a sequence.csv with relationships to media and observations, and thereby change the structure in a way that will break existing implementations. Suggested change 1 doesn't overcome future-proofing sequences either, by the way, it treats them as one of the types of media with no properties of their own.

For demonstration only, a model that would future-proof sequences (and be in 5th normal form - 5NF) would be something like the following:

sequence.csv
sequenceID | deploymentID | starttimestamp
---------- | ------------ | -------------------
seq1       | dep1         | 2020-01-01T00:00:00
seq2       | dep1         | 2020-02-01T00:00:00

media.csv
mediaID | sequenceID | timestamp           | filePath
------- | ---------- | ------------------- | --------
med1    | seq1       | 2020-01-01T00:00:00 | med1.jpg
med2    | seq1       | 2020-01-01T00:00:01 | med2.jpg
med3    | seq1       | 2020-01-01T00:00:02 | med3.jpg

observations.csv
observationID | observationType | scientificName | count | countNew
------------- | --------------- | -------------- | ----- | --------
obs1          | animal          | Sus scrofa     | 1     | 1
obs2          | animal          | Sus scrofa     | 1     | 0
obs3          | blank           | NULL           | NULL  | NULL
obs4          | animal          | Sus scrofa     | 1     | NULL

mediaobservation.csv
mediaID | observationID
------- | -------------
med1    | obs1
med2    | obs2
med3    | obs2

sequenceobservation.csv
sequenceID | observationID
---------- | -------------
seq2       | obs4

jniedballa commented 2 years ago

Commenting here as a relative outsider to the project. Overall I think this goes in the right direction: deployments create media, media lead to observations. In my opinion sequences are an artificial add-on without any real benefits, but I never used them myself and also don't really know how sequences are meant to be used in this standard, so I may be missing important points. Below are some general notes, concerns and questions to consider, and a suggestion for a somewhat different database system that may help accommodate sequences and other things. Apologies for a long post ahead.

Conceptual concerns

event definition for sequences: strictly speaking the actual event is the camera being triggered. So the natural way to create sequences/ group events would be via metadata which identify the trigger events. Such metadata info is not standardized and often not accessible though, which prevents this approach for many camera models.
sequence_interval as a workaround: I understand the motivation, but it is arbitrary, and hence deviates from the actual observation process already. It may accidentally join independent trigger events as a single sequence, or may separate images taken within a single trigger event (if e.g. 5 images are taken per trigger event and one happens to be a bit late). This may not be a big issue in practical terms, but conceptually it is in my opinion. It makes sequences arbitrary and artificial.
sequence_interval should be saved in the project metadata
image metadata (timestamps, EXIF data) and annotations are specific to images, not sequences. A sequence is not a point in time, but a point in time with a duration. Properties can change during sequences (e.g. number of individuals) and it may be difficult to save that appropriately in sequence-based annotation.

Practical concerns

I agree media should only be files. Sequences are collections of files and thus on a different level of hierarchy. I believe they should be an independent entity. Considering images and sequences to be the same for convenience seems inconsistent.
most scientific analyses will later require another grouping / aggregation step, e.g. creating "independent" events (using another arbitrary threshold), or the multi-day sampling occasions in distribution models. So sequences as suggested here are an intermediate step at most. The analytical or practical benefits are not clear to me, but there are a number of disadvantages.
should image- and sequence-based annotation exist in parallel, or are they mutually exclusive for a given data set?

I see three possible cases (with their data relationships):

Option A: strictly image-based annotation
- deployment --(deployment ID)-> media --(media ID)-> observations
Option B: sequence-based annotation (media files are present):
- deployment --(deployment ID)-> media --(media ID)-> sequences --(sequence ID) -> observations
Option C: sequence-based annotation (media files are not present):
- deployment --(deployment ID)-> sequences --(sequence ID) -> observations

A: easiest option. No sequences needed at all. If for some compatibility reason it is necessary to always have a sequence table, each media item can be considered a separate sequence and data structure would be identical to B (it would be redundant and a bit silly though).

B: can be created automatically from image-based annotation in A using sequence_interval (see below). It would only introduce an intermediate sequence table and sequence IDs in the observation table. If B is created from A, then B still implies A (as long as observations in B retain their mediaID). Not sure if that is relevant.

C: is this even necessary (can media.csv be missing)? Maybe relevant for old data sets?

The only real difference is:
A: observations refer to media ID.
B: sequence table exists. Observations refer to sequence ID, sequences refer to media.csv.
C: observations refer to sequence ID, which directly refers to deployments.

Would it be possible to set a flag in the project metadata as to which case it is (and thus, which key to use)?

Scope for automation?

if sequences lump media items based on sequence_interval, and sequence_interval is a user-defined time difference between media items, then sequences can be assigned automatically using the mediaID and media timestamps. I imagine a simple function that takes deployment, media and observation csvs as input, the user defines sequence_interval, and the function calculates time differences between media items and automatically assigns media items to sequences.
That would allow turning Option A into Option B automatically (see above)
I don't know if the opposite (B into A) could be automated also, but I guess so
Option C can't be turned into A or B since it doesn't have files.
The comment above doesn't answer how to store the information in the standard. I'm just saying it should be possible to partly automate creation of sequence assignment based on image assignment.
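The automation described above could look roughly like the following Python sketch. It assumes ISO 8601 timestamps, groups per deployment, and treats a gap of at most sequence_interval seconds as belonging to the same sequence; the function name assign_sequences and the generated "seqN" identifiers are hypothetical.

```python
from datetime import datetime

# Media rows from the example tables (three images, one second apart).
MEDIA = [
    {"mediaID": "med1", "deploymentID": "dep1", "timestamp": "2020-01-01T00:00:00"},
    {"mediaID": "med2", "deploymentID": "dep1", "timestamp": "2020-01-01T00:00:01"},
    {"mediaID": "med3", "deploymentID": "dep1", "timestamp": "2020-01-01T00:00:02"},
]


def assign_sequences(media, sequence_interval):
    """Map each mediaID to a generated sequence identifier.

    A new sequence starts when the deployment changes or when the gap to the
    previous media file exceeds sequence_interval (in seconds).
    """
    ordered = sorted(
        media,
        key=lambda m: (m["deploymentID"], datetime.fromisoformat(m["timestamp"])),
    )
    assigned = {}
    counter = 0
    previous = None  # (deploymentID, timestamp) of the previous media file
    for row in ordered:
        timestamp = datetime.fromisoformat(row["timestamp"])
        same_deployment = previous is not None and previous[0] == row["deploymentID"]
        if not same_deployment or (timestamp - previous[1]).total_seconds() > sequence_interval:
            counter += 1
        assigned[row["mediaID"]] = f"seq{counter}"
        previous = (row["deploymentID"], timestamp)
    return assigned


print(assign_sequences(MEDIA, sequence_interval=1))
print(assign_sequences(MEDIA, sequence_interval=0))
```

With an interval of 1 second the three example images end up in one sequence; with an interval of 0 each image becomes its own sequence, which is the "silly but compatible" variant of option A.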

Videos

The points above are for images only. Video support in this scheme may lead to additional complications:

videos are sequences of images already, hence the definition of a sequence as a "group of media files" would not work for videos.
say a video is 30 seconds long. That is a 30-second sequence in itself, which might be longer than the user-defined "sequence interval". How would the definition of sequences deal with the duration of videos? Cut it into pieces? Or violate sequence_interval?
consider two videos of 30 sec each that were taken 1 second apart. Their timestamps are 31 seconds apart, even though the difference between end of the first and start of the second video is only 1 second. So if the user-defined sequence_interval is e.g. 2 seconds, would the two videos be one or two sequences?
would observations /annotation be specific to the entire video (which is a sequence in my opinion), or to individual frames (or short sections of the video), akin to image-based annotation?
timestamps in video metadata are not standardized, which is a headache in itself, but not an issue for the standard I suppose?
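The two-video question above comes down to which gap is compared with sequence_interval. A small Python sketch of the arithmetic, with times expressed as seconds since the first video started (the function name and the measure_from_end flag are assumptions, not an existing rule in the standard):

```python
def same_sequence(first_start, first_duration, second_start,
                  sequence_interval, measure_from_end):
    """Decide whether two successive videos fall in one sequence.

    measure_from_end=False compares start timestamps (as done for images);
    measure_from_end=True compares the end of the first video with the
    start of the second.
    """
    reference = first_start + first_duration if measure_from_end else first_start
    return (second_start - reference) <= sequence_interval


# Two 30-second videos taken 1 second apart, so their timestamps differ by 31 s.
first_start, first_duration, second_start = 0, 30, 31

print(same_sequence(first_start, first_duration, second_start, 2, measure_from_end=False))
print(same_sequence(first_start, first_duration, second_start, 2, measure_from_end=True))
```

With a sequence_interval of 2 seconds, the start-to-start rule yields two sequences and the end-to-start rule yields one, so the definition would need to state which rule applies to videos.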

Suggestion

I suggest having a look at the database structure of digiKam for inspiration. I find it very clear, logical and extensible, but different from the current Camtrap DP scheme. If you have digiKam installed you can open its database in R with:

camtrapR:::accessDigiKamDatabase(db_directory = "C:/Users/YOURUSERNAME/Pictures", 
                              db_filename = "digikam4.db")

In short, it contains 5 items:

"AlbumRoots" - absolute paths of albums (image collections)
"Albums" - all directories within AlbumRoots, and how they relate to AlbumRoots
"Images" - all files (including videos) and how they relate to Albums
"Tags" - lists all available image tags, and their hierarchy (which allows nested tags)
"ImageTags" - assignment of tags to images

This is the content of each of these items as used by digiKam (not all of which would be needed for camera trapping data):

$AlbumRoots
[1] "id" "label" "status" "type" "identifier" "specificPath"

$Albums
[1] "id" "albumRoot" "relativePath" "date" "caption" "collection" "icon"

$Images
[1] "id" "album" "name" "status" "category" "modificationDate" "fileSize" "uniqueHash" "manualOrder"

$Tags
[1] "id" "pid" "name" "icon" "iconkde"

$ImageTags
[1] "imageid" "tagid"

This scheme can be expanded nicely, e.g. with a separate table for sequences (which assigns sequences to the file ids in the "Images" table, and can maybe be created automatically as mentioned above). This would allow easy gathering of image tags (species IDs etc) and image information (timestamps etc) for sequences.

Future proofing for deep learning

It would also allow easy linking to AI / deep learning methods, e.g. with a separate table containing bounding box coordinates for object detection. This would work both for model training and model deployment, and can maybe be based on the COCO camera traps format. It would also remove the need to crop / duplicate images.

Then there can be another table containing the labels and confidence values for these bounding boxes. For model training this second table only needs one label, for predictions it can either contain the top label and probability only, or top k labels, or all labels with their probabilities.

Also, all these deep learning methods for image classification / object detection that I'm aware of use images, not sequences. Sequences can actually be harmful in this respect, especially for image classification (when the animal walked out of the frame during the sequence, but the entire sequence is labelled as a species). In object detection, bounding boxes for sequences also don't make sense. They need to be image-specific. *

* EDIT: COCO camera trap format allows both image and sequence-specific bounding boxes, which may not be precise at image-level (see link above). I find the statement that 'sequences are the "atom of interest" in most ecological applications' questionable though.

Video annotation at the file level should be no different than image annotation. I don't know how to annotate at the frame level.

peterdesmet commented 2 years ago

Thanks @tucotuco and @jniedballa! I had some time to digest this information and discussed it with @damianooldoni. We think the following suggestion would be a model that answers the issues. It will not solve - but can represent - the fact that some systems make observations at the level of "sequences/groups of images" (which restricts creating smaller events at the analysis stage).

Suggested change 3: 4th table, between observations and media

Add a 4th table that links observations to media. We suggest the name ~evidence~ mediaGroup. For image-based observations, it will contain 1-to-1 relations, for sequence-based observations, it will contain 1-to-many relations.
The mediaGroup table will always be present, so joins can always be made the same way: deployments -> media -> mediaGroup -> observations (+ group by). No conditional joins.
The biggest downside is that this table is superfluous for image-based observations (because it will only contain 1-to-1 relations). On the production side however, it is not that hard to create this table, since identifiers can be reused, e.g. populating mediaGroupID and mediaID with the same identifiers (see examples below). On the user side, it simplifies joins and allows a single model to represent different use cases. It also avoids the "paint yourself into a corner" problem @tucotuco pointed out with the more succinct representation in suggested change 2.
The mediaGroup table can potentially also represent parts of media files, e.g. bounding boxes or durations (see examples below). ~This is why we prefer the name evidence over e.g. mediaGroup.~ That conflicts somewhat with the name mediaGroup, but I find it still a more intuitive name.
Sequences are not considered media files.
The term sequence is avoided altogether, because it has different meanings. Here we use mediaGroup as the group, media file or part of media file that was used as the basis for an observation.
A column level (see below) could be added (with a controlled vocabulary) to more easily filter certain observations. For easier discovery, the metadata term classificationLevel could be updated to contain a list of all the levels a dataset contains.

Example:

obs1, obs2, obs3 are image-based observations. In med3 no animal was seen.
obs4 is a group-based observation. Media files med1, med2, med3 were assessed as a whole (a disadvantage for later analyses, but often occurring).
obs5 is made on a part of med1, i.e. a specific bounding box. It is considered a separate mediaGroup.
obs6 is an observation based on a part of a video, i.e. a specific duration with start and end timestamp.

media.csv
mediaID | deploymentID | timestamp           | filePath
------- | ------------ | ------------------- | --------
med1    | dep1         | 2020-01-01T00:00:00 | med1.jpg
med2    | dep1         | 2020-01-01T00:00:01 | med2.jpg
med3    | dep1         | 2020-01-01T00:00:02 | med3.jpg
med4    | dep1         | 2020-01-04T08:00:00 | med4.mov

mediagroups.csv
mediaGroupID | mediaID | level    | boundingBox        | timeRange
------------ | ------- | -------- | ------------------ | ---------
med1         | med1    | file     |                    |  
med2         | med2    | file     |                    |  
med3         | med3    | file     |                    | 
seq1         | med1    | group    |                    | 
seq1         | med2    | group    |                    | 
seq1         | med3    | group    |                    | 
bbox1        | med1    | bbox     | [x,y,width,height] | 
duration1    | med4    | duration |                    | start/end

observations.csv
observationID | mediaGroupID | observationType | scientificName | count
------------- | ------------ | --------------- | -------------- | -----
obs1          | med1         | animal          | Sus scrofa     | 1
obs2          | med2         | animal          | Sus scrofa     | 1
obs3          | med3         | blank           | NULL           | NULL
obs4          | seq1         | animal          | Sus scrofa     | 1
obs5          | bbox1        | animal          | Sus scrofa     | 1
obs6          | duration1    | animal          | Vulpes vulpes  | 1
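
To illustrate the "no conditional joins" point, here is a minimal sketch of how a consumer could resolve every observation to its media files through the mediaGroup table, regardless of level. It uses only the example rows above (with the boundingBox and timeRange columns left out for brevity); the function and variable names are made up for this illustration and are not part of Camtrap DP.

```python
import csv
import io
from collections import defaultdict

# Example rows copied from the tables above (comma-separated for parsing).
MEDIA_CSV = """mediaID,deploymentID,timestamp,filePath
med1,dep1,2020-01-01T00:00:00,med1.jpg
med2,dep1,2020-01-01T00:00:01,med2.jpg
med3,dep1,2020-01-01T00:00:02,med3.jpg
med4,dep1,2020-01-04T08:00:00,med4.mov
"""

MEDIAGROUPS_CSV = """mediaGroupID,mediaID,level
med1,med1,file
med2,med2,file
med3,med3,file
seq1,med1,group
seq1,med2,group
seq1,med3,group
bbox1,med1,bbox
duration1,med4,duration
"""

OBSERVATIONS_CSV = """observationID,mediaGroupID,observationType,scientificName,count
obs1,med1,animal,Sus scrofa,1
obs2,med2,animal,Sus scrofa,1
obs3,med3,blank,NULL,NULL
obs4,seq1,animal,Sus scrofa,1
obs5,bbox1,animal,Sus scrofa,1
obs6,duration1,animal,Vulpes vulpes,1
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def media_per_observation(media, mediagroups, observations):
    """Map each observationID to the sorted file paths it is based on."""
    file_path = {m["mediaID"]: m["filePath"] for m in media}
    members = defaultdict(list)
    for row in mediagroups:
        members[row["mediaGroupID"]].append(file_path[row["mediaID"]])
    # The same lookup works for file, group, bbox and duration levels.
    return {
        o["observationID"]: sorted(members[o["mediaGroupID"]])
        for o in observations
    }


files_by_observation = media_per_observation(
    read_rows(MEDIA_CSV), read_rows(MEDIAGROUPS_CSV), read_rows(OBSERVATIONS_CSV)
)
print(files_by_observation["obs4"])
```

The image-based observations resolve to one file each, while obs4 resolves to all three images of seq1, without any branching on the level column.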

@jniedballa sequence_interval is currently saved in the project metadata. But maybe we should allow more flexible ways to indicate how "mediaGroups" were created.

danstowell commented 2 years ago

@peterdesmet as a relative outsider I like the look of this new "suggested change 3" better than previous ones. It seems correct to me that sequences are not considered media files.

Your bounding box example is clear; I can see that the format also allows for an observation which is based on a bounding-box that moves/changes shape over the duration of a sequence (this is one "tricky case" we discuss sometimes). But then would the level be bbox or group? My solution to that would be to forget having bbox as an explicit type: it can be implicit from the fact that the boundingBox column is non-null. (I'm dubious about the need for the level column at all, but I presume you're suggesting it for ease of data consumption.)
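
As a rough sketch of what "implicit" could look like for a data consumer: instead of a single level value, one can derive independent flags from the rows of a mediaGroup, so that a bounding box tracked over several files is simply "multi-file and has a bounding box". The function name and the pixel values below are hypothetical, chosen only for illustration.

```python
def describe_media_group(rows):
    """Derive independent properties from the rows of one mediaGroupID.

    Each row is a dict with the keys mediaID, boundingBox and timeRange;
    empty strings mean "not set".
    """
    return {
        "spans_multiple_files": len({r["mediaID"] for r in rows}) > 1,
        "has_bounding_box": any(r["boundingBox"] for r in rows),
        "has_time_range": any(r["timeRange"] for r in rows),
    }


# Hypothetical mediaGroup: one object tracked over two images,
# with a different bounding box in each image.
tracked_object = [
    {"mediaID": "med1", "boundingBox": "[10,20,30,40]", "timeRange": ""},
    {"mediaID": "med2", "boundingBox": "[12,22,30,40]", "timeRange": ""},
]

print(describe_media_group(tracked_object))
```

With flags like these, the "bbox or group?" question does not need an answer: both properties hold at the same time.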

peterdesmet commented 2 years ago

I'm dubious about the need for the level column at all, but I presume you're suggesting it for ease of data consumption.

Yes indeed. It doesn’t necessarily need to be there.

peterdesmet commented 2 years ago

Alternative name for evidence: observationUnit.

peterdesmet commented 2 years ago

@danstowell could we consider that 4th table a "region of interest" (Section 7.11 of https://ac.tdwg.org/termlist/)?

Regions of Interest (ROI) designate specific parts of media items.

Could a region of interest also be larger than a single image file?

danstowell commented 2 years ago

@danstowell could we consider that 4th table a "region of interest" (Section 7.11 of https://ac.tdwg.org/termlist/)?

Regions of Interest (ROI) designate specific parts of media items.

Could a region of interest also be larger than a single image file?

We always intended that an ROI could cover multiple frames, but we have not worked out the details. In practice I think the AC definition of ROI is all about a hyper-rectangular box (e.g. imagine a box confined in the x, y, z and time axes), whereas what's nice about your proposal is that an observation is composed of a sequence of different* ROIs, one per frame. A sequence of different ROIs is not a hyper-rectangular box. Thus: I think the 4th table is not equivalent to an ROI.

I would say that your columns timeRange and boundingbox are closely tied to AC's notion of ROI.

(...or an arc)

danstowell commented 2 years ago

FWIW I'm OK with mediaGroups. (I prefer it over observationUnit)

kbubnicki commented 2 years ago

Hi all, and sorry for this late feedback! Great discussion! I have spent some time in recent days thinking about the last proposal and had a meeting with @peterdesmet this morning. Here is the outcome; below you will find two new proposals that (hopefully) still add something to our discussion:

Suggested change 4: 4 tables (similar to the Suggested change 3 with some modifications)

media.csv
| mediaID     | deploymentID | timestamp           | filePath |
|-------------|--------------|---------------------|----------|
| med1        | dep1         | 2020-01-01T00:00:00 | med1.jpg |
| med2        | dep1         | 2020-01-01T00:00:01 | med2.jpg |
| med3        | dep1         | 2020-01-01T00:00:02 | med3.jpg |
| med4        | dep1         | 2020-01-04T08:00:00 | med4.mov |

mediagroups.csv
| mediaGroupID | mediaID |
|--------------|---------|
| med1         | med1    |
| med2         | med2    |
| med3         | med3    |
| med4         | med4    |
| seq1         | med1    |
| seq1         | med2    |
| seq1         | med3    |

observations.csv
| observationID | mediaGroupID | observationLevel | observationType | scientificName | count | individualID | boundingBox                                     | timeRange   |
|---------------|--------------|------------------|-----------------|----------------|-------|--------------|-------------------------------------------------|-------------|
| obs1          | med1         | file             | animal          | Sus scrofa     |     1 |              |                                                 |             |
| obs2          | med2         | file             | animal          | Sus scrofa     |     2 |              | [[x1,y1,width1,height1],[x2,y2,width2,height2]] |             |
| obs2a         | med2         | file             | animal          | Sus scrofa     |     1 | ind1         | [[x1,y1,width1,height1],]                       |             |
| obs2b         | med2         | file             | animal          | Sus scrofa     |     1 | ind2         | [[x2,y2,width2,height2],]                       |             |
| obs3          | med3         | file             | blank           | NULL           |  NULL |              |                                                 |             |
| obs4          | seq1         | sequence         | animal          | Sus scrofa     |     2 |              |                                                 |             |
| obs5          | med4         | file             | animal          | Sus scrofa     |     1 |              | [[x,y,width,height],]                           | start1/end1 |
| obs6          | med4         | file             | animal          | Sus scrofa     |     1 |              | [[x,y,width,height],]                           | start2/end2 |
| obs7          | med4         | file             | animal          | Sus scrofa     |     1 |              |                                                 | start/end   |
| obs8          | med4         | file             | animal          | Sus scrofa     |     1 |              |                                                 |             |

We keep the mediagroups.csv table. The advantages are that we can mix sequence- and file-based observations in one package and that this table can be easily extended when needed in the future.
The attributes boundingBox (spatial window; now a 2D array) and timeRange (temporal window) are moved to the observations.csv table. I find both attributes more related to the observation than to the media-grouping process. Think about two-stage observation process: i. detection of animals (or other objects such as humans, vehicles etc.) in space (boundingBox) and/or time (timeRange) -> ii. classification (observationType, scientificName etc.). Another advantage of this change is that the mediagroups.csv table will be more "compressed", as it will not have rows for each single detected object (bounding box) and/or video frame. Imagine, e.g., 10k videos * 60 1s frames classified by AI, each containing 1 to 10 wild boar. In the previous proposal both the mediagroups.csv and observations.csv tables would quickly grow enormously in such scenarios.
In the observations.csv table there is a new attribute observationLevel - this is just for user's convenience (e.g. quick selection of file-based observations only).
This proposal (as well as the next one) supports the following cases:
a) file-level observations -> obs1 (image) and obs8 (video)
b) file-level & object-based image observations -> obs2 (multiple objects of the same type on 1 image), obs2a & obs2b (different objects on 1 image, separate rows)
c) file-level & object-based video observations -> obs5 & obs6 (same or different objects detected on separate video frames; both spatial and temporal window defined), obs7 (only temporal window of an observation defined); please note that a similar logic can be applied to audio files
d) sequence-based observations -> obs4
Maybe a trivial comment, but an interesting side-effect of having mediaGroupID for file-based observations is that one can define mediaGroupID for pairs of images from 2-cameras deployments e.g. when monitoring lynx, tigers or some other "marked" animal species, where both cameras typically record media of the same individual (e.g. left & right side of an animal passing a forest path):

| mediaID     | deploymentID | timestamp           | filePath |
|-------------|--------------|---------------------|----------|
| med1a       | dep1a        | 2020-01-01T00:00:00 | med1.jpg |
| med1b       | dep1b        | 2020-01-01T00:00:00 | med1.jpg |

| mediaGroupID | mediaID |
|--------------|---------|
| med1         | med1a   |
| med1         | med1b   |

Suggested change 5: 3 tables (similar to the original model with some modifications; developed interactively during the meeting with Peter)

Sequence-based example

media.csv
| mediaID | mediaGroupID | deploymentID | timestamp           | filePath |
|---------|--------------|--------------|---------------------|----------|
| med1    | seq1         | dep1         | 2020-01-01T00:00:00 | med1.jpg |
| med2    | seq1         | dep1         | 2020-01-01T00:00:01 | med2.jpg |
| med3    | seq1         | dep1         | 2020-01-01T00:00:02 | med3.jpg |
| med4    | seq2         | dep1         | 2020-01-04T08:00:00 | med4.mov |

observations.csv
| observationID | mediaGroupID | observationType | scientificName | count | individualID | boundingBox | timeRange |
|---------------|--------------|-----------------|----------------|-------|--------------|-------------|-----------|
| obs1          | seq1         | animal          | Sus scrofa     |     2 |              |             |           |
| obs2          | seq2         | animal          | Sus scrofa     |     1 |              |             |           |

File-based example

media.csv
| mediaID | mediaGroupID | deploymentID | timestamp           | filePath |
|---------|--------------|--------------|---------------------|----------|
| med1    | med1         | dep1         | 2020-01-01T00:00:00 | med1.jpg |
| med2    | med2         | dep1         | 2020-01-01T00:00:01 | med2.jpg |
| med3    | med3         | dep1         | 2020-01-01T00:00:02 | med3.jpg |
| med4    | med4         | dep1         | 2020-01-04T08:00:00 | med4.mov |

observations.csv
| observationID | mediaGroupID | observationType | scientificName | count | individualID | boundingBox                                     | timeRange   |
|---------------|--------------|-----------------|----------------|-------|--------------|-------------------------------------------------|-------------|
| obs1          | med1         | animal          | Sus scrofa     |     1 |              | [[x,y,width,height],]                           |             |
| obs2          | med2         | animal          | Sus scrofa     |     2 |              | [[x1,y1,width1,height1],[x2,y2,width2,height2]] |             |
| obs2a         | med2         | animal          | Sus scrofa     |     1 | ind1         | [[x1,y1,width1,height1],]                       |             |
| obs2b         | med2         | animal          | Sus scrofa     |     1 | ind2         | [[x2,y2,width2,height2],]                       |             |
| obs3          | med3         | blank           | NULL           |  NULL |              |                                                 |             |
| obs4          | med4         | animal          | Sus scrofa     |     1 |              | [[x,y,width,height],]                           | start1/end1 |
| obs5          | med4         | animal          | Sus scrofa     |     1 |              | [[x,y,width,height],]                           | start2/end2 |
| obs6          | med4         | animal          | Sus scrofa     |     1 |              |                                                 | start/end   |
| obs7          | med4         | animal          | Sus scrofa     |     1 |              |                                                 |             |

1) There is no mediagroups.csv table. Basically, we go back to the original model (v0.1.7, https://github.com/tdwg/camtrap-dp/tree/0.1.7), but there are some critical differences.
2) There are new attributes boundingBox and timeRange in the observations.csv table (described above).
3) There is no deploymentID in the observations.csv table, which makes the entire model more linear.
4) There is a new attribute mediaGroupID in the media.csv table.
5) Camtrap DP packages should be either file-based or sequence-based (as indicated in the package-level metadata). This is not necessarily a limitation of this proposal; Camtrap DP has been designed as a standard for data exchange/publishing at the level of a single camera trapping project, where people typically do not mix both annotation approaches.
6) The biggest advantage of this proposal, as I see it, is the simplicity of the model (no 4th table) and its human-user-friendliness. The flexibility is also still there: I believe most of the use cases (as listed above) are covered by this design.
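
Since observations.csv no longer carries deploymentID in this proposal, a consumer would reach the deployment through media.csv. A minimal sketch using the sequence-based example above; the function name is made up, and the rule that a mediaGroupID must resolve to exactly one deployment is my assumption for this illustration, not something stated in the proposal.

```python
import csv
import io

# Sequence-based example rows from above (comma-separated for parsing).
MEDIA_CSV = """mediaID,mediaGroupID,deploymentID,timestamp,filePath
med1,seq1,dep1,2020-01-01T00:00:00,med1.jpg
med2,seq1,dep1,2020-01-01T00:00:01,med2.jpg
med3,seq1,dep1,2020-01-01T00:00:02,med3.jpg
med4,seq2,dep1,2020-01-04T08:00:00,med4.mov
"""

OBSERVATIONS_CSV = """observationID,mediaGroupID,observationType,scientificName,count
obs1,seq1,animal,Sus scrofa,2
obs2,seq2,animal,Sus scrofa,1
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def deployment_per_observation(media, observations):
    """Resolve observation -> mediaGroupID -> deploymentID via media rows."""
    deployments_by_group = {}
    for m in media:
        deployments_by_group.setdefault(m["mediaGroupID"], set()).add(
            m["deploymentID"]
        )
    resolved = {}
    for o in observations:
        deployments = deployments_by_group.get(o["mediaGroupID"], set())
        # Assumption: a mediaGroup never spans deployments and is never empty.
        if len(deployments) != 1:
            raise ValueError(
                f"{o['observationID']}: expected 1 deployment, got {len(deployments)}"
            )
        resolved[o["observationID"]] = next(iter(deployments))
    return resolved


deployment_by_observation = deployment_per_observation(
    read_rows(MEDIA_CSV), read_rows(OBSERVATIONS_CSV)
)
print(deployment_by_observation)
```

The same function works unchanged on the file-based example, because there mediaGroupID simply repeats mediaID.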

@peterdesmet Please edit this comment if you find that I have missed something (or if something is not clear enough)!

Best, K

peterdesmet commented 2 years ago

Thanks @kbubnicki, great summary of our discussion. I just want to add that in suggestion 4 the number of records in mediagroups is always going to be the same as there are records in media (given you never mix a file and sequence based approach, which is a good limitation in my opinion). Knowing that, we can simplify things, which resulted in suggestion 5:

merge media with mediagroups into one table
no need for an observationLevel term, since it will be the same for all records within a dataset and hence should be a dataset metadata property.

I’m all in favour of suggestion 5. Feedback welcome, especially from those that commented already @tucotuco @danstowell @jniedballa …

danstowell commented 2 years ago

I'm not so excited by the idea of moving the bboxes into the observations table, for the reason that it then fails to support one of the important use cases we have here: objects detected in image-sequences, with a different bbox in each image, and then one overall identification applied to that sequence of bboxes. This is a real example from our insect-cameras, and probably occurs in plenty of other systems with bboxes tracked over time.

A workaround would be to repeat multiple rows in observations for each frame in this sequence, but that's tricky because we then wouldn't want users to sum the count column and over-count.

I can't comment on the file-size implications.

You write "Think about two-stage observation process" (detect, then identify) but to me that doesn't motivate the change.

A separate and minor comment: I suggest that the arrays-of-bboxes format might be a bit troublesome for data consumers - it's starting to look like structured data inside a CSV cell.

tucotuco commented 2 years ago

I don't have a lot of time to comment in detail (i.e., offer alternative solutions) right now.

Suggested change 4 does not have anything to offer that suggested change 5 doesn't have if mediagroups do not have distinct attributes.
I agree with @danstowell that neither of these suggested changes allows an observation to be derived from a media in a mediagroup that has more than one media. Changing the column mediaGroupID in observations to point to media rather than mediagroups would take care of that problem, but would not make it possible to allow observations on mediagroups directly. Solving both simultaneously takes me back to suggested change 1 where photos, photo sequences, and videos are all just media instances, distinguished by a type.
To avoid confusion, I would make sure the ids for mediagroups do not overlap with the ids for media (e.g., put medgr1 in place of med1 when med1 refers to a mediagroup).

tucotuco commented 2 years ago

Just had a chat with @peterdesmet about my most recent comments. If it will be a rule that data sets must be either of observations from media or observations from mediagroups, but never both, then my second concern doesn't really apply. Similarly, if data sets are never mixed, then the mediaID could act as a mediaGroupID for the sake of practicality (not having to mint another identifier). I cringe in terms of semantics (it was rejected that mediagroups were just a type of media), but that shouldn't matter until/unless these data start to be linked semantically.

ben-norton commented 2 years ago

I think the stipulation that a dataset is either sequence-based (observation - mediaGroup) or image-based (observation - media) is a fair stipulation that solves a number of problems. Since most datasets don't utilize multiple observation techniques (e.g., expert identification and computer vision model), adoption shouldn't be overly problematic for most providers. Several projects arrived at this same conclusion (after months of debate). To my knowledge, field testing this solution hasn't resulted in any significant problems. One important note: aside from the logistics and organization of the model, the impact of this resides in the analysis. To combine sequence- and image-based observations for modelling purposes, the calculation technique for the number of unique individuals over a given period of time is critical. The irony is that the image-based observations will be grouped over a specific time interval for modelling purposes. In other words, it's all sequences in the end.
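
To make the last point concrete, here is a minimal sketch of how image-based records could be grouped into events at analysis time, following the sequence-interval definition from the top of this issue (maximum number of seconds between timestamps of successive media files). The function name and the 120-second interval are hypothetical; timestamps are assumed to belong to a single deployment.

```python
from datetime import datetime, timedelta


def assign_events(timestamps, interval_seconds):
    """Return an event index per timestamp, in the same order as the input.

    A new event starts when the gap to the previous media file (in time
    order) is larger than interval_seconds.
    """
    parsed = [datetime.fromisoformat(t) for t in timestamps]
    order = sorted(range(len(parsed)), key=lambda i: parsed[i])
    events = [0] * len(parsed)
    current_event = -1
    previous = None
    for i in order:
        gap_too_large = previous is not None and (
            parsed[i] - previous > timedelta(seconds=interval_seconds)
        )
        if previous is None or gap_too_large:
            current_event += 1
        events[i] = current_event
        previous = parsed[i]
    return events


# Timestamps taken from the media.csv examples earlier in this issue.
stamps = [
    "2020-01-01T00:00:00",
    "2020-01-01T00:00:01",
    "2020-01-01T00:00:02",
    "2020-01-04T08:00:00",
]
print(assign_events(stamps, 120))
```

With image-based observations the interval can be changed freely and the grouping recomputed; with sequence-based observations the interval chosen by the data publisher is a lower bound.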

kbubnicki commented 2 years ago

A workaround would be to repeat multiple rows in observations for each frame in this sequence, but that's tricky because we then wouldn't want users to sum the count column and over-count.

@danstowell That's why we have this field in Camtrap DP: https://tdwg.github.io/camtrap-dp/data/#observations.countnew

We use this field when annotating our camera trap records to track information about a "real" group size of animals staying for a while in front of a camera trap (or just passing it by). This applies to image-level annotation and prevents over-counting when aggregating data for analysis.
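
A small illustration of the over-counting concern and how such a field can avoid it. The column name countNew is taken from the URL anchor above; its semantics here (individuals not already counted in an earlier image of the same passage) and all values are assumptions made for this sketch, not a quote of the specification.

```python
# Hypothetical image-level annotations of one passage of wild boar:
# the same animals stay in front of the camera over three images.
rows = [
    {"mediaID": "med1", "count": 1, "countNew": 1},  # first animal appears
    {"mediaID": "med2", "count": 2, "countNew": 1},  # a second animal joins
    {"mediaID": "med3", "count": 2, "countNew": 0},  # same two animals
]

# Summing count adds up the animals visible per image, so the same
# individuals are counted repeatedly.
naive_total = sum(r["count"] for r in rows)

# Summing countNew adds each individual only once (under the assumption above).
group_size = sum(r["countNew"] for r in rows)

print(naive_total, group_size)
```

Under these assumptions the naive sum gives 5, while the sum of countNew gives the actual group size of 2.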

peterdesmet commented 2 years ago

Hi all, I picked up this dormant issue with John Wieczorek (@tucotuco) in an effort to reach a recommendation. We mainly discussed the pros and cons of two of the main proposals suggested above:

mediaGroupID (Suggested change 5: https://github.com/tdwg/camtrap-dp/issues/203#issuecomment-1106706615)
sequences as media (Suggested change 1: https://github.com/tdwg/camtrap-dp/issues/203#issue-1139131068)

I also compared how one would query data using either model, at https://github.com/peterdesmet/camtrap-dp-query-test (repository likely to be deleted at some point).

Recommendation

Our conclusion is that the mediaGroupID approach (Suggested change 5):

Is stricter (more validation options, dataset is either sequence or image-based)
Is easier to explain to the publisher (fewer options or decisions to take)
Is simpler to query for the user

And thus a reasonable simplification of the model. It is an improvement over the current model (where information is needlessly repeated) and plays well with the unified common model. It allows bounding boxes to be expressed (at the level of observations). Reading the comments above, this proposal is something that @kbubnicki @ben-norton @jniedballa and now @tucotuco could get on board with. I will create a pull request with the suggested changes. Thank you all for your patience and for participating in this discussion!

@danstowell you liked the possibilities of the 4th table approach - maybe especially as a model for Audubon Core - but for Camtrap DP we believe it would needlessly complicate things as an exchange format. Hope you understand.

Rename to eventID

One change we suggest is to rename mediaGroupID to eventID. As in, this is the event the data publisher chose to group their observations by. For image-based (recommended approach), the selected events are the duration of the media file (image or video); for sequence-based, the selected events are sequences. In software you can always create larger events (by grouping), but never smaller events.

Image-based (if we reuse identifiers):

media.csv
mediaID | eventID
------- | -------
med1    | med1
med2    | med2

observations.csv
observationID | eventID
------------- | -------
obs1          | med1
obs2          | med2

Sequence-based:

media.csv
mediaID | eventID
------- | -------
med1    | seq1
med2    | seq1

observations.csv
observationID | eventID
------------- | -------
obs1          | seq1

Advantage over mediaGroupID as a name: eventID is more neutral in observations.csv and doesn't imply the media will be grouped (i.e. they aren't in file-based observations)
Advantage over sequenceID as a name: sequenceID is a confusing term in observations.csv for file-based observations.
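
A minimal sketch of why a shared eventID keeps querying simple: the same lookup works for both the image-based and the sequence-based example above, with no branching on the annotation approach. The function name is made up for this illustration.

```python
def media_per_observation(media, observations):
    """Map each observationID to the mediaIDs that share its eventID."""
    media_by_event = {}
    for m in media:
        media_by_event.setdefault(m["eventID"], []).append(m["mediaID"])
    return {
        o["observationID"]: media_by_event.get(o["eventID"], [])
        for o in observations
    }


# Image-based example (identifiers reused as eventID).
image_media = [
    {"mediaID": "med1", "eventID": "med1"},
    {"mediaID": "med2", "eventID": "med2"},
]
image_observations = [
    {"observationID": "obs1", "eventID": "med1"},
    {"observationID": "obs2", "eventID": "med2"},
]

# Sequence-based example.
sequence_media = [
    {"mediaID": "med1", "eventID": "seq1"},
    {"mediaID": "med2", "eventID": "seq1"},
]
sequence_observations = [{"observationID": "obs1", "eventID": "seq1"}]

print(media_per_observation(image_media, image_observations))
print(media_per_observation(sequence_media, sequence_observations))
```

In the image-based case every observation maps to a single file; in the sequence-based case obs1 maps to both files of seq1.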

peterdesmet commented 2 years ago

Quick update: we are still working on restructuring the model. The current approach is to abandon trying to capture image vs event-based annotation in a single observations table, but to work with an eventobservations and imageobservations table (in addition to a media and deployments table).

The main advantage is clarity: easier for the user to understand and easier for us to document. Additionally, it allows both approaches to be exported in a single package, e.g. AI image-level observations that underpin event-level consensus observations.

We are currently testing this approach and hammering out the details.

peterdesmet commented 1 year ago

The suggested change (splitting the observation table) has been implemented in #289. All who participated here are welcome to review the changes.

peterdesmet commented 1 year ago

Fixed in Camtrap DP 0.6 #297.

ben-norton commented 1 year ago

Congrats. That's a very challenging task.