pantheracorp / PantheraIDS_Issues

A repository for any issues and bugs related to PantheraIDS

Data Processing / Bulk Image Export - Missing images in image export (due to multiple rows with the same FileName_New) #152

Closed Shannon-Dubay closed 3 years ago

Shannon-Dubay commented 3 years ago

I'm working on v.2.21.886 beta and trying to export server-based images for S3_2014. The database table shows 120 spotted hyaenas, but when I export them I only get 118. The green message box at the top of the page says "Success! 120 images exported". Nothing is written in the log. I can load all 120 server-based images in the Classify module, so I don't suspect missing images. So far I have tried a few other exports (S3_2019 and S5_2014), which worked perfectly, so at this point I believe it is a database-specific issue.

Shannon-Dubay commented 3 years ago

I assumed this was a database-specific issue, but it appears not.

Jo posted the following on the forum: Using version 2.23.890, "Export" module, when exporting a selected species, it exports an incorrect number of images (S25_2016, Serval = 37 images exported, while the same survey pulled in "Tables" contains 39 images).

Again, when re-creating this issue on my side, I see that the green message at the top shows the correct number of images, however there are images missing in the actual export.

This case also used the images on the server, so perhaps there are images missing on the server? We could compare how many images are exported for these surveys using local images instead. The S25_2016 and S3_2014 local images are both on M1 (on campus, but one of the students could retrieve it for us next time they're on campus), if we want to compare outcomes, or the total number of local vs server images for these surveys.

Or: These databases are all older, perhaps they are formatted in a way that is causing this?

Shannon-Dubay commented 3 years ago

Information from Jo regarding the issue, this might be the answer:

This makes me think of that other issue we encountered with images being wrongly processed and renamed with the same "File Name New". I dug deeper, and indeed there are 2 images with the same name, which means that during the export one overwrites the other.

RossPitman commented 3 years ago

So this isn't an issue with the exporter then, right? Given Jo's latest info, images would overwrite each other if they have the same name.

Shannon-Dubay commented 3 years ago

I haven't checked for the survey I was struggling with, but yes, if Jo is right then it wouldn't be an exporter module issue.

Shannon-Dubay commented 3 years ago

Yes, I am experiencing the same for S3_2014: there are 2 images for S3Station30Camera1CAM442852014-03-3121-15-04.JPG (trigger ID: S3_20140329_20140512Station30CAM442853) and 2 for S3Station17Camera2CAM408772014-05-0301-34-29.JPG (trigger ID: S3_20140329_20140512Station17CAM40877315), which accounts for the discrepancy in the number of images exported.

How many other databases are impacted by this?! I suppose a check could be done by checking how many databases contain a "2" in the "ImageNumber" column of the trigger table? And how many other modules will need a similar fix to this? How do the Classify and Manual Identification modules work around this?
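A check along these lines could be sketched as follows. This is a minimal Python illustration, not IDS code: the rows are made up, and only the column names FileName_New and ImageNumber come from this thread. It counts rows that share a FileName_New instead of looking for a "2" in ImageNumber, since the shared filename is what causes the overwrite.

```python
from collections import Counter

# Hypothetical trigger-table rows; only the column names come from the thread.
trigger_rows = [
    {"FileName_New": "a.JPG", "ImageNumber": 1},
    {"FileName_New": "a.JPG", "ImageNumber": 2},
    {"FileName_New": "b.JPG", "ImageNumber": 1},
]


def duplicated_filenames(rows):
    """Return {filename: row count} for every FileName_New used by more than one row."""
    counts = Counter(row["FileName_New"] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}


dupes = duplicated_filenames(trigger_rows)

# Each extra row sharing a filename is one image that an export would overwrite.
expected_shortfall = sum(n - 1 for n in dupes.values())
print(dupes, expected_shortfall)
```

Running the same function over each database's trigger table would give a per-database count of how many exported images to expect to be "missing".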

RossPitman commented 3 years ago

We don't fix this, at least not with older databases (pre-2017). A lot of this is a result of legacy CameraBase, which couldn't handle images taken at the same time. It would collapse them into one.

RossPitman commented 3 years ago

We later implemented another fix that added a unique integer to the end of the triggerid, so some of the older databases don't have that.

Shannon-Dubay commented 3 years ago

OK, so there is no fix for these databases, and those images are lost? Is there anything specific I should report back to Jo?

RossPitman commented 3 years ago

Let's do a little more investigating, just to be sure. First, let's confirm that this is occurring only on older databases. Then let's see whether some of these databases actually show the issue that we suspect. Once we're happy with that, we tell Jo.

Shannon-Dubay commented 3 years ago

So it appears that sometime in either late 2018 or early 2019, adding the "ImageNumber" to the file name solved this problem. However, it appears that (I believe all) databases from 2018 and before will be impacted by this problem. I checked one database per year and the first one I checked had this issue. All databases from 2019 and 2020 were checked and appear to be fine due to the added suffix in the file name.

I apologize in advance for the following (it's all the same text copied and pasted with the details changed).

Checked:

  1. S7_2014, Honey Badger: 2 instances of 2 images with the same "FileName_New" and "Trigger ID" (although different "ImageNumber"s). The export module and tables module say there are 74 honey badger images, but only 72 images are exported.
  2. S15_2015, Honey Badger: 7 instances of 2 images with the same "FileName_New" and "Trigger ID" (although different "ImageNumber"s). The export module and tables module say there are 52 honey badger images, but only 45 images are exported.
  3. S16_2016, Vehicle at Station38: 1 instance of 2 images with the same "FileName_New" and "Trigger ID" (although different "ImageNumber"s). The export module and tables module say there are 180 vehicle images at Station38, but only 179 images are exported.
  4. S1_2017, Vehicle at Station2: 1 instance of 2 images with the same "FileName_New" and "Trigger ID" (although different "ImageNumber"s). The export module and tables module say there are 461 vehicle images at Station2, but only 460 images are exported.
  5. S2_2018, Human at Station26: 1 instance of 2 images with the same "FileName_New" and "Trigger ID" (although different "ImageNumber"s). The export module and tables module say there are 115 human images at Station26, but only 114 images are exported.

S6_2019, S11_2019, S4_2019, S3_2019, S4_2020, S6_2020 all have 0 instances of images with the same "FileName_New" - this is due to "ImageNumber" being added as suffixes to the end of the file name. Trigger IDs are still shared (doesn't appear to be a problem). Exporter works as intended.

RossPitman commented 3 years ago

Thanks for this!

PhilipFaure commented 3 years ago

There are duplicate filenames (FileName_New) in the database for (what appears to be) the same server image, meaning that on the server there is only one JPG file for two records in the database. I don't have the local images, so I can't determine whether there perhaps were two of the same images (for these database duplicates) in the original backed up image folders.

But given that these are all surveys done with PantheraCams (I think), which don't take any burst images (I believe), this may suggest that there is no data loss, since PantheraCams only take one photo per trigger? @RossPitman @Shannon-Dubay what do you think?

I have implemented a check in IDS to alert the user if there are discrepancies between the number of images on the server, compared to the number of unique filenames (FileName_New) in the database. Please see screenshot for example, but please note that in the screenshot, the duplicate image names are not all printed out properly, but this has been fixed to print all duplicated image names for the user to see/copy for further investigation. Also, the names of duplicate images are printed in the log file so that if the user closes the modalDialog window, they (when I say they, we all know I mean Shannon :P ) can always go back and find the image names for further database manipulations (without having to rerun the exporter).

[Screenshot: S7_2014_error]

Figure 1. modalDialog to alert users. The paragraph at the end has been changed to: "Duplicate images will be renamed with '_duplicated_img' appended to the filename. However, this may cause complications further on. All databases should be managed by the Conservation Science team. Please check the discrepancy between filenames in the database (i.e. FileName_New column in Trigger Table), and the filenames of the images on the server."
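The kind of comparison this check makes can be sketched as below. It is a minimal Python illustration under stated assumptions, not the IDS implementation: the function name `export_discrepancy`, the report keys, and the sample filenames are all hypothetical.

```python
from collections import Counter


def export_discrepancy(db_filenames, server_filenames):
    """Compare database rows, unique database filenames, and files on the server."""
    counts = Counter(db_filenames)
    server = set(server_filenames)
    return {
        "db_rows": len(db_filenames),
        "unique_db_filenames": len(counts),
        "server_images": len(server),
        # Filenames shared by more than one database row.
        "duplicated_in_db": sorted(n for n, c in counts.items() if c > 1),
        # Filenames the database expects but the server does not hold.
        "missing_on_server": sorted(set(counts) - server),
    }


report = export_discrepancy(
    ["a.JPG", "a.JPG", "b.JPG", "c.JPG"],  # hypothetical FileName_New values
    ["a.JPG", "b.JPG", "c.JPG"],           # hypothetical server listing
)
print(report)
```

Reporting the row count and the unique-filename count side by side separates the two possible causes: a gap between rows and unique filenames points at duplicated database records, while a gap between unique filenames and server images points at files that never reached the server.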

Continuing with the sites which Shannon checked (thanks for finding these Shannon!), I quickly looked for these duplicate image filenames (in the databases) and below is a list of these for S7_2014-Badger_Honey, and S15_2015-Badger_Honey (I did S16_2016-Vehicles as well...and that was a mistake, way too many images for my internet - so won't be reporting on that site).

S7_2014 - Honey Badger - dtbs$FileName_new

  1. S7Station2Camera2CAM407792014-08-03__22-06-03.JPG (only 1 img on server)
  2. S7Station2Camera1CAM441802014-08-03__22-05-18.JPG (only 1 img on server)

S15_2015 - Honey Badger - dtbs$FileName_new

  1. S15Station40Camera1CAM414102015-11-01__21-41-54.JPG (only 1 img on server)
  2. S15Station40Camera1CAM414102015-11-04__02-17-04.JPG (only 1 img on server)
  3. S15Station40Camera2CAM415322015-11-04__02-17-04.JPG (only 1 img on server)
  4. S15Station40Camera2CAM415322015-11-06__03-56-55.JPG (only 1 img on server)
  5. S15Station40Camera1CAM414102015-11-04__03-22-48.JPG (only 1 img on server)
  6. S15Station40Camera2CAM415322015-11-04__03-22-48.JPG (only 1 img on server)
  7. S15Station40Camera1CAM414102015-11-06__03-56-55.JPG (only 1 img on server)

S16_2016 - Vehicle - dtbs$FileName_new (Note - Choosing vehicle takes forever…)

  1. S16Station1Camera2CAM515192016-12-11__07-03-13.JPG (too many images for internet)
  2. S16Station35Camera1CAM521352016-12-11__08-35-47.JPG (too many imgs for internet)
  3. S16Station2Camera2CAM638012016-12-03__12-04-24.JPG (too many imgs for internet)
  4. S16Station38Camera2CAM516532016-12-11__07-13-47.JPG (too many imgs for internet)

All of the images listed above have duplicates in the databases; however, on the server there is only one image for each. I have implemented a clause which checks for duplicate filenames and, if TRUE, renames them with the suffixed string "_dplctd_img" (please see screenshot below). Thus, all the images in the database will be exported, and there won't be a discrepancy between the number of database records and the number of images actually exported. The user will then have to check these themselves to determine whether they are actually duplicates or not.

[Screenshot: S15_2015_dpctd]

Figure 2. Renamed duplicate images.
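The renaming idea can be sketched as follows. This is a hedged Python illustration, not the clause in IDS: `export_names` is a hypothetical helper, and the counter added for a third or later repeat is an assumption of this sketch (the thread only describes pairs).

```python
from collections import defaultdict
from pathlib import PurePath

SUFFIX = "_dplctd_img"  # suffix string quoted in the comment above


def export_names(filenames):
    """Keep the first use of each name; give later repeats the duplicate suffix."""
    seen = defaultdict(int)
    out = []
    for name in filenames:
        seen[name] += 1
        k = seen[name]
        if k == 1:
            out.append(name)
        else:
            p = PurePath(name)
            # Assumption: number the third and later repeats so they stay unique.
            extra = "" if k == 2 else str(k - 1)
            out.append(f"{p.stem}{SUFFIX}{extra}{p.suffix}")
    return out


names = export_names(["a.JPG", "a.JPG", "b.JPG", "a.JPG"])
print(names)
```

With this approach the number of exported files equals the number of database rows, at the cost of exporting the same server image more than once.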

Further investigation led me to find that in S15_2015 there are also images with incorrect filenames (please see screenshot below). There are 99 images on the server for S15_2015 that do not have the SiteID prefixed on the filename, i.e. images are named Station2__Camera1__2015-09-15__21-53-17(1).JPG, instead of S15_Station2__Camera1__2015-09-15__21-53-17(1).JPG. All 99 of these images seem to be duplicate images, as suggested by the number in brackets at the end of the filename. Similarly, there are 53 such images for S7_2014.

[Screenshot 2020-09-10 at 08 23 18 (2)]

Figure 3. S7_2014 incorrect filenames. This image is not a duplicate, only the filename is wrong on the server.

[Screenshot 2020-09-10 at 07 48 23 (2)]

Figure 4. S15_2015 incorrect filenames. None of these images are duplicates. There is a gap in the correctly named images on the server (window on the right), where these incorrectly named images (left window) should have been.

Perhaps we should implement a check within the data upload process to quickly scan the data and tell the user that some of the filenames are not correct. For example, we can str_split, take the first value of each string, and search for “S#_”. If a string doesn’t contain these characters, then IDS can alert them to it. Similarly, IDS can search for any other abnormal characters (e.g. brackets). Should we create these, and any other checks for data uploading?
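A rough sketch of such an upload check is shown below, in Python for illustration only. It uses a regular expression where the paragraph above suggests str_split, and the helper name, the patterns, and the problem labels are assumptions of this sketch. The two sample filenames are the ones quoted earlier in this thread.

```python
import re

SITE_PREFIX = re.compile(r"^S\d+_")     # "S#_" at the start of the filename
ODD_CHARS = re.compile(r"[()\[\]{}]")   # brackets of any kind


def filename_problems(name):
    """Return a list of problems found in one filename (empty if none)."""
    problems = []
    if not SITE_PREFIX.match(name):
        problems.append("missing SiteID prefix")
    if ODD_CHARS.search(name):
        problems.append("contains brackets")
    return problems


unprefixed = filename_problems("Station2__Camera1__2015-09-15__21-53-17(1).JPG")
prefixed = filename_problems("S15_Station2__Camera1__2015-09-15__21-53-17(1).JPG")
print(unprefixed, prefixed)
```

Returning a list of problems per file, instead of a single pass or fail flag, would let the upload dialog show the user exactly what to fix.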

The databases on the server should be checked by the Conservation Science/Leopard team for wrong image names (but they don't have access to view it on the server like we do). How can we shift these maintenance responsibilities back to them?

We need to get hold of the master HDD so that we can check whether there are any image duplicates which didn't make it onto the server. Shannon do you by chance have them? Or Lauren?

RossPitman commented 3 years ago

Hey @PhilipFaure thanks for this! Can we please change "Conservation Science" to "Data Science"?

Regarding incorrect image names: aren't those green highlighted images the image names prior to processing in IDS? That looks like the old format that camtrapR used. These images may be up on the server, but there should at least be the correctly named images too. Are we saying that the correctly named images aren't there?

PhilipFaure commented 3 years ago

Thanks @RossPitman ,

Okay cool, I was under the impression the leopard team would have to fix their data. Will change the name to Data Science.

That's correct, there are no correctly named versions of those images. In figure 4 you will notice that there is a gap in the server images between 09/15 and 09/22, which corresponds to all the images highlighted on the left (the wrongly named ones). I think that is the camtrapR format as well, now that you point it out, since it contains station and camera but doesn't contain the CameraID number. camtrapR also adds the brackets at the end for duplicate images, if I remember correctly.

RossPitman commented 3 years ago

Yep, leopard team should fix their own data, not us. But it'll depend on their access. I'm not prepared to give them full access to the cloud.

Shannon-Dubay commented 3 years ago

Some quick responses to the following (italics indicate text from Phil's post above).

The popup alert: _I have implemented a check in IDS to alert the user if there are discrepancies between the number of images on the server, compared to the number of unique filenames (FileName_New) in the database._

-> Doesn't this have to check for discrepancies between the number of images on the server and the total number of rows within the database for the specified export? By using the number of unique filenames in the database, aren't you filtering out any duplicate filenames (rows with the same filename)? Or is this check just meant to catch missing server images (perhaps an upload didn't complete or something)?

Duplicate filenames vs images on the server. Example: S15_2015 - Honey Badger - dtbs$FileName_new:

  1. S15Station40Camera1CAM414102015-11-01__21-41-54.JPG (only 1 img on server)
  2. S15Station40Camera1CAM414102015-11-04__02-17-04.JPG (only 1 img on server)
  3. S15Station40Camera2CAM415322015-11-04__02-17-04.JPG (only 1 img on server)
  4. S15Station40Camera2CAM415322015-11-06__03-56-55.JPG (only 1 img on server)
  5. S15Station40Camera1CAM414102015-11-04__03-22-48.JPG (only 1 img on server)
  6. S15Station40Camera2CAM415322015-11-04__03-22-48.JPG (only 1 img on server)
  7. S15Station40Camera1CAM414102015-11-06__03-56-55.JPG (only 1 img on server)

-> Maybe I am just being a complete idiot, but hear me out. Computers don't allow two files with the same file name to be placed in the same location, so there is no way that there could actually be 2 images with the same filename, since one will always write over the other. I am assuming this happened during the processing step, when the processed images get written to the computer/external hard drive. So there will always be only 1 image, even though there may be 2 rows in the database with the same filename. I am assuming the other image was lost right at the beginning. Am I missing something here?

PhilipFaure commented 3 years ago

Thanks Shannon, you're correct. There can never be more than one file with a given filename in the same location. The modalDialog alert checks the number of unique filenames because this is what causes the discrepancy: when IDS exports images to the user's computer, any file with the same name as an earlier one overwrites it, since (as you mention in your second comment) two files with the same name cannot exist independently in the same place. The check therefore warns people that somewhere along the line something happened, and now there are two records for the same image (which potentially could have been two images in the raw image data).
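The overwrite behaviour described here is easy to reproduce outside IDS. The sketch below is plain Python with made-up filenames and contents: it writes three records into one folder, two of which share a name, and shows that only two files remain even though three writes were reported.

```python
import tempfile
from pathlib import Path

# Hypothetical export records as (FileName_New, image bytes); two share a name.
records = [("a.JPG", b"first"), ("a.JPG", b"second"), ("b.JPG", b"third")]

with tempfile.TemporaryDirectory() as export_dir:
    for name, payload in records:
        # Writing to an existing path silently replaces the earlier file.
        (Path(export_dir) / name).write_bytes(payload)
    exported = sorted(p.name for p in Path(export_dir).iterdir())
    surviving = (Path(export_dir) / "a.JPG").read_bytes()

reported = len(records)  # what a "Success! N images exported" message would count
print(reported, exported, surviving)
```

Counting records gives 3 while the folder holds 2 files, which is the same pattern as the "120 exported" message next to 118 files on disk.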

The root cause of the multiple records in the databases is most likely camerabase, which collapses images into one if they were taken at the same date-time. After camerabase, Panthera switched over to camtrapR shortly before switching over to PantheraR and now IDS. My guess is that it happened somewhere during the switch from camerabase to camtrapR, since there are image names which correspond to the naming structure of camtrapR in e.g. S15_2015.

In any case, users should now be able to see that IDS is not skipping any images, but instead the images which are "missing" are not on the server due to historic data management practices.

PhilipFaure commented 3 years ago

Closing issue as there is nothing wrong with the exporter in IDS.

A new issue has been opened for the possibly incorrect cameras active tables of old datasets, and the wrong image names on the server: #157

Shannon-Dubay commented 3 years ago

A response to the following (bear with me, it's important):

The potential reason behind more than one row in the database having the same FileName_New. Phil (in the post above): The root cause for the multiple records in the databases is most likely due to camerabase which collapses images into one, if they were taken at the same date-time. -> I have done some digging and believe that this is all simple duplication of raw images. If this were an issue from camerabase, then when digging into the raw images (original file names from the PantheraCam), one would expect to find a "burst" of images taken in very quick succession (maybe a leopard sniffing the camera, or a baboon playing with the camera), the idea being that camerabase later collapses this into just one image. However, this is not what I find.

Due to limited access to raw images, I had to use another database for my example. S114_2017 (server-based) has a few instances where 2 rows have identical FileName_New, which you can see in the following screenshot (I have added the blue line to more easily differentiate between the instances).

Screenshot 2020-09-10 at 17 10 48


Interestingly, and suspiciously, these 10 rows (5 unique FileName_New) are all the data for this camera on this day according to the database. Upon navigating through the raw data to the correct camera on the correct day (pantherabucketidsrawarchive/S114_2017.../Original_Data/First/CAM41432/072117), it appears there are only 5 images taken by this camera on this day. I downloaded these images for easier interrogation:

[image]

[image]

By looking at the Date Taken attribute, you can see that these raw images match up with the time specified in the FileName_New for these images perfectly. Again, I want to emphasize that these were the only images taken on this camera on this day, so there was no burst leading to images being deleted, nor were there other images that were "written over" by another image.

By looking at the TriggerID in the database screenshot, you can see that the Trigger ID skips suffix numbers, which I also found suspicious. There is no S114_20170711_20171004Station21CAM41432__38 (for example) in the database. What would cause these suffix numbers to skip like this, and perhaps this will give us some insight into why these duplicates occurred?
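One way to list every skipped suffix, instead of spotting them by eye, is sketched below in Python. The helper `missing_suffixes` is hypothetical, it assumes the suffix is the integer after the final double underscore as in the example above, and the trigger IDs ending in 37, 39 and 41 are invented for the illustration (the thread only states that 38 is absent).

```python
import re

SUFFIX_RE = re.compile(r"__(\d+)$")  # assumes the suffix follows a final "__"


def missing_suffixes(trigger_ids):
    """Return the suffix numbers absent between the smallest and largest seen."""
    present = set()
    for trigger_id in trigger_ids:
        match = SUFFIX_RE.search(trigger_id)
        if match:  # IDs without a numeric suffix are ignored
            present.add(int(match.group(1)))
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]


base = "S114_20170711_20171004Station21CAM41432__"
gaps = missing_suffixes([base + "37", base + "39", base + "41"])  # invented suffixes
print(gaps)
```

If the skipped numbers line up one-to-one with the duplicated rows, that would hint that each duplicate consumed a suffix at some stage of processing.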

Shannon-Dubay commented 3 years ago

Also, I really don't feel that exporting duplicate images is the best idea. A user could very easily use one of the duplicate images in reco, or in other analyses, without realizing it, since most people don't interrogate the file names of their files.

If you feel strongly that the duplicate images need to be exported, then it needs to be very clear in the informational popup that this is being done, so that users are aware.

RossPitman commented 3 years ago

Folks if I recall correctly, we deliberately duplicated rows during the migration process since other software, like Camerabase, could only handle single records (or something like that). I can't remember the logic exactly. In any case, because IDS collapses records when creating independent records (for analyses) using DateTimeOriginal the duplication isn't a cause for concern.
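As a rough illustration of why duplicated rows drop out at that stage, the Python sketch below keeps one row per camera and DateTimeOriginal. It is not IDS's independent-records logic: the function name, the "Camera" key, and the sample rows are assumptions, and only the DateTimeOriginal column name comes from this comment.

```python
def collapse_by_datetime(rows):
    """Keep the first row for each (Camera, DateTimeOriginal) pair."""
    seen = set()
    kept = []
    for row in rows:
        key = (row["Camera"], row["DateTimeOriginal"])  # "Camera" is a hypothetical column
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical rows: the first two are a duplicated record.
rows = [
    {"Camera": "CAM1", "DateTimeOriginal": "2017-07-21 10:00:00", "ImageNumber": 1},
    {"Camera": "CAM1", "DateTimeOriginal": "2017-07-21 10:00:00", "ImageNumber": 2},
    {"Camera": "CAM1", "DateTimeOriginal": "2017-07-21 10:05:00", "ImageNumber": 1},
]
collapsed = collapse_by_datetime(rows)
print(len(rows), len(collapsed))
```

Two rows with the same timestamp on the same camera reduce to one record, so the duplicated row does not inflate counts in downstream analyses.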

Shannon-Dubay commented 3 years ago

OK. So if I understand correctly, some rows were duplicated during migration to account for "missing" captures that were previously consolidated in other software? I know these duplicated rows have very little impact on analyses or on the overall quality of the data, but some users may argue that it's important to include the correct "missing" images (those previously thrown out when the software consolidated them) so that the best image can be used for pattern recognition, for instance. As I understand it, these images are lost and we do not intend to relocate them or fix this, correct? Would you mind drafting a quick little blurb that I can use to communicate this to any concerned users? (I don't want to say the wrong thing, as people can be sensitive about these topics.)