mschwamb opened this issue 8 years ago
Wow, I started to look at this yesterday. Thanks for the find; something is seriously wrong with how I apply the cut. Fortunately, the fnotching itself is okay, as evidenced by the fact that the files outside the applied cut don't show any duplicates (I ran a check on all of them). But here are the stats on the cut-applied files: while most have no dupes, more than 10 files have between 1,000 and 2,000 dupes each! Looking into this now.
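For reference, the per-file duplicate counts can be gathered with something like the snippet below. The directory layout and file patterns are assumptions based on the file names mentioned in this issue, not the actual pipeline code.

```python
from pathlib import Path
import pandas as pd

# Assumed location of the cut-applied catalog files; adjust to the real output directory.
cut_dir = Path("applied_cut_0.5")

stats = {}
for csv_path in sorted(cut_dir.glob("*_fans.csv")) + sorted(cut_dir.glob("*_blotches.csv")):
    df = pd.read_csv(csv_path)
    # A "dupe" here means a row that is an exact copy of another row.
    stats[csv_path.name] = int(df.duplicated().sum())

stats = pd.Series(stats).sort_values(ascending=False)
print(stats.head(20))                      # files with the most duplicates
print((stats == 0).sum(), "files without any duplicates")
```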
I think this explains why I see significantly more fans and blotches in the catalog than were marked by the science team for the gold-standard data, and it might explain the variation Anya saw for two images taken at very close temporal separations.
So, this led me down a rabbit hole, but I'm seeing the end of it.

The first issue concerns the `image_id` and `image_name` columns, which I store with a predetermined string length because that currently saves around 2 GB of disk space for the reduced classification database. Not a biggie, apart from when I want to copy the database file around often. For now, worked around by again using plain strings for `image_id` and `image_name`.
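As a minimal sketch only, assuming the reduced database is a pandas `HDFStore` in table format (an assumption on my part; keys, widths, and the file name below are made up for illustration): a predetermined string width is what `min_itemsize` controls.

```python
import pandas as pd

# Hypothetical example data; the column names mirror the ones discussed in this issue.
df = pd.DataFrame(
    {
        "image_id": ["APF0000q7s", "APF0000kou"],
        "image_name": ["ESP_021494_0945", "ESP_021526_0985"],
        "x": [488.9, 178.0],
        "y": [283.4, 291.1],
    }
)

with pd.HDFStore("reduced_classifications.h5", mode="w") as store:
    # Fixed column widths: every string is stored with this many characters,
    # which keeps the on-disk layout compact and predictable.
    store.append(
        "data",
        df,
        format="table",
        min_itemsize={"image_id": 10, "image_name": 20},
    )
```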
The second issue: I used to rename `image_x` and `image_y` to `x` and `y` when clustering on the hirise scope. This meant that once I removed the renaming and kept both `image_x`/`image_y` and `x`/`y`, the fnotching would use the planet4 `x`/`y` coords even though the pipeline was working on the hirise scope. This led to the sometimes thousands of duplicates: the fnotching code, when presented with all the data for a whole HiRISE image but only looking at P4 tile coordinates, of course found many overlapping clusters, since there are many P4 tiles in a HiRISE image. Fixed by now always requiring a `scope` argument that states at all times which scope I'm working in (planet4 or hirise) and then uses the appropriate data columns, without losing any (see the sketch further below).

So, the remaining dupes after fixing the above-mentioned bug: out of 439 blotch and fan files for seasons 2 and 3, 172 show duplicates, with the ones having more than 20 looking like this:
Anya will run this new catalog through her scripts today to see whether it has any influence on the variability of the early-in-the-season data points.
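For clarity, the scope handling described above could look roughly like the sketch below; the function and column names are illustrative assumptions, not the actual pipeline code.

```python
import pandas as pd

def coordinate_columns(scope: str) -> tuple[str, str]:
    """Return the coordinate columns to use for the given scope.

    'planet4' works in tile coordinates (x/y), 'hirise' in full-image
    coordinates (image_x/image_y). Anything else is rejected.
    """
    if scope == "planet4":
        return "x", "y"
    if scope == "hirise":
        return "image_x", "image_y"
    raise ValueError(f"Unknown scope: {scope!r}")

def positions_for_fnotching(data: pd.DataFrame, scope: str) -> pd.DataFrame:
    """Copy the scope-appropriate coordinates into dedicated columns,
    keeping all other columns untouched."""
    xcol, ycol = coordinate_columns(scope)
    return data.assign(fnotch_x=data[xcol], fnotch_y=data[ycol])
```

With a required `scope` argument like this, fnotching on the hirise scope can no longer silently fall back to the tile coordinates.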
Within that highest obsid, the distribution of the top duplicate containers looks like this:
It appears that there are duplicate entries, at least in the 0.5-cut CSV files, in some cases.
For example I have:
488.92527796427413,283.42631325721743,2277.2586112976073,6649.35964659055,209.19709872152092,44.129834465313714,128.01835043922821,391.3928304362869,271.22236959730355,427.41745758605964,206.75622395261806,APF0000q7s,1
listed twice in ESP_021494_0945_fans.csv under applied_cut_0.5
178.03871848366478,291.06212165138936,918.0387184836648,11251.062121651392,51.881296109111894,13.286515110774953,13.777289639625213,186.2403877064726,304.335176078663,169.83704926085696,277.78906722411574,167.19966246744863,299.56674242432524,188.87777449988093,282.5575008784535,APF0000kou,1
is listed twice (marking APF0000kou) in ESP_021526_0985_blotches.csv.
It doesn't look like every entry is duplicated, so I'm not sure what exactly happened here.
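One way to see which entries are affected is sketched below; the path is taken from the examples above and may not match the real catalog layout exactly.

```python
import pandas as pd

# Sketch: list the rows that occur more than once in one of the cut files
# mentioned above. Adjust the path to the actual catalog layout.
df = pd.read_csv("applied_cut_0.5/ESP_021494_0945_fans.csv")

# keep=False marks every copy of a duplicated row, so the affected markings
# can be inspected side by side.
dupes = df[df.duplicated(keep=False)].sort_values(df.columns.tolist())
print(f"{len(dupes)} of {len(df)} rows are exact duplicates")
print(dupes)
```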