mschwamb opened this issue 8 years ago
Wow, I started to look at this yesterday. Thanks for the find; something is seriously wrong with how I apply the cut. Fortunately, the fnotching itself is okay, as evidenced by the fact that the files outside the applied cut don't show any duplicates (I ran a check on all of them). But here are the stats on the cut-applied files: while most have no dupes, more than 10 files have between 1,000 and 2,000 dupes each! Looking into this now.
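For reference, the per-file duplicate counts can be gathered with something like the snippet below. The directory layout and file patterns are assumptions based on the file names mentioned in this issue, not the actual pipeline code.

```python
from pathlib import Path
import pandas as pd

# Assumed location of the cut-applied catalog files; adjust to the real output directory.
cut_dir = Path("applied_cut_0.5")

stats = {}
for csv_path in sorted(cut_dir.glob("*_fans.csv")) + sorted(cut_dir.glob("*_blotches.csv")):
    df = pd.read_csv(csv_path)
    # A "dupe" here means a row that is an exact copy of another row.
    stats[csv_path.name] = int(df.duplicated().sum())

stats = pd.Series(stats).sort_values(ascending=False)
print(stats.head(20))                      # files with the most duplicates
print((stats == 0).sum(), "files without any duplicates")
```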
I think this explains why I see significantly more fans and blotches in the catalog than were marked by the science team for the gold-standard data, and it might explain the variation Anya saw for two images taken at very close temporal separations.
So, this led me down a rabbit hole, but I'm seeing the end of it.

The first issue concerns the `image_id` and `image_name` columns, which I store with a predetermined string length because that currently saves around 2 GB of disk space for the reduced classification database. Not a biggie, apart from when I want to copy the database file around often. For now, worked around by again using plain strings for `image_id` and `image_name`.
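As a minimal sketch only, assuming the reduced database is a pandas `HDFStore` in table format (an assumption on my part; keys, widths, and the file name below are made up for illustration): a predetermined string width is what `min_itemsize` controls.

```python
import pandas as pd

# Hypothetical example data; the column names mirror the ones discussed in this issue.
df = pd.DataFrame(
    {
        "image_id": ["APF0000q7s", "APF0000kou"],
        "image_name": ["ESP_021494_0945", "ESP_021526_0985"],
        "x": [488.9, 178.0],
        "y": [283.4, 291.1],
    }
)

with pd.HDFStore("reduced_classifications.h5", mode="w") as store:
    # Fixed column widths: every string is stored with this many characters,
    # which keeps the on-disk layout compact and predictable.
    store.append(
        "data",
        df,
        format="table",
        min_itemsize={"image_id": 10, "image_name": 20},
    )
```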
The second issue: I used to rename `image_x` and `image_y` to `x` and `y` when clustering on the hirise scope. This meant that once I removed the renaming and kept both `image_x`/`image_y` and `x`/`y`, the fnotching would use the planet4 `x`/`y` coords even though the pipeline was working on the hirise scope. This led to the sometimes thousands of duplicates: the fnotching code, when presented with all the data for a whole HiRISE image but only looking at P4 tile coordinates, of course found many overlapping clusters, since there are many P4 tiles in a HiRISE image. Fixed by now always requiring a `scope` argument that states at all times which scope I'm working in (planet4 or hirise) and then uses the appropriate data columns, without losing any (see the sketch further below).

So, the remaining dupes after fixing the above-mentioned bug: out of 439 blotch and fan files for seasons 2 and 3, 172 show duplicates, with the ones having more than 20 looking like this:
Anya will run this new catalog through her scripts today to see whether it has any influence on the variability of the early-in-the-season data points.
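For clarity, the scope handling described above could look roughly like the sketch below; the function and column names are illustrative assumptions, not the actual pipeline code.

```python
import pandas as pd

def coordinate_columns(scope: str) -> tuple[str, str]:
    """Return the coordinate columns to use for the given scope.

    'planet4' works in tile coordinates (x/y), 'hirise' in full-image
    coordinates (image_x/image_y). Anything else is rejected.
    """
    if scope == "planet4":
        return "x", "y"
    if scope == "hirise":
        return "image_x", "image_y"
    raise ValueError(f"Unknown scope: {scope!r}")

def positions_for_fnotching(data: pd.DataFrame, scope: str) -> pd.DataFrame:
    """Copy the scope-appropriate coordinates into dedicated columns,
    keeping all other columns untouched."""
    xcol, ycol = coordinate_columns(scope)
    return data.assign(fnotch_x=data[xcol], fnotch_y=data[ycol])
```

With a required `scope` argument like this, fnotching on the hirise scope can no longer silently fall back to the tile coordinates.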
Within that highest obsid, the distribution of the top duplicate containers looks like this:
It appears that there are duplicate entries, at least in the 0.5-cut CSV files, in some cases.
For example I have:
488.92527796427413,283.42631325721743,2277.2586112976073,6649.35964659055,209.19709872152092,44.129834465313714,128.01835043922821,391.3928304362869,271.22236959730355,427.41745758605964,206.75622395261806,APF0000q7s,1
listed twice in ESP_021494_0945_fans.csv under applied_cut_0.5
178.03871848366478,291.06212165138936,918.0387184836648,11251.062121651392,51.881296109111894,13.286515110774953,13.777289639625213,186.2403877064726,304.335176078663,169.83704926085696,277.78906722411574,167.19966246744863,299.56674242432524,188.87777449988093,282.5575008784535,APF0000kou,1
is listed twice (marking APF0000kou) in ESP_021526_0985_blotches.csv.
It doesn't look like every entry is duplicated, so I'm not sure what exactly happened here.
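One way to see which entries are affected is sketched below; the path is taken from the examples above and may not match the real catalog layout exactly.

```python
import pandas as pd

# Sketch: list the rows that occur more than once in one of the cut files
# mentioned above. Adjust the path to the actual catalog layout.
df = pd.read_csv("applied_cut_0.5/ESP_021494_0945_fans.csv")

# keep=False marks every copy of a duplicated row, so the affected markings
# can be inspected side by side.
dupes = df[df.duplicated(keep=False)].sort_values(df.columns.tolist())
print(f"{len(dupes)} of {len(df)} rows are exact duplicates")
print(dupes)
```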