michaelaye / planet4

Analysis software for the PlanetFour citizen science project.
www.planetfour.org
ISC License

Duplicate entries in the 0.5 cut catalog #53

Open mschwamb opened 8 years ago

mschwamb commented 8 years ago

It appears that there are duplicate entries in at least some of the 0.5 cut csv files.

For example I have:

488.92527796427413,283.42631325721743,2277.2586112976073,6649.35964659055,209.19709872152092,44.129834465313714,128.01835043922821,391.3928304362869,271.22236959730355,427.41745758605964,206.75622395261806,APF0000q7s,1

listed twice in ESP_021494_0945_fans.csv under applied_cut_0.5

178.03871848366478,291.06212165138936,918.0387184836648,11251.062121651392,51.881296109111894,13.286515110774953,13.777289639625213,186.2403877064726,304.335176078663,169.83704926085696,277.78906722411574,167.19966246744863,299.56674242432524,188.87777449988093,282.5575008784535,APF0000kou,1

is listed twice for APF0000kou in ESP_021526_0985_blotches.csv

It doesn't look like every entry is duplicated so I'm not sure what exactly happened here.
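The kind of check involved can be sketched with pandas (synthetic rows here, not the real catalog columns): `duplicated(keep=False)` flags every copy of a repeated row, which is how entries like the ones above can be located in a cut file.

```python
import pandas as pd

# Minimal sketch with made-up data: flag rows that appear more than once,
# the way the catalog entries quoted above do in the cut csv files.
df = pd.DataFrame(
    {"marking_id": ["APF0000q7s", "APF0000q7s", "APF0000kou"],
     "x": [488.925, 488.925, 178.039]}
)
dupes = df[df.duplicated(keep=False)]  # keep=False marks every copy, not just repeats
print(len(dupes))  # → 2 (both copies of the APF0000q7s row)
```

On the real files one would `pd.read_csv` each catalog csv instead of building the frame by hand.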

michaelaye commented 8 years ago

Wow, I started to look at this yesterday, thanks for the find; something is seriously wrong with how I apply the cut. Fortunately, the fnotching itself is okay, as evidenced by the fact that the files outside the applied cut don't seem to show any duplicates (I ran a check on all of them). But here are the stats on the cut-applied files:

(screenshot: duplicate stats for the cut-applied files)

Meaning, while most have no dupes, there are more than 10 files with between 1000 and 2000 dupes each! Looking into this now.
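A check "on all of them" like the one described could look roughly like this (a sketch, not the project's actual script; the directory layout and filename pattern are assumptions):

```python
from pathlib import Path

import pandas as pd

def dupe_counts(folder):
    """Count exact duplicate rows in every csv under `folder`.

    Hypothetical helper: returns {filename: number_of_duplicated_rows},
    so files with zero dupes can be separated from the problem cases.
    """
    counts = {}
    for path in Path(folder).glob("*.csv"):
        df = pd.read_csv(path)
        counts[path.name] = int(df.duplicated().sum())
    return counts
```

Filtering the returned dict for nonzero values gives exactly the kind of per-file stats quoted above.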

mschwamb commented 8 years ago

I think this explains why the catalog has significantly more fans and blotches than were marked by the science team for the gold standard data, and it might also explain the variation Anya saw for two images taken at very close temporal separations.

michaelaye commented 8 years ago

So, this led down a rabbit hole, but I'm seeing the end of it:

michaelaye commented 8 years ago

So, after fixing the above-mentioned bug, the remaining dupes are as follows: out of 439 blotch and fan files for seasons 2 and 3, 172 show duplicates, with the ones having more than 20 looking like this:
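If the surviving duplicates are exact row copies, the obvious cleanup (sketched here with synthetic data; whether this is the actual fix applied in the pipeline is an assumption) is to dedupe before writing the cut files out:

```python
import pandas as pd

# Hedged sketch: strip exact duplicate rows from a catalog frame
# before the cut-applied csv is written.
df = pd.DataFrame(
    {"marking_id": ["APF0000q7s", "APF0000q7s", "APF0000kou"],
     "x": [488.9, 488.9, 178.0]}
)
clean = df.drop_duplicates()  # keeps the first copy of each repeated row
print(len(clean))  # → 2
```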

(screenshot 2016-06-02 09:59:20: per-file counts for the files with more than 20 duplicates)

Anya will run this new catalog through her scripts today to see whether it has any influence on the variability of the early-in-the-season data points.

michaelaye commented 8 years ago

Within that highest obsid, the distribution of the top duplicate containers is like this:

(screenshot 2016-06-02 10:36:09: distribution of the top duplicate containers within that obsid)