janmtl / stratified_bayesian_blocks

Smart histogram bins for mixed discrete/continuous data, based on Scargle and Jake VDP's Bayesian Blocks
ISC License

alternate way to allow for multiple data points #1

Open jscargle opened 6 years ago

jscargle commented 6 years ago

I would like to learn more about your work. There have been some generalizations and developments in Bayesian Blocks that have not yet been published. At some point I plan to post this to GitHub ... real soon.

I believe there is a simple way to incorporate multiple data points, depending on the cost function used, as described in our paper:

Scargle, J. D., Norris, J. P., Jackson, B., & Chiang, J. (2013), "Studies in Astronomical Time Series Analysis. VI. Bayesian Block Representations," Astrophysical Journal, 764:167 (26pp).

For example, in the "workhorse" cost function for event data we use [eq. (19)] N (log N - log T) for the cost of a block of length T containing N events. Duplicate values of the event times just get absorbed into this function, except possibly for some ambiguities in the definition of the beginning and ending times of the block (i.e., the definition of T).
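As a minimal sketch of that term (not code from the paper), it could be computed like so; the event times and the particular choice of T below are illustrative assumptions:

```python
import numpy as np

def block_fitness(n_events, block_length):
    """Eq. (19) cost for an event-data block:
    N (log N - log T) for N events in a block of length T."""
    return n_events * (np.log(n_events) - np.log(block_length))

# Duplicate event times simply raise N for the block that contains
# them; the cost function itself needs no modification.
events = np.array([0.1, 0.5, 0.5, 0.5, 0.9])  # three duplicates at t = 0.5
T = events.max() - events.min()               # one possible definition of T
print(block_fitness(len(events), T))
```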

I wonder how this compares to your approach. Cheers, Jeff Scargle

janmtl commented 6 years ago

Hi @jscargle,

Great to hear from you! I'm looking forward to seeing the code for the new developments. I will have to read your paper in more detail to understand the improved method but I can provide some context for the code in this repo in the meantime.

When I did this work with my collaborators at ID Analytics, we were trying to find a better way to prepare features for a machine-learning algorithm. Specifically, we had received some data that was a mixture of continuous and discrete signals contained in the same column of a given table. The values we observed were effectively sampled from two distributions: (1) a "nice" continuous distribution for which Bayesian Blocks gave a nice density estimate and (2) a discrete distribution of unknown support (i.e.: we did not know a priori the possible values of this distribution).

We tried at first to separate out the two distributions through heuristics (by using unique value counts), but we then discovered that the discrete distribution wasn't actually perfectly "discrete". It was more akin to the example I give here: https://medium.com/@janplus/stratified-bayesian-blocks-2bd77c1e6cc7, i.e., each discrete "spike" was slightly noisy and constituted a significant volume of the total density. Where can one get such unusual data? (I hear you ask.) Well, we were working in credit and identity fraud, so this column was likely the result of a lot of other variables pieced together through some sort of "index" formula created by a third-party vendor.
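Purely for illustration (the real data can't be shared), a toy column with this shape might be simulated as below; all the distribution parameters and support points are made-up assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Continuous component: a "nice" smooth distribution that vanilla
# Bayesian Blocks handles well.
continuous = rng.normal(loc=50.0, scale=10.0, size=5000)

# Discrete component: spikes at support points unknown a priori,
# each slightly noisy rather than perfectly discrete.
support = np.array([20.0, 35.0, 62.0])
spikes = rng.choice(support, size=3000) + rng.normal(0.0, 0.05, size=3000)

# Both signals arrive mixed together in a single column.
mixed_column = np.concatenate([continuous, spikes])
```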

We developed the method in this repo through trial and error. Our criteria were (A) that the "spikes" in the input data were retained in the block representation and (B) that our ML algorithms saw an improvement from features pre-processed using this method. In the end, we achieved both (A) and (B) and saw a moderate lift from being able to capture both the discrete and continuous nature of the data in the column in question.

I would find it difficult to show that this method is useful without also publishing the dataset on which it performed so well, but as I'm sure you can understand, there was no way for us to publish credit fraud data. Sadly, this ties my hands since I do not have a suitable dataset on which to perform a reproducible study.

Cheers,

Jan

jscargle commented 6 years ago

Hi Jan ... thanks for the reply and this very clear description of your problem.

If I understand correctly, a perhaps oversimplified summary is: do an optimal block segmentation and then try to separate two components:

(1) short blocks (the discrete spikes)
(2) long blocks (the continuous distribution)

and hope that the distributions of the block lengths for (1) and (2) do not overlap (too much), as in the sketch below.
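A minimal sketch of that recipe, assuming astropy's bayesian_blocks and toy data; the mixture values and the length threshold are illustrative, not a prescription:

```python
import numpy as np
from astropy.stats import bayesian_blocks

rng = np.random.default_rng(0)
# Toy mixture: a smooth continuum plus slightly noisy spikes.
values = np.concatenate([
    rng.normal(50.0, 10.0, size=2000),
    rng.choice([20.0, 35.0, 62.0], size=1000) + rng.normal(0.0, 0.05, 1000),
])

# Optimal block segmentation of the values treated as "events".
edges = bayesian_blocks(np.sort(values), fitness="events")
lengths = np.diff(edges)

# Separate the two components by block length and hope the two
# length distributions do not overlap (too much).
threshold = 1.0  # chosen by eye; entirely data-dependent
spike_blocks = np.flatnonzero(lengths < threshold)       # (1) discrete spikes
continuum_blocks = np.flatnonzero(lengths >= threshold)  # (2) continuum
```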

I have not been able to think of any astronomical data of this kind. But a while ago I worked with some genetic sequence data that might come close. It was a search for "GC islands" (intervals rich in G and C, and poor in A and T).

Being rather ignorant of this science, I was stunned to find a clearly defined discrete component in the distribution of the block lengths --- namely integer multiples of 144 base pairs (I may be misremembering the exact number, but it was ~144); see the attached density plots of the GC richness (a fraction between 0 and 1) and log10 of the block length. The second plot "guides" your eye to the first six multiples of 144. I then learned that this is sort of a magic number, equal to the number of base pairs in one turn of the double helix! (This has not yet been published.)

Note that the 1D distribution of block lengths is a set of spikes superimposed on a noisy continuum. If this interests you I can provide more details. The BB runs were pretty long (N in the several hundreds of millions), and here pruning was important.

Cheers, Jeff


