open-city / water

Projects for the water team

Try neural network for CSO forecasting #9

Open amfrandolph opened 9 years ago

amfrandolph commented 9 years ago

This was a suggestion from Seb.

"Zane's R code is using a linear combination of many variables: yhat = b0 + b1*x1 + b2*x2 + ... + bp*xp

But the learner could very well be of the form yhat = b0 + b1*x1^c1 + b2*x2^c2 + ... + bp*xp^cp, where c1, c2, ..., cp are exponents > 1.

I understand that in our case we have many inputs (the x_i, precipitation values for each month at each site) and that the output y is the number of overflows (CSOs).

I think that a neural network might do better at CSO forecasting, since it can presumably explore a more diverse landscape of behaviors (such as the non-linear ones)."
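
To make the contrast between the two forms concrete, here is a minimal Python sketch. It is not Zane's R code: the function names, coefficients, and input values are all made up for illustration. It evaluates the linear form and the power form for a single example.

```python
def linear_predict(b0, b, x):
    """Linear form: yhat = b0 + sum of b_i * x_i."""
    return b0 + sum(bi * xi for bi, xi in zip(b, x))


def power_predict(b0, b, c, x):
    """Power form: yhat = b0 + sum of b_i * x_i ** c_i."""
    return b0 + sum(bi * xi ** ci for bi, xi, ci in zip(b, x, c))


# Hypothetical coefficients, exponents, and inputs (illustration only).
b0 = 1.0
b = [2.0, 3.0]
x = [2.0, 4.0]
c = [2.0, 3.0]

linear_yhat = linear_predict(b0, b, x)    # 1 + 2*2 + 3*4
power_yhat = power_predict(b0, b, c, x)   # 1 + 2*2**2 + 3*4**3
```

With every exponent set to 1 the power form reduces to the linear one, so the linear model is a special case of it. A neural network does not fit exponents explicitly; it is only one of several ways to capture this kind of non-linearity.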

sebhtml commented 9 years ago

I will try Torch:

I believe that it is programmable in Lua.

I will first go through this example: https://github.com/nicholas-leonard/dp/blob/master/doc/neuralnetworktutorial.md

Also, I will figure out the input format required by Torch and then convert the data in SewageModel/data/ to that format (the so-called Data Janitor Task).

sebhtml commented 9 years ago

Indicators of progress:

sebhtml commented 9 years ago

@zscore Both files (munged_data.RDS and transformed_precip.RDS) have 67912 rows. Row i in the first file is paired with row i in the second file, right?

sebhtml commented 9 years ago

@zscore In munged_data.RDS, the columns from "segment_1" through "Wilmette DS-M114N-2" are names of places where a CSO can occur, right? 0 means normal and 1 means overflow. Is that correct?

@amfrandolph Also, in transformed_precip.RDS there are columns with similar names. For example: ord_precip_1, ord_precip_2, ord_precip_97, and so on. I suppose that "ord" is for the airport. What is the meaning of the number at the end (1, 2, 97, and so on)?

The input values contain precipitation values (67912 examples). The output values are the sewage overflows (segment_* or other stranger names).
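
If the row-i pairing asked about above holds (nothing in this thread confirms it yet), building training examples amounts to pairing precipitation rows with overflow rows by index. Here is a minimal Python sketch under that assumption. The helper name `pair_examples` is hypothetical, the toy rows are made-up stand-ins for the contents of transformed_precip.RDS and munged_data.RDS, and the 0/1 coding is the unconfirmed reading from the question above.

```python
def pair_examples(precip_rows, overflow_rows):
    """Pair row i of the inputs with row i of the outputs.

    Raises ValueError if the two tables differ in length, because an
    index-wise pairing would then be meaningless.
    """
    if len(precip_rows) != len(overflow_rows):
        raise ValueError(
            f"row count mismatch: {len(precip_rows)} vs {len(overflow_rows)}"
        )
    return list(zip(precip_rows, overflow_rows))


# Toy stand-ins: 3 examples, 2 precipitation features, 2 overflow sites.
precip_rows = [[0.0, 0.1], [1.2, 0.8], [0.0, 0.0]]
# Assumed coding: 0 = normal, 1 = overflow.
overflow_rows = [[0, 0], [1, 0], [0, 0]]

examples = pair_examples(precip_rows, overflow_rows)
```

The length check would catch the two files drifting apart, but it cannot detect rows that were reordered in one file only. A shared key column, if the data has one, would be a safer basis for the join.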

amfrandolph commented 9 years ago

@seb I don't have the answer to the question about the column names. Scott may, or he can give us a lead on who originally wrote the code. I would be glad to help write a code book that defines our variables as we keep learning.

(Sent by email in reply to Sébastien Boisvert's comment above, dated Sat, Jan 24, 2015 at 4:31 PM.)

sebhtml commented 9 years ago

Zane said that he was able to make 'glm' converge by using a lower number of predictors (he said that there were issues when there are too many correlated predictors).
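
The thread does not say how Zane picked the smaller set of predictors. As an illustration only, one common way to thin out correlated predictors is a greedy filter on pairwise Pearson correlation. The Python sketch below uses a hypothetical helper `drop_correlated`, made-up column names, and an arbitrary 0.95 threshold.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0 or sv == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / (su * sv)


def drop_correlated(columns, threshold=0.95):
    """Greedily keep columns whose |r| with every kept column is below threshold.

    `columns` maps a column name to its list of values.
    Returns the kept names in their original order.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up predictor columns (illustration only).
columns = {
    "precip_a": [0.0, 1.0, 2.0, 3.0],
    "precip_b": [0.0, 2.0, 4.0, 6.0],  # exactly 2 * precip_a
    "precip_c": [1.0, 0.0, 1.0, 0.0],
}
kept = drop_correlated(columns)
```

Here `precip_b` is dropped because it is a multiple of `precip_a`, while `precip_c` is kept. The result depends on column order, since the filter always favors whichever column it sees first.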

In the paper "Hydrologic and Hydraulic Modeling of the Tunnel and Reservoir Plan.pdf", they focused on dropshaft CDS-51.