vlang / vsl

V library to develop Artificial Intelligence and High-Performance Scientific Computations
https://vlang.github.io/vsl
MIT License
339 stars 46 forks

vsl.ml: Random Forest #126

Open ulises-jeremias opened 1 year ago

ulises-jeremias commented 1 year ago

Describe the feature

We want to create a new model in vsl.ml that does classification using the Random Forest algorithm. The model should have the following structure:

[heap]
pub struct RandomForest {
mut:
    name              string     // name of this "observer"
    data              &Data[f64] // x data
    stat              &Stat[f64] // statistics about x (data)
    min_samples_split int
    max_depth         int
}

With the following methods:

    name() string
mut:
    update() // called by Data when it changes
    train()
    predict(x [][]f64) []f64

BMJHayward commented 1 year ago

I've been making some progress on this in my own fork. Not sure how to "share" the data object between the random forest &Data[f64] and individual trees yet though.

ulises-jeremias commented 1 year ago

@BMJHayward feel free to send me a Draft Pull Request or the link to the branch where you have your changes so I can suggest how to do it 😊

BMJHayward commented 1 year ago

Thanks @ulises-jeremias here's my current branch, not quite ready for PR: https://github.com/BMJHayward/vsl/tree/126_implement_random_forest

I thought I'd make a proper decision tree implementation and use that in the RF as well, but using the Data interface makes it tricky as I can't just use Data.x and Data.y. Or can I? Each tree uses several different indexes.

ulises-jeremias commented 1 year ago

Yeah, I think that is not enough. Data.x and Data.y should only be used to replace the data you were receiving here. You will probably need another struct, or multiple instances of Data, though the latter is probably less efficient.

ulises-jeremias commented 1 year ago

I just updated master with some new methods.

There are two ways you can do this now. The first creates a new instance of Data that shares the reference to x and takes the new y:

mut new_data := data.clone_with_same_x()
new_data.set_y(new_index_y)?

Or, if you want x to be a new instance of la.Matrix, you can create multiple instances of Data with data.clone() and then set y with the new index:

mut data_with_new_index := data.clone()
data_with_new_index.set_y(new_index_y)?

ulises-jeremias commented 1 year ago

@BMJHayward ^^

BMJHayward commented 1 year ago

Excellent, thank you. I'll take a look over the weekend.

ulises-jeremias commented 1 year ago

@BMJHayward hey! did that work? is there anything else I can do to help?

BMJHayward commented 1 year ago

@ulises-jeremias hi, thanks for following up on this. The lynchpin for me is in ml.tree.grow_tree. In line 155 the tree is "grown" by randomly selecting columns, or just their indexes, and splitting based on them. The rand module samples without replacement, so I think I need to add an option there to sample with replacement. That is, a tree can use columns [1,2,3,1,2,3] and it would be perfectly legitimate to use them twice.
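
To make the distinction concrete, here is a minimal sketch of drawing index lists with and without replacement. Both helpers are hypothetical and are not part of vsl; the only library call is `rand.intn` from V's standard `rand` module, which I am assuming returns a result type as in recent V releases.

```v
import rand

// sample_with_replacement draws k indices from 0..n; the same index may
// appear more than once. Hypothetical helper, not part of vsl.
fn sample_with_replacement(n int, k int) ![]int {
	mut idx := []int{cap: k}
	for _ in 0 .. k {
		idx << rand.intn(n)!
	}
	return idx
}

// sample_without_replacement draws k distinct indices from 0..n using a
// partial Fisher-Yates shuffle. Also hypothetical.
fn sample_without_replacement(n int, k int) ![]int {
	if k > n {
		return error('k must not exceed n')
	}
	mut pool := []int{len: n}
	for i in 0 .. n {
		pool[i] = i
	}
	for i in 0 .. k {
		// pick a position in the not-yet-fixed tail and swap it into place
		j := i + rand.intn(n - i)!
		tmp := pool[i]
		pool[i] = pool[j]
		pool[j] = tmp
	}
	return pool[..k].clone()
}

fn main() {
	with_rep := sample_with_replacement(3, 6) or { panic(err) }
	without_rep := sample_without_replacement(3, 3) or { panic(err) }
	println(with_rep) // six indices in 0..3, so at least one value repeats
	println(without_rep) // some permutation of 0, 1, 2
}
```

An index list produced this way could then be used to build the per-tree y passed to set_y, assuming that is how the tree consumes it.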

I can't figure out how to do this yet and maintain a consistent interface using Data like the rest of VSL. I'm sure there's a good way, and maybe calling set_y multiple times for each tree will be ok.

I'm also busy with family and renovations on the house at the moment, it might be better if someone takes this on and I can consult or something. I'm happy to do it, it just won't be quick.

dumblob commented 1 year ago

Was just looking for Cox Regression and Random Forest in VSL which brought me here.

I wonder if there are any plans for Cox R. and perhaps a few other from https://github.com/shankarpandala/lazypredict .

Also it seems VSL so far does not support "stop & resume" operation acutely needed for fully automated "checkpointing to HDD & recovery from HDD" in long-running apps (which often fail due to full memory, stall, etc. and need to be restarted paying the tens of hours of identical computation again and again...).

Any plans for such "stop & resume" API?

Of course, it has to be weighed against performance, so maybe it could be tied to time: by default, roughly every 10 seconds the computation would be interrupted and saved to a user-defined location. IDK
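
A rough sketch of what such time-based "stop & resume" could look like, purely as an illustration. `State`, `save_checkpoint`, `load_checkpoint` and `run` are hypothetical names, not vsl API, and JSON stands in for whatever on-disk format would actually be chosen. The write-then-rename step assumes a POSIX system, where rename replaces an existing target.

```v
import json
import os
import time

// State is a hypothetical stand-in for whatever a long-running computation
// needs in order to resume; it is not a vsl type.
struct State {
mut:
	iteration int
	total     f64
}

// save_checkpoint writes to a temporary file and then renames it, so a crash
// in the middle of a write does not corrupt the previous checkpoint.
fn save_checkpoint(path string, s State) ! {
	tmp := path + '.tmp'
	os.write_file(tmp, json.encode(s))!
	os.rename(tmp, path)!
}

// load_checkpoint returns the saved state, or a fresh one if no file exists.
fn load_checkpoint(path string) !State {
	if !os.exists(path) {
		return State{}
	}
	return json.decode(State, os.read_file(path)!)!
}

// run resumes from `path` if a checkpoint exists and saves again whenever
// `interval` has elapsed since the last save.
fn run(path string, max_iter int, interval time.Duration) !State {
	mut s := load_checkpoint(path)!
	mut last_save := time.now()
	for s.iteration < max_iter {
		s.total += f64(s.iteration) // placeholder for the real work
		s.iteration++
		if time.since(last_save) >= interval {
			save_checkpoint(path, s)!
			last_save = time.now()
		}
	}
	save_checkpoint(path, s)!
	return s
}

fn main() {
	s := run('checkpoint.json', 1_000_000, 10 * time.second) or { panic(err) }
	println(s.iteration)
}
```

Checking the clock once per iteration keeps the overhead small relative to real work; if the process is killed, the next call to `run` with the same path picks up from the last saved iteration.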

ulises-jeremias commented 1 year ago

hey! don't rush with it. Family is more important 😊

About the question, I think calling set_y multiple times is OK as long as the .clone() method is used 👌🏻

ulises-jeremias commented 1 year ago

lazypredict is great! We will probably add more models over time 👌🏻

Regarding the checkpointing, I hadn't thought about it. We can probably add it in the near future. I will think about it and try to figure out the best way to do it, probably by creating .h5 files on some iterations.

dumblob commented 1 year ago

Yep, .h5 is fine. Maybe, to avoid slowing down the computation, we could just fork the process (i.e. delegate COW of all the structs with data to the operating system, as e.g. Redis does), so that it takes a negligible time, and then save it to disk. The data might easily be hundreds of MB or more, so not doing it fully in parallel could slow down the computation too much (and V's threading support is probably not enough, as it would involve memcpy(), which would definitely be much slower than COW over the pages the operating system maintains under the hood). Just a thought.
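
As a very rough sketch of the fork idea, assuming a POSIX system (V's `os.fork` is not supported on Windows): the child gets a copy-on-write view of memory as of the fork and writes it out, while the parent continues. `snapshot_in_child` is a hypothetical helper, and a plain text file stands in for an .h5 file.

```v
import os

// snapshot_in_child forks the process and lets the child write `state` to
// `path` while the parent returns immediately with the child's pid.
// Hypothetical helper, not part of vsl; POSIX only.
fn snapshot_in_child(path string, state []f64) !int {
	pid := os.fork()
	if pid < 0 {
		return error('fork failed')
	}
	if pid == 0 {
		// Child: its view of `state` is frozen at fork time. The OS copies
		// pages lazily, only when the parent later modifies them.
		os.write_file(path, state.map(it.str()).join('\n')) or { exit(1) }
		exit(0)
	}
	return pid
}

fn main() {
	mut state := []f64{len: 1000, init: 1.0}
	snapshot_in_child('snapshot.txt', state) or { panic(err) }
	state[0] = 2.0 // the parent keeps computing; the child's snapshot is unaffected
	os.wait() // reap the child before exiting
}
```

A real implementation would also need to decide what happens if a snapshot is still being written when the next one is due, which this sketch ignores.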

dumblob commented 1 year ago

I wonder if there is any news regarding Cox Regression, Random Forest, and .h5 checkpointing. I could not find anything in the commits.

But no pressure, I just want to stay up to date 😉.

dumblob commented 11 months ago

Any news? Especially the checkpointing seems highly beneficial to everybody (compared to Cox Regression and Random Forest).

dumblob commented 4 months ago

Still interested in this, as it would allow me to start recommending V (VSL) within my bubble 😉.