ulises-jeremias opened 1 year ago
I've been making some progress on this in my own fork. Not sure how to "share" the data object between the random forest &Data[f64] and individual trees yet though.
@BMJHayward feel free to send me a Draft Pull Request or the link to the branch where you have your changes so I can suggest how to do it 😊
Thanks @ulises-jeremias here's my current branch, not quite ready for PR: https://github.com/BMJHayward/vsl/tree/126_implement_random_forest
I thought to make a proper decision tree implementation and use that in the RF as well, but using the `Data` interface makes it tricky, as I can't just use `Data.x` and `Data.y`. Or can I? Each tree uses several different indexes.
Yeah, I think that is not enough. `Data.x` and `Data.y` should be used only to replace the data you were receiving here. You will probably need another struct, or multiple instances of `Data`, but the latter is probably less efficient.
I just updated latest master adding some methods.

There are two ways you can do this now:

* creating a new instance of data, sharing the reference to `x` and setting the new `y`:

```v
mut new_data := data.clone_with_same_x()
new_data.set_y(new_index_y)?
```

* or, if you want `x` to be a new instance of `la.Matrix`, you can have multiple instances of `Data` by doing `data.clone()` and then setting `y` with the new index:

```v
mut data_with_new_index := data.clone()
data_with_new_index.set_y(new_index_y)?
```
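Putting the first option together for a random forest, a per-tree setup could look roughly like this (a sketch only: `n_trees` and `bootstrap_index()` are hypothetical placeholders, and the `clone_with_same_x`/`set_y` calls are assumed to behave as described above):

```v
// one lightweight Data view per tree, all sharing the same x matrix;
// only the y index differs between trees
mut tree_data := []&ml.Data[f64]{}
for _ in 0 .. n_trees {
	mut d := data.clone_with_same_x()
	d.set_y(bootstrap_index())? // bootstrap_index() is a hypothetical sampler
	tree_data << d
}
```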
@BMJHayward ^^
Excellent, thank you. I'll take a look over the weekend.
@BMJHayward hey! did that work? is there anything else I can do to help?
@ulises-jeremias hi, thanks for following up on this. The lynchpin for me is in `ml.tree.grow_tree`. In line 155 the tree is "grown" by randomly selecting columns, or rather their indexes, and splitting based on them. The `rand` module samples without replacement, so I think I need to add an option there to sample with replacement, so that e.g. a tree can use columns `[1, 2, 3, 1, 2, 3]`, and it would be perfectly legitimate to use them twice.
I can't figure out how to do this yet while maintaining a consistent interface using `Data` like the rest of VSL. I'm sure there's a good way, and maybe calling `set_y` multiple times for each tree will be OK.
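For the with-replacement part, a minimal sketch of a column sampler could be (assuming V's `rand.intn`; the helper name is made up):

```v
import rand

// sample `k` column indexes with replacement: duplicates such as
// [1, 2, 3, 1, 2, 3] are allowed, unlike shuffle-based sampling
fn sample_columns_with_replacement(n_cols int, k int) ![]int {
	mut idx := []int{cap: k}
	for _ in 0 .. k {
		idx << rand.intn(n_cols)! // uniform in [0, n_cols)
	}
	return idx
}
```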
I'm also busy with family and renovations on the house at the moment, it might be better if someone takes this on and I can consult or something. I'm happy to do it, it just won't be quick.
Was just looking for Cox Regression and Random Forest in VSL which brought me here.
I wonder if there are any plans for Cox R. and perhaps a few other from https://github.com/shankarpandala/lazypredict .
Also, it seems VSL so far does not support "stop & resume" operation, which is acutely needed for fully automated "checkpointing to HDD & recovery from HDD" in long-running apps (which often fail due to full memory, stalls, etc. and need to be restarted, paying the tens of hours of identical computation again and again...).
Any plans for such a "stop & resume" API?
Of course, it has to be weighed against performance, so maybe it could be tied to time: approximately every 10 seconds by default, the computation would be interrupted and saved to a user-defined location. IDK
hey! don't rush with it. Family is more important 😊
About the question, I think calling `set_y` multiple times is OK as long as the `.clone()` method is used 👌🏻
lazypredict is great! We will probably add more models over time 👌🏻
Regarding the checkpointing, I hadn't thought about it. We can probably add it in the near future. I'll think about it and try to figure out the best way to do it. Probably creating .h5 files on some iterations.
Yep, .h5 is fine. Maybe, to avoid slowing down the computation, we could just fork the process (i.e. delegate COW of all the structs holding data to the operating system, as e.g. Redis does), which takes negligible time, and then save it to disk. The data might easily be hundreds of MB or more, so not doing it fully in parallel could slow down the computation too much (and V's threading support is probably not enough, as it would involve `memcpy()`, which would be definitely much slower than the COW over pages the operating system maintains under the hood). Just a thought.
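As a rough illustration of the time-based policy (every name here is hypothetical; a real implementation would hook into the training loop and serialize to .h5):

```v
import time

// call maybe_checkpoint from the training loop; it only pays the
// serialization cost when `interval` has elapsed since the last save
struct Checkpointer {
mut:
	last     time.Time
	interval time.Duration
}

fn (mut c Checkpointer) maybe_checkpoint(save fn ()) {
	now := time.now()
	if now - c.last >= c.interval {
		save() // user-supplied serializer, e.g. writing an .h5 snapshot
		c.last = now
	}
}
```

It could be initialized e.g. as `mut cp := Checkpointer{last: time.now(), interval: 10 * time.second}` to match the ~10-second default suggested above.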
I wonder if there is any news regarding Cox Regression, Random Forest, and .h5 checkpointing. I could not find anything in the commits.
But no pressure, I just want to regularly get up to date :wink:.
Any news? Especially the checkpointing seems highly beneficial to everybody (compared to Cox Regression and Random Forest).
Still interested in this, to allow me to start recommending V (VSL) within my bubble :wink:.
Describe the feature
We want to create a new model in `vsl.ml` to do classification using the Random Forest algorithm. The model should follow the following interfaces, with the following methods:
Use Case
-
Proposed Solution
-
Other Information
-
Acknowledgements
Version used
-
Environment details (OS name and version, etc.)
-