Closed z6833 closed 1 year ago
Yes, in DataFrame the apply functionality is achieved through the visitor call.
First read the documentation, especially the visitors section here: https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/DataFrame.html Also, read: https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/visit.html https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/single_act_visit.html
To see visitor examples look at this source file https://github.com/hosseinmoein/DataFrame/blob/master/include/DataFrame/DataFrameStatsVisitors.h
Basically, visitors can currently operate on up to 5 columns at the same time. There are two kinds of visitors: regular visitors and single action visitors. You probably want to follow the single action visitor paradigm. They are all explained in the docs and examples.
Thanks so much. I can understand the examples, but I still cannot write my own custom visitor because my C++ skills are poor. I am new here.
-----------------Another question:
I used `get_data`, passing the columns in two different ways, to get data like `dataframe[["a", "b", "c"]]` in pandas.
First:
vector<const char *> cols;
cols = {"a", "b", "c"};
get_data<string, int>(cols);
With that I can get the data properly. But when I build the column list like this:
vector<const char *> cols;
auto columns = XXX.get_columns_info<string, int>();
for (auto c: columns)
{
cols.push_back(get<0>(c).c_str());
}
I cannot get the data frame, which is puzzling.
Regarding the visitors, you just have to follow the docs above and look at this example:
https://github.com/hosseinmoein/DataFrame/blob/master/include/DataFrame/DataFrameStatsVisitors.h#L1778
Just follow its interface. Parts of the interface are in #defines such as DEFINE_PRE_POST and DEFINE_RESULT.
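To make the shape of that interface concrete, here is a minimal sketch of a visitor-style functor. It assumes the regular-visitor pattern described in the docs: `pre()` is called once, `operator()` once per row with the index and the value, then `post()`, and the answer is read through `get_result()`. The names `SumOfSquaresVisitor` and `visit_column` are made up for illustration, and `visit_column` is only a stand-in for what the library's visit call does, so the sketch compiles without DataFrame. It writes the members out by hand instead of using the DEFINE_PRE_POST and DEFINE_RESULT macros.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Visitor-shaped functor (hypothetical example): accumulates the sum of
// squares of one column. T is the column type, I is the index type.
template<typename T, typename I = unsigned long>
struct SumOfSquaresVisitor {
    using value_type = T;
    using index_type = I;
    using result_type = T;

    // Called once before the first row: reset state so the visitor is reusable.
    void pre() { result_ = T{}; }

    // Called once per row with that row's index and value.
    void operator()(const index_type &, const value_type &val) {
        result_ += val * val;
    }

    // Called once after the last row: nothing to finalize here.
    void post() {}

    const result_type &get_result() const { return result_; }

private:
    result_type result_{};
};

// Hypothetical stand-in for the library's visit call on a single column.
template<typename I, typename T, typename V>
V &visit_column(const std::vector<I> &index,
                const std::vector<T> &column,
                V &visitor) {
    assert(index.size() == column.size());
    visitor.pre();
    for (std::size_t i = 0; i < column.size(); ++i)
        visitor(index[i], column[i]);
    visitor.post();
    return visitor;
}
```

With the real library, a functor of this shape would be handed to the frame's visit call for a named column instead of `visit_column`; the visit documentation linked above has the exact signature.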
Regarding the get_data, I am not sure I follow your issue. I suggest reading the docs with code samples here https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/get_data.html
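One possible cause of the second snippet failing, offered as a guess from the code shown: `for (auto c: columns)` copies each tuple, so `get<0>(c).c_str()` points into a temporary copy that is destroyed at the end of each iteration, and `cols` ends up holding dangling pointers. This assumes the column name type owns its characters, as `std::string` does. The sketch below uses a simplified stand-in tuple type, `ColumnInfo`, instead of the library's real return type, and shows the by-reference loop that keeps the pointers valid.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical, simplified stand-in for the per-column tuples; the library's
// real tuple has more fields and may use a different name type.
using ColumnInfo = std::tuple<std::string, std::size_t>;

// Collect name pointers that stay valid for as long as `columns` is alive
// and unmodified.
std::vector<const char *>
column_name_ptrs(const std::vector<ColumnInfo> &columns) {
    std::vector<const char *> cols;
    cols.reserve(columns.size());
    // `const auto &` is the important part. `for (auto c : columns)` would
    // copy each tuple, and c_str() of that copy dangles once the iteration
    // ends.
    for (const auto &c : columns)
        cols.push_back(std::get<0>(c).c_str());
    assert(cols.size() == columns.size());
    return cols;
}
```

The returned pointers borrow from `columns`, so `columns` must outlive every use of `cols`, including the `get_data` call.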
Sorry to bother you again. I have run into a puzzling problem when I use multiple threads to do the work with this code:
vector<thread> threads;
SpinLock lock;
MyFrame::set_lock(&lock);
for (auto i: iiii)
{
for (auto j: jjjj)
{
threads.push_back(thread(do_work, param0, param1));
}
}
for (int i = 0; i < threads.size(); ++i)
threads[i].join();
MyFrame::remove_lock();
It is more than 100 times slower than using a single thread with
for (auto i: iiii)
{
for (auto j: jjjj)
{
do_work(param0, param1);
}
}
and the `do_work` function contains `load_data` and other operations, like the `do_work` in `dataframe_tester.cc`. I wrote the code following `test_thread_safety`. I do not know why.
That is not surprising at all. Running with threads adds another loop on top of the already double-loop logic. The purpose of _dataframe_threadsafety.cc is not to show that it is faster but to show that the DataFrame thread safety mechanism works.
Threading is not always a faster way to go. It depends on what you are doing, your software and hardware configuration, ... Blindly creating 100 threads on a two-core machine will choke the performance for sure.
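As an illustration of the point about thread counts, here is a minimal sketch that runs a set of independent tasks on a fixed number of threads instead of one thread per task. `run_bounded` is a hypothetical helper, not part of DataFrame, and it assumes the tasks do not share mutable state. Whether this is actually faster than a plain loop still depends on the workload.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Run task(i) for every i in [0, n_tasks) on at most `max_threads` threads.
// Each thread walks a strided slice of the indices, so the thread count
// stays fixed no matter how many tasks there are.
void run_bounded(std::size_t n_tasks,
                 const std::function<void(std::size_t)> &task,
                 std::size_t max_threads = std::thread::hardware_concurrency()) {
    assert(static_cast<bool>(task));
    // hardware_concurrency() may return 0 when the value is unknown, so
    // always fall back to at least one thread.
    const std::size_t n_threads =
        std::max<std::size_t>(1, std::min(max_threads, n_tasks));

    std::vector<std::thread> threads;
    threads.reserve(n_threads);
    for (std::size_t t = 0; t < n_threads; ++t)
        threads.emplace_back([&task, t, n_threads, n_tasks] {
            for (std::size_t i = t; i < n_tasks; i += n_threads)
                task(i);
        });
    for (auto &th : threads)
        th.join();
}
```

The double loop over `iiii` and `jjjj` could be flattened into one task index and passed through such a helper, so that only a handful of threads are ever created.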
I tried to use `ThreadPool` to control the number of threads, but it did not seem to work, with this code:
SpinLock lock;
ThreadPool pool(4);
MyFrame::set_lock(&lock);
for (auto i: iiii)
{
for (auto j: jjjj)
{
pool.enqueue(do_task, ..., ...);
}
}
MyFrame::remove_lock();
`ThreadPool.h` is from https://github.com/progschj/ThreadPool.
I have also tried to use `set_thread_level` with `MyFrame`; it does not seem to make any difference.
I also tried to limit the number of threads inside the task in the following way:
```c++
void do_task()
{
    int cores = 4;
    if (processor > cores) {
        unique_lock<mutex> lck(m);
        while (processor > cores) {
            condition.wait(lck);
        }
    }
    processor += 1;
    do_something();
    processor -= 1;
    condition.notify_all();
}
```
It aborts with exit code 139. Are there any more examples of multithreading with DataFrame?
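In the `do_task` attempt above, `processor` is read and modified outside the lock, which is a data race, and the `>` comparison lets one more thread through than `cores`. Whether that is what produces exit code 139 cannot be told from the snippet alone. As a hedged sketch of the same idea done under the lock, here is a small counting gate built only on the standard library. `ConcurrencyGate`, `GateGuard`, and `peak_concurrency_demo` are hypothetical names and are not part of DataFrame or ThreadPool.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal counting gate: at most `limit` holders at any time.
class ConcurrencyGate {
public:
    explicit ConcurrencyGate(int limit) : free_slots_(limit) {
        assert(limit > 0);
    }

    void acquire() {
        std::unique_lock<std::mutex> lck(m_);
        // The predicate is re-checked under the lock after every wake-up.
        cv_.wait(lck, [this] { return free_slots_ > 0; });
        --free_slots_;
    }

    void release() {
        {
            std::lock_guard<std::mutex> lck(m_);
            ++free_slots_;
        }
        cv_.notify_one();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int free_slots_;
};

// RAII helper so the slot is returned even if the work throws.
class GateGuard {
public:
    explicit GateGuard(ConcurrencyGate &g) : gate_(g) { gate_.acquire(); }
    ~GateGuard() { gate_.release(); }
    GateGuard(const GateGuard &) = delete;
    GateGuard &operator=(const GateGuard &) = delete;

private:
    ConcurrencyGate &gate_;
};

// Demo: start n_threads threads that all pass through the gate, and report
// the highest number of threads observed inside it at once.
int peak_concurrency_demo(int n_threads, int limit) {
    ConcurrencyGate gate(limit);
    std::atomic<int> active{0};
    std::atomic<int> peak{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < n_threads; ++i)
        threads.emplace_back([&gate, &active, &peak] {
            GateGuard guard(gate);
            const int now = ++active;
            int prev = peak.load();
            while (prev < now && !peak.compare_exchange_weak(prev, now)) {}
            std::this_thread::yield();  // stand-in for do_something()
            --active;
        });
    for (auto &t : threads)
        t.join();
    return peak.load();
}
```

This still creates one thread per task and only limits how many run the guarded section at once, so it does not remove the thread creation overhead mentioned earlier in the thread.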
> ```c++
> vector<thread> threads;
> SpinLock lock;
> MyFrame::set_lock(&lock);
> for (auto i: iiii)
> {
> for (auto j: jjjj)
> {
> threads.push_back(thread(do_work, param0, param1));
> }
> }
> for (int i = 0; i < threads.size(); ++i)
> threads[i].join();
> MyFrame::remove_lock();
> ```
It seems to me you do not fully grasp the concept of multithreaded computing. Although multithreading is very common these days, taking full advantage of it requires advanced experience in software engineering and computer science. I cannot teach you that here. All I can tell you is anecdotal advice:
Thank you sincerely for your kind advice. Yes, single-threaded performance is high, and it meets my performance goal. But I have a business scenario in which someone may call my function `do_work` from a sub-thread. So I have to make sure that `do_work`, with its DataFrame operations, does not abort when it is called from a sub-thread.
In that case the most efficient way is not to use set_lock().
Just have a global mutex that protects do_work().
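A minimal sketch of that suggestion, assuming `do_work` is a free function and that all callers go through it. The body of `do_work` and the `shared_total` variable are placeholders standing in for the real DataFrame operations, and `run_from_threads` is a hypothetical caller added only to exercise it.

```cpp
#include <mutex>
#include <thread>
#include <vector>

namespace {
std::mutex do_work_mutex;  // one process-wide lock shared by all callers
int shared_total = 0;      // placeholder for the shared DataFrame state
}  // namespace

// The whole body runs under the global mutex, so calls from different
// threads are serialized and never touch the shared state concurrently.
void do_work(int param0, int param1) {
    std::lock_guard<std::mutex> guard(do_work_mutex);
    shared_total += param0 * param1;  // placeholder for the real work
}

// Hypothetical caller: several sub-threads invoke do_work at the same time.
int run_from_threads(int n_threads) {
    std::vector<std::thread> threads;
    threads.reserve(static_cast<std::size_t>(n_threads));
    for (int i = 0; i < n_threads; ++i)
        threads.emplace_back(do_work, i, 2);
    for (auto &t : threads)
        t.join();
    std::lock_guard<std::mutex> guard(do_work_mutex);
    return shared_total;
}
```

The trade-off is that concurrent callers simply wait for each other, which is acceptable here because the goal stated above is correctness when called from a sub-thread, not a speed-up.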
I want to implement functionality like `.apply` in pandas, for instance `dataframe.apply(func, axis=...)`. Maybe a custom visitor with the `visit()` function can achieve it, but I just do not know how to implement my own visitor. Could you give me some advice?