hosseinmoein / DataFrame

C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage
https://hosseinmoein.github.io/DataFrame/
BSD 3-Clause "New" or "Revised" License

How to Use .apply with special function? #203

Closed z6833 closed 1 year ago

z6833 commented 1 year ago

I want to implement something like .apply in pandas, for instance dataframe.apply(func, axis=...). Maybe a custom visitor with visit() can achieve it, but I do not know how to implement my own visitor. Could you give me some advice?

hosseinmoein commented 1 year ago

Yes, in DataFrame the apply functionality is achieved through the visitor calls.

First read the documentation, especially the visitors section here:
https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/DataFrame.html
Also, read:
https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/visit.html
https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/single_act_visit.html

To see visitor examples look at this source file https://github.com/hosseinmoein/DataFrame/blob/master/include/DataFrame/DataFrameStatsVisitors.h

Basically, visitors currently can operate on up to 5 columns at the same time. There are two kinds of visitors; regular visitors and single action visitors. You probably want to follow the single action visitors paradigm. They are all explained in the docs and examples.

z6833 commented 1 year ago

Thanks so much. I can understand the examples, but I still cannot write my own custom visitor because my C++ skills are poor. I am new here.

Another question:

I used get_data, passing the columns in two different ways, to get data like dataframe[["a", "b", "c"]] in pandas. First:

vector<const char *> cols;
cols = {"a", "b", "c"};
get_data<string, int>(cols);

With this I get the data properly. But when I build the column list like this:

vector<const char *> cols;
auto columns = XXX.get_columns_info<string, int>();
for (auto c: columns)
{
   cols.push_back(get<0>(c).c_str());
}

I cannot get the data frame. It is puzzling.

hosseinmoein commented 1 year ago

Regarding the visitors, you just have to follow the docs above and look at this example: https://github.com/hosseinmoein/DataFrame/blob/master/include/DataFrame/DataFrameStatsVisitors.h#L1778
Just follow its interface. Parts of the interface are in #defines such as DEFINE_PRE_POST and DEFINE_RESULT.

Regarding the get_data, I am not sure I follow your issue. I suggest reading the docs with code samples here https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/get_data.html

z6833 commented 1 year ago

Sorry to bother you again. I have run into a puzzling problem when I use multiple threads to do the work with this code:

vector<thread> threads;
SpinLock lock;
MyFrame::set_lock(&lock);
for (auto i: iiii)
{
    for (auto j: jjjj)
    {
        threads.push_back(thread(do_work, param0, param1));
    }
}
for (size_t i = 0; i < threads.size(); ++i)
    threads[i].join();
MyFrame::remove_lock();

It is more than 100 times slower than using a single thread with:

for (auto i: iiii)
{
    for (auto j: jjjj)
    {
        do_work(param0, param1);
    }
}

The do_work function contains load_data and other operations, like do_work in dataframe_tester.cc. I wrote the code just like test_thread_safety. I do not know why it is so slow.

hosseinmoein commented 1 year ago

That is not surprising at all. Running with threads adds another loop on top of the already double-loop logic. The purpose of dataframe_thread_safety.cc is not to show that threading is faster but to show that the DataFrame thread-safety mechanism works.

Threading is not always the faster way to go. It depends on what you are doing, your software and hardware configuration, and so on. Blindly creating 100 threads on a two-core machine will choke the performance for sure.

z6833 commented 1 year ago

I tried to use a ThreadPool to control the number of threads, but it did not seem to work, with this code:


SpinLock lock;
ThreadPool pool(4);
MyFrame::set_lock(&lock);
for (auto i: iiii)
{
    for (auto j: jjjj)
    {
        pool.enqueue(do_task, ..., ...);
    }
}
MyFrame::remove_lock();

ThreadPool.h is from https://github.com/progschj/ThreadPool.

I have also tried to use set_thread_level with MyFrame; it does not seem to make any difference.

I also tried to limit the number of threads with:


void do_task()
{
    int cores = 4;

    if (processor > cores){
        unique_lock<mutex> lck(m);
        while (processor > cores) {
            condition.wait(lck);
        }
    }
    processor += 1;

    do_something(  );

    processor -= 1;
    condition.notify_all();
}
I used it in the following way. It aborts with exit code 139. Are there any more multithreading examples in DataFrame?

> ```c++
> vector<thread> threads;
> SpinLock lock;
> MyFrame::set_lock(&lock);
> for (auto i: iiii)
> {
>      for (auto j: jjjj)
>       {
>              threads.push_back( thread(do_work, param0, param1));
>       }
> }
> for (int i = 0; i < threads.size(); ++i)
>      threads[i].join();
> MyFrame::remove_lock();
> ```
hosseinmoein commented 1 year ago

It seems to me you do not fully grasp the concept of multithreaded computing. Although multithreading is very common these days, taking full advantage of it requires advanced experience in software engineering and computer science. I cannot teach you that here. All I can give you is anecdotal advice:

  1. Not every problem lends itself to being solved efficiently through multiple threads.
  2. As I said above, multithreading is tightly related to your software (OS) and hardware setup.
  3. As I said above, the examples in DataFrame are not good examples of how to solve a problem through multiple threads. They are there to unit test the spinlock protection mechanism. For example, in dataframe_thread_safety.cc, when you use multiple threads you are making the problem that much bigger. If you use 3 threads, you have made the problem 3x bigger. If you use 100 threads, you have made the problem 100x bigger. So, of course they will be slow.
  4. set_thread_level() only applies to certain algorithms in DataFrame. It doesn't do anything in the general use case.
  5. And finally and most importantly, you are going about this the wrong way. First, study software engineering and computer science. Second, identify a problem you want to solve. Third, determine whether the problem you want to solve can be solved more efficiently through multiple threads; the majority of problems are more efficiently solved through a single thread. Fourth, carefully divide your problem into multiple independent threads that require minimum synchronization. Fifth, check your hardware and software setup to determine the optimal number of threads.
z6833 commented 1 year ago

Thank you sincerely for your kind advice. Yes, the single-threaded version performs well and meets my performance goal. But I have a business scenario in which someone may call my function do_work from a sub-thread. So I have to make sure that do_work, with its DataFrame operations, does not abort when it is called from a sub-thread.

hosseinmoein commented 1 year ago

In that case the most efficient way is not to use set_lock(). Just have a global mutex that protects do_work().