siboehm / lleaves

Compiler for LightGBM gradient-boosted trees, based on LLVM. Speeds up prediction by ≥10x.
https://lleaves.readthedocs.io/en/latest/
MIT License

Does this cause a core dump? #16

Closed · chenglin closed this issue 2 years ago

chenglin commented 2 years ago

Recently, I found that one of my models causes a core dump when I use lleaves for prediction.

I am confused by the two functions below.

In codegen.py, a tree function's parameter type can be int* if the corresponding feature is categorical:

def make_tree(tree):
    # declare the function for this tree
    func_dtypes = (INT_CAT if f.is_categorical else DOUBLE for f in tree.features)
    scalar_func_t = ir.FunctionType(DOUBLE, func_dtypes)
    tree_func = ir.Function(module, scalar_func_t, name=str(tree))
    tree_func.linkage = "private"
    # populate function with IR
    gen_tree(tree, tree_func)
    return LTree(llvm_function=tree_func, class_id=tree.class_id)

But in data_processing.py, which predict uses, all feature parameters are converted to double*:

def ndarray_to_ptr(data: np.ndarray):
    """
    Takes a 2D numpy array, converts to float64 if necessary and returns a pointer

    :param data: 2D numpy array. Copying is avoided if possible.
    :return: pointer to 1D array of dtype float64.
    """
    # ravel makes sure we get a contiguous array in memory and not some strided View
    data = data.astype(np.float64, copy=False, casting="same_kind").ravel()
    ptr = data.ctypes.data_as(POINTER(c_double))
    return ptr
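
For reference, a minimal usage sketch of this helper (assuming the ndarray_to_ptr above is in scope; the input data is made up):

import numpy as np

arr = np.array([[1, 2], [3, 4]], dtype=np.int32)  # e.g. integer-coded categoricals
ptr = ndarray_to_ptr(arr)  # int32 is a safe cast to float64, then flattened
print(ptr[0], ptr[3])  # 1.0 4.0 -- every element is read back as a C double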

Is this just like the following, where a double* is passed to a function expecting an int*?

int* predict(int* a, double* b);
double a = 1.1;
double b = 2.2;
predict(&a, &b);  /* &a is a double*, but the first parameter expects int* */

Does this happen in lleaves?

siboehm commented 2 years ago

TLDR: It's possible that there's a bug that causes a segfault, though it's unlikely that this is happening in the parts of the code you're pointing to.

For diagnosing the segfault: Could you run a minimal reproducing example under gdb to see which instruction triggers the segfault? There used to be an issue with overflows for very large datasets, but I fixed that a few months ago. If there's any way you can put together a self-contained, minimal reproducible sample and send it to me (email is fine), I'd love to help you out.
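
If gdb is inconvenient, the standard-library faulthandler module can also help localize a native crash from Python (a minimal sketch; the model path and input shape are placeholders):

import faulthandler
faulthandler.enable()  # print the Python stack if the process segfaults

import numpy as np
import lleaves

llvm_model = lleaves.Model(model_file="model.txt")  # placeholder: the crashing model
llvm_model.compile()
n_features = 10  # placeholder: set to your model's feature count
preds = llvm_model.predict(np.random.rand(100, n_features))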

Regarding the categorical data: The relevant function is actually this one: https://github.com/siboehm/lleaves/blob/9784625d8503c02e2679fafefb41c469b345566d/lleaves/compiler/codegen/codegen.py#L42 This is the function in the binary that lleaves calls from Python (using two double pointers). The categorical features are then cast to ints in the core loop here: https://github.com/siboehm/lleaves/blob/9784625d8503c02e2679fafefb41c469b345566d/lleaves/compiler/codegen/codegen.py#L205 Most of the processing of the Pandas DataFrames follows LightGBM very closely. This double-to-int casting is a bit strange, but I wanted to stay as close to LightGBM as possible. It works because LightGBM doesn't allow categorical values larger than 2^31 - 1 (the max int32), and a double can represent any integer up to 2^53 without loss of precision.
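
For illustration, the lossless round trip is easy to check (a small standalone snippet, not lleaves code):

import numpy as np

cat = np.int32(2**31 - 1)    # largest categorical value LightGBM allows
as_double = np.float64(cat)  # how the value arrives through the double* input
assert np.int64(as_double) == cat  # cast back without loss, since 2^31 - 1 < 2^53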

chenglin commented 2 years ago

I found that if the categorical features are numerical values, we can skip the line df[categorical_feature] = df[categorical_feature].astype('category') when preparing the training data, and instead just call the LightGBM train function with the parameter categorical_feature=categorical_feature. In a model file trained this way, pandas_categorical is null. Could this issue be related to that?
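
For example, the difference between the two training setups (a minimal sketch; feature names and data are made up):

import lightgbm as lgb
import numpy as np
import pandas as pd

df = pd.DataFrame({"cat_feat": np.random.randint(0, 10, 100),
                   "num_feat": np.random.rand(100)})
y = np.random.rand(100)

# Variant A: integer-coded column, declared categorical only via the param.
# The saved model file then contains "pandas_categorical:null".
train_a = lgb.Dataset(df, label=y, categorical_feature=["cat_feat"])
lgb.train({"objective": "regression"}, train_a, num_boost_round=5).save_model("model_a.txt")

# Variant B: convert the column to pandas category dtype first.
# The saved model file then stores the category mapping.
df_b = df.copy()
df_b["cat_feat"] = df_b["cat_feat"].astype("category")
train_b = lgb.Dataset(df_b, label=y, categorical_feature=["cat_feat"])
lgb.train({"objective": "regression"}, train_b, num_boost_round=5).save_model("model_b.txt")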

When I retrained the model so that pandas_categorical is not null, the core dump disappeared.

PR: return empty list if pandas_categorical is null in model file. BTW, I think we should keep pandas_categorical = None when the model file contains pandas_categorical: null.

siboehm commented 2 years ago

I'm having trouble understanding this issue. Could you write up a minimally reproducible example of the core dump / send me the model.txt that causes it?