sergioburdisso / pyss3

A Python package implementing a new interpretable machine learning model for text classification (with visualization tools for Explainable AI :octocat:)
https://pyss3.readthedocs.io
MIT License

Data loading issues while train #16

Closed Practcdi closed 2 years ago

Practcdi commented 2 years ago

Hey,

Note: I have a pandas DataFrame with 2 columns:

1) Text
2) Label


from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data,
                                                    y_data,
                                                    test_size=0.2,
                                                    shuffle=False)

The train() and fit() methods are not working.

Here is the reference code:

[image]

How to fix it?

Thanks

angrymeir commented 2 years ago

Hey @Practcdi,

TL;DR

This happens because fit/train requires a list of strings instead of a DataFrame. (See the function documentation here.)

Fix: pass x_train.values.tolist(), y_train to clf.train()

A bit more insight into why it does not work:

Following the respective code lines (here):

x_train, y_train = list(x_train), list(y_train)

if len(x_train) != len(y_train):
    raise ValueError("`x_train` and `y_train` must have the same length")

If you pass a DataFrame of shape (535544, 1) as x_train, casting it to a list returns only the column names. The check therefore compares the following:

if 1 != 535544:
    raise ValueError("`x_train` and `y_train` must have the same length")
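
The behaviour described above can be reproduced on a toy DataFrame. This is only a sketch: the three example rows are made up, and the column names `Text` and `Label` are taken from the question. It also shows that, on a single-column DataFrame, `.values.tolist()` yields one nested list per row, so selecting the column as a Series first may be closer to the flat list of strings that fit/train expects.

```python
import pandas as pd

# Toy data standing in for the real (535544, 1) DataFrame; rows are invented.
df = pd.DataFrame({
    "Text": ["good movie", "bad movie", "great plot"],
    "Label": ["pos", "neg", "pos"],
})

x_frame = df[["Text"]]  # single-column DataFrame, shape (3, 1)
y = df["Label"]         # Series of length 3

# Iterating over a DataFrame yields its column labels, not its rows.
as_list = list(x_frame)
print(as_list, len(as_list), len(list(y)))

# Selecting the column gives a Series, which converts to a flat list of strings.
docs = df["Text"].tolist()
print(docs)

# On a single-column DataFrame, .values.tolist() gives one inner list per row.
nested = x_frame.values.tolist()
print(nested)
```

With the flat `docs` list, the length check quoted above compares 3 with 3 and passes.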

Practcdi commented 2 years ago

Thanks a lot 😊

sergioburdisso commented 2 years ago

@Practcdi Thanks for sharing this issue with us!

@angrymeir Thanks for taking care of it :muscle:. By the way, what do you think of adding an extra check at the beginning of fit/train that raises a ValueError saying something like "the x_train argument is expected to be a list of strings" when the provided x_train isn't a list of strings? :thinking:
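
A minimal sketch of what such a check could look like. The helper name `check_list_of_strings` is hypothetical and not part of PySS3; only the error message follows the wording suggested above.

```python
def check_list_of_strings(x_train, name="x_train"):
    """Hypothetical strict check: accept only a list whose items are all str."""
    is_list = isinstance(x_train, list)
    if not is_list or not all(isinstance(doc, str) for doc in x_train):
        raise ValueError(
            "the `%s` argument is expected to be a list of strings" % name
        )
    return x_train


# A plain list of strings passes through unchanged.
print(check_list_of_strings(["first document", "second document"]))

# Anything else is rejected, including a nested list.
try:
    check_list_of_strings([["first document"], ["second document"]])
except ValueError as error:
    print(error)
```

Being this strict also rejects tuples, generators and any other non-list iterable, which is a trade-off rather than an obvious win.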

angrymeir commented 2 years ago

@sergioburdisso Hm, I'm unsure about that one because...

  1. Where to start and where to end? Is it only fit/train that needs this kind of validation, or also other methods (potentially all methods with user input, for consistency)?
  2. I think it's difficult to detect whether an x_train can be cast to a list of strings without information loss. E.g., while a pandas.DataFrame can't be cast, a pandas.Series can be cast without issues, so it should stay a valid option?
  3. It's well documented, stating exactly what the function expects.
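
The difficulty in point 2 can be illustrated with a lenient variant that converts first and checks the items afterwards. This is a sketch only: `to_list_of_strings` is a hypothetical helper, and a plain dict stands in for a DataFrame because both iterate over their keys/column labels rather than their rows.

```python
def to_list_of_strings(x_train):
    """Hypothetical lenient check: convert any iterable, then verify the items."""
    docs = list(x_train)
    if not all(isinstance(doc, str) for doc in docs):
        raise ValueError("`x_train` must contain only strings")
    return docs


# Tuples and generators (and, by assumption, a pandas.Series) convert cleanly.
print(to_list_of_strings(("doc one", "doc two")))

# A dict stands in for a DataFrame here: iterating it yields the column label,
# which is itself a string, so the item check passes and the rows are lost.
table_like = {"Text": ["doc one", "doc two"]}
print(to_list_of_strings(table_like))  # ['Text']
```

In this sketch the information loss goes unnoticed by the item check; only the existing length comparison against y_train would reveal it.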