interactive labelling working with pandas dataframes

Hello!

I'm working on classifying a set of video transcripts (auto-generated by AWS) based on their content. I have a set of labels to assign to my as yet unlabelled data, and I would like to use (inter)active learning to minimize the cost of human labelling until the model is accurate enough for me to trust its predictions.

My data is in a pandas DataFrame; the columns include the video identifier, the text transcript and a few others. I've applied TF-IDF vectorization with sklearn and I classify the transcripts with sklearn's SVM classifier.

I would like to keep track of the labels I give to the learner object and save them in the original DataFrame where I keep my data. The problem is that when I "mix" the sparse matrix rows, moving them from X_pool to X_train, I have no way to link a sparse matrix row back to its transcript in the original DataFrame. Do you have any idea how I could do this? Have you ever had a similar problem?

I've read many examples in your documentation, but in every one of them the labels are present from the beginning: they are part of the initial data and we are just "pretending" not to have them.

One solution I thought of (even though not an efficient one) is to convert the sparse matrix into a pandas DataFrame, so that the two DataFrames could be linked by index. If modAL supported pandas DataFrames, this could be a way to solve the problem. Would you happen to have any other suggestions on how to tackle my issue?

Thank you so much!

E.M
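One possible way to keep that link without converting the sparse matrix is to leave the TF-IDF matrix in the same row order as the DataFrame and only move row positions between a "train" array and a "pool" array. The sketch below is an illustration under assumptions, not modAL's API: the toy data, the column names, the `oracle` dictionary standing in for the human annotator, and the hand-written margin-based uncertainty step are all hypothetical. The same bookkeeping should carry over to a learner whose query step returns an index relative to the pool it was given.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Toy stand-in for the real transcripts (hypothetical data and column names).
df = pd.DataFrame(
    {
        "video_id": ["a1", "b2", "c3", "d4", "e5", "f6"],
        "transcript": [
            "goal match football team",
            "recipe oven bake flour",
            "football league score goal",
            "bake cake sugar flour",
            "team score match win",
            "oven recipe sugar cake",
        ],
        "label": [None] * 6,
    },
    index=[10, 11, 12, 13, 14, 15],  # deliberately not 0..n-1
)

# Row i of X always corresponds to df.iloc[i]; X itself is never shuffled.
X = TfidfVectorizer().fit_transform(df["transcript"]).tocsr()

# Two seed labels so that the classifier can be fitted at all.
df.loc[10, "label"] = "sport"
df.loc[11, "label"] = "cooking"

# Hypothetical oracle that replaces the interactive human annotator here.
oracle = {"c3": "sport", "d4": "cooking", "e5": "sport", "f6": "cooking"}

# Positions (not index labels) of labelled and unlabelled rows.
labelled = df["label"].notna().to_numpy()
train_pos = np.flatnonzero(labelled)
pool_pos = np.flatnonzero(~labelled)

for _ in range(2):  # two labelling rounds
    y_train = df["label"].iloc[train_pos].astype(str)
    clf = SVC(kernel="linear").fit(X[train_pos], y_train)

    # Smallest absolute margin = most uncertain pool sample (binary case).
    margins = np.abs(clf.decision_function(X[pool_pos]))
    q = int(np.argmin(margins))       # index relative to the pool
    pos = pool_pos[q]                 # row position in X and in df
    row_label = df.index[pos]         # index label in the original DataFrame

    # Write the new label straight back into the original DataFrame.
    df.loc[row_label, "label"] = oracle[df.loc[row_label, "video_id"]]

    # Move the position from the pool to the training set.
    train_pos = np.append(train_pos, pos)
    pool_pos = np.delete(pool_pos, q)

print(df[["video_id", "label"]])
```

Because only the position arrays change, `X[train_pos]` and `X[pool_pos]` can be rebuilt cheaply at any time, and `df.index[pool_pos[q]]` always recovers the DataFrame row behind a queried sample.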

modAL-python / modAL

interactive labelling working with pandas dataframes #142