tomerm / MLClassification

Classification using ML approach for English / Hebrew / Arabic data sets

Cross validation #21

Open matanzuckerman opened 5 years ago

matanzuckerman commented 5 years ago
  1. When running with method = "cross validation", the test path does not exist (the test and train sets are combined and split in the script itself). Is there a way to fix this?

  2. In case I want to add another data preprocessing step (like removing stop words), should I just add it to the utils in Tokenization, and then add the relevant data to ignore in line 21? And if I want to do stemming, where should I add it? Also in joinTokens? I'm afraid it will be hard to maintain.

Thank you

tomerm commented 5 years ago

@matanzuckerman

  1. Cross validation is not supported at the moment. It is not clear what you mean.
  2. The suggestion is to create a new phase in the general pipeline. Label it with a new letter (e.g. "P") and add the associated call (to the relevant new code) in the main pipeline entry code (file launcher.py): (screenshot attached)
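For illustration only, here is a rough sketch of the kind of phase registration described above. The names PHASES, run_tokenization and run_preprocessing are hypothetical and do not reflect the actual structure of launcher.py; they only show the idea of mapping a new letter to a new call.

def run_tokenization(config):
    # existing "T" phase (placeholder body)
    print("tokenization phase")

def run_preprocessing(config):
    # newly added "P" phase, e.g. extra cleanup such as stemming
    print("custom preprocessing phase")

# Hypothetical dispatch table: each pipeline letter maps to the code that implements it.
PHASES = {
    "T": run_tokenization,
    "P": run_preprocessing,  # the new letter is registered here
}

def run_pipeline(request, config):
    # request is a string of phase letters, e.g. "TP"
    for letter in request:
        PHASES[letter](config)

run_pipeline("TP", {"stopwords": True})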
semion1956 commented 5 years ago

@matanzuckerman

  1. Cross-validation isn't implemented yet. I wrote about this in the README.
  2. I'm not sure I understand your comment correctly. If you need some additional preprocessing for the source text, you can add it to the Tokenization process or create an additional process. Please explain what exactly you mean.
matanzuckerman commented 5 years ago

Thank you @tomerm and @semion1956 ! I will wait for the next update with cross validation.

About the preprocessing: as far as I understand, the preprocessing currently happens inside the function joinTokens (listed below). Here we remove the stop words, extra words, etc. when the user requests it. I'm not sure this is the right place to implement it all together. I would prefer that each preprocessing step be separate (maybe each with a different letter). That way it will be easier to monitor, update, and add new preprocessing methods (I'm going to add an entity recognition preprocessing step in the client).

def joinTokens(tArr, Config):
    toks = [x[0] for x in tArr]
    tags = [x[1] for x in tArr]
    result = ''
    normalizer = ArabicNormalizer()
    if Config["stopwords"]:
        stopWords = set(stopwords.words('arabic'))
    else:
        stopWords = set()
    exPos = Config["expos"].split(",")
    exWords = Config["extrawords"].split(",")
    for i in range(len(tArr)):
        ftok = ''
        if i > 0:
            result += ' '
        if tags[i] in exPos or tags[i] in stopWords or tags[i] in exWords:
            continue
        else:
            ftok = toks[i]
        if Config["normalization"]:
            ftok = normalizer.normalize(ftok)
        result += ftok
    return result
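Purely as an illustration of the separation being proposed (not the repository's actual code), the same logic could be split into small, independently configurable steps; every function name below is hypothetical:

from nltk.corpus import stopwords

def remove_stopwords(tokens, language="arabic"):
    # drop tokens that appear in the NLTK stop-word list for the given language
    stop = set(stopwords.words(language))
    return [t for t in tokens if t not in stop]

def remove_extra_words(tokens, extra_words):
    # drop user-supplied extra words
    extra = set(extra_words)
    return [t for t in tokens if t not in extra]

def normalize_tokens(tokens, normalizer):
    # apply a normalizer object (e.g. the project's ArabicNormalizer)
    return [normalizer.normalize(t) for t in tokens]

def preprocess(tokens, config, normalizer=None):
    # each step runs only when requested, so steps can be monitored,
    # updated, and extended independently of one another
    if config.get("stopwords"):
        tokens = remove_stopwords(tokens)
    if config.get("extrawords"):
        tokens = remove_extra_words(tokens, config["extrawords"].split(","))
    if config.get("normalization") and normalizer is not None:
        tokens = normalize_tokens(tokens, normalizer)
    return " ".join(tokens)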

semion1956 commented 5 years ago

@matanzuckerman I am not sure that adding a new preprocessing method is a reason to rewrite the whole existing process. If you want to separate different types of preprocessing, you can configure the Tokenizer to run each of them separately.

tomerm commented 5 years ago

@matanzuckerman currently you can control all preprocessing options (like stop-word removal) via parameters which are passed to the "T" phase:

You can control all of those (turn them on/off) in a completely independent way, which I think is what matters most:

request = T(stopWords=yes; normalization=yes)

What you are suggesting seems to define each of those options as a standalone phase:

request = S(stopWords=yes) | N(normalization=yes)
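To make the comparison concrete, here is a small self-contained sketch of the two styles; the helper names and the request syntax are invented for illustration and do not correspond to the project's real API:

def drop_stop_words(text):
    # placeholder stop-word removal
    return " ".join(w for w in text.split() if w not in {"the", "a", "of"})

def normalize_text(text):
    # placeholder normalization
    return text.lower()

# Style 1: a single "T" phase whose behaviour is controlled by flags.
def T(text, stop_words=False, normalization=False):
    if stop_words:
        text = drop_stop_words(text)
    if normalization:
        text = normalize_text(text)
    return text

# Style 2: every option becomes its own standalone phase.
def S(text):
    return drop_stop_words(text)

def N(text):
    return normalize_text(text)

# Style 1: everything toggled through parameters of one phase.
result1 = T("The Quick Fox", stop_words=True, normalization=True)
# Style 2: the same effect achieved by chaining independent phases.
result2 = N(S("The Quick Fox"))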

Do I understand you correctly?

matanzuckerman commented 5 years ago

Hi @tomerm @semion1956, after discussion we will add each new preprocessing step as a new letter. Under T I would like to have full control over the preprocessing I'm doing (I would like to do just tokenization, just remove stop words, do both, etc.); we will control it with the parameters inside T.

tomerm commented 5 years ago

@matanzuckerman we should probably open a new issue or rename this one (since the current subject talks about cross validation, while we are discussing quite a different issue here). The request to have the tokenization functionality controlled separately from everything else (under module T) was addressed via https://github.com/tomerm/MLClassification/pull/25. Please have a look at the latest code and let us know if there is anything you want to improve.