Open matanzuckerman opened 5 years ago
Thank you @tomerm and @semion1956! I will wait for the next update with cross validation.
About the preprocessing: as far as I can tell, it currently happens inside the function joinTokens (listed below). Here we remove stop words, extra words, etc., when the user requests it. I'm not sure this is the right place to implement it all together. I would prefer that each preprocessing step be separate (maybe each with its own letter). That way it will be easier to monitor, update, and add new preprocessing methods (I'm going to add an entity-recognition preprocessing step on the client side). A sketch of such a separation follows the code below.
```python
# requires: from nltk.corpus import stopwords
# ArabicNormalizer comes from the project's own tokenization utilities
def joinTokens(tArr, Config):
    # tArr is a list of (token, POS tag) pairs
    toks = [x[0] for x in tArr]
    tags = [x[1] for x in tArr]
    result = ''
    normalizer = ArabicNormalizer()
    if Config["stopwords"]:
        stopWords = set(stopwords.words('arabic'))
    else:
        stopWords = set()
    exPos = Config["expos"].split(",")
    exWords = Config["extrawords"].split(",")
    for i in range(len(tArr)):
        ftok = ''
        if i > 0:
            result += ' '
        # note: the stop word / extra word checks compare the POS tag;
        # toks[i] may be what is intended here
        if tags[i] in exPos or tags[i] in stopWords or tags[i] in exWords:
            continue
        else:
            ftok = toks[i]
            if Config["normalization"]:
                ftok = normalizer.normalize(ftok)
            result += ftok
    return result
```
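To make the suggested separation concrete, here is a minimal sketch (hypothetical names throughout, not proposed repository code) of how each step bundled inside joinTokens could become an independent function, so that every step can be monitored, tested, and replaced on its own:

```python
# Illustrative sketch only -- not the repository's actual API.
from nltk.corpus import stopwords

def remove_stopwords(tokens, lang="arabic"):
    """Drop tokens that appear in the NLTK stop word list for `lang`."""
    stop_set = set(stopwords.words(lang))
    return [t for t in tokens if t not in stop_set]

def remove_by_pos(tagged, excluded_pos):
    """Keep only (token, tag) pairs whose tag is not excluded."""
    return [(tok, tag) for tok, tag in tagged if tag not in excluded_pos]

def normalize(tokens, normalizer):
    """Apply a normalizer object (e.g. an Arabic normalizer) to every token."""
    return [normalizer.normalize(t) for t in tokens]
```

Each function takes and returns plain token lists, so steps could be chained in any order or skipped entirely.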
@matanzuckerman I am not sure that adding a new preprocessing method is a reason to rewrite the whole existing process. If you want to separate different types of preprocessing, you can configure the Tokenizer to run each of them separately.
@matanzuckerman currently you can control all preprocessing options (like stop word removal) via parameters which are passed to the "T" phase:
You can control all of those (turn them on/off) in a completely independent way. I think this is what was most important:

    request = T(stopWords=yes; normalization=yes)

What you are suggesting seems to define each of those options as a stand-alone phase:

    request = S(stopWords=yes) | N(normalization=yes)

Do I understand you correctly?
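If it helps the comparison, here is a purely hypothetical sketch (none of these names come from the MLClassification code base) of what the stand-alone-phase design would require: a dispatch from each phase letter to its own handler:

```python
# Hypothetical sketch only -- not the project's actual request parser.
def tokenize(data, opts):
    return data.split()

def drop_stopwords(data, opts):
    stop = set(opts.get("stopWords", []))
    return [t for t in data if t not in stop]

PHASES = {"T": tokenize, "S": drop_stopwords}

def run_pipeline(text, phases):
    """phases: list of (letter, options) pairs, e.g. [("T", {}), ("S", {...})]."""
    data = text
    for letter, opts in phases:
        data = PHASES[letter](data, opts)
    return data
```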
Hi @tomerm @semion1956, after discussion we will add each new preprocessing step as a new letter. Under T I would like to have full control over the preprocessing I'm doing (just tokenization, just stop word removal, both, etc.); we will control it with the parameters inside T.
@matanzuckerman we should probably open a new issue or rename this one (the current subject talks about cross validation, while we are discussing quite a different issue here). The request to have the tokenization functionality controlled separately from everything else (under module T) was addressed via https://github.com/tomerm/MLClassification/pull/25. Please have a look at the latest code and let us know if there is anything you want to improve.
When running with method="cross validation", the test path does not exist (the test and train sets are combined and split in the script itself; see the sketch below). Is there a way to fix it?
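For context, here is a minimal sketch of what k-fold cross validation implies for the data layout, assuming scikit-learn is available; the loader name is hypothetical, not the repository's actual code. Under cross validation there is no fixed test path, because the combined corpus is re-split on every fold:

```python
from sklearn.model_selection import KFold

docs = load_all_documents()  # hypothetical loader for the combined train+test corpus
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(docs):
    train_docs = [docs[i] for i in train_idx]
    test_docs = [docs[i] for i in test_idx]
    # train and evaluate the model on this fold
```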
In case I want to add another data preprocessing step (like removing stop words), should I just add it to the utils in Tokenization? And then in line 21 should I add the relevant data to ignore? And if I want to do stemming, where should I add it? Also in joinTokens? I'm afraid it will be hard to maintain (see the sketch below for one possible stemming hook).
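For the stemming part of the question, here is one possible hook, sketched against the joinTokens code above. ISRIStemmer is NLTK's Arabic stemmer, but the Config["stemming"] flag and the helper name are hypothetical, and whether this belongs inside joinTokens or in a separate phase is exactly the maintainability concern raised here:

```python
from nltk.stem.isri import ISRIStemmer

stemmer = ISRIStemmer()

def apply_stemming(ftok, Config):
    # hypothetical Config flag, mirroring Config["normalization"] above
    if Config.get("stemming"):
        ftok = stemmer.stem(ftok)
    return ftok
```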
Thank you