Tokenization and data loading

matanzuckerman commented 5 years ago

Hi @tomerm @semion1956,

As it seems, today I need to run Tokenization part on the raw data and then load the output for the models. The problem as I can see it is that we are going to run many tests and if each time I will need to create a new folder with the new files/delete the old own it can be confusing. can we implement that in case I did the data loader before the tokenizaition part the preprocess will be done on the files we loaded? and there is no need to save them if it wasn't explicit (another parameter)

Thanks

semion1956 commented 5 years ago

@matanzuckerman Few things to be clear (obviuosly you know this):

You should not delete old target folder with all its content - Tokenizer do this itself.
You can change target folder in configuration and get new results in the another folder. Anyway, adding possibility of tokenization in the process of loading - "on the fly" - make sense. OK.

tomerm commented 5 years ago

@matanzuckerman can you please clarify what use case you have in mind ? You mentioned multiple experiments. Do you expect that results from each of those will be persisted separately ? If we truly wish to support this scenario we MUST have some serious persistence system in which all stuff associated with specific experiment can be reliable stored / retrieved. If I understand correctly, if configuration file remains unchanged (at least as far as it concerns folders in which raw data is stored), each successive experiment overwrites intermediate / final results from the previous one. By making different configuration this can be of course avoided.

I am not sure I understand how can you load data without tokenizing it first. I understand how technically this can be done. But I don't undetrstand the purpose. Also, please pay your attention that data is loaded in Python code while all actual preprocessing (including tokenization) is implemented in Java code. Interlocking between those two worlds will be very time / resources consuming. A natural way out of this is to "convert" all Java code into Python (or find alternative libraries in Python). As you can imagine this is not a cheap option.

semion1956 commented 5 years ago

@matanzuckerman I have the impression that we do not quite understand each other. We see tokenization (or any other type of preprocessing) as some first phase in achieving the ultimate goal - the classification of documents. This phase is certainly important, but it is not an end in itself and it does not have significant effect on the final result. If you need complex processing of the original documents for some additional purposes, then explain, please, what exactly and why do you need this. This will help us to avoid misunderstanding.

tomerm commented 5 years ago

@semion1956 I don't pretend that I have a full picture here but at the very least there are several reasons to play with the data during pre-processing phase:

Cleaning - data extraction is made from various formats including all kind of unnecessary garbage text (i.e. markup, non ASCII characters etc.)
Reducing clutter in the data which has potential impact on the classification - stop words immediately come to mind. However number of heuristics can be significantly larger (i.e. removing words which appear frequently, remove specific person names etc.)
Running some ad-hoc rule based classification even before we approach machine learning phase. This can be done based on some custom taxonomy or similar means.

All those manipulations are aimed at raising accuracy during machine learning phase. Please observe that as far as I can see, in some experiments indeed pre-processing is just first phase (in a long pipe including ML), while in others it is the only phase. From the code logic I don't think it should matter much.

@matanzuckerman What I do agree with @semion1956 on is that we should not marry Python code with Java one , making it overall expandable for some future not very well defined usage / use case. This requires some serious architecture discussion and we are not at the right phase in the project for this to happen.

When data is loaded for training it happens in Python code. If we wish to tokenize this data we should:

first export it to file system and
then run Java preprocessing on it,
finally upload it again to Python from file system.

semion1956 commented 5 years ago

@tomerm We both don't have the full picture - this is why I ask to get us more explanations. What about Java: using java processing in the line-to-line mode is very expensive. This is why, if we need tokenize big value of input data, we prefer to do all work (e.g., completely preprocess all input) in one java call - so, completely in java. However, in runtime of the future customer's app, work in client-server mode is preferable.

matanzuckerman commented 5 years ago

@semion1956 @tomerm Hi and sorry for the long response.

In our experiments, we are trying many configurations. With stop words or without, with removal of entities or without and more. In each experiment we are saving the results in some excel sheet (so it does not need to bother you) but it would be nice if the output of the model will be the html file we used to work with. If I understand you, for better performances, it's better that we will save the raw data in specific folder run preprocessing and save the output in different folder. Load again the data from the new folder and train + test the models. Is that correct? In this case we multiply our data by 2? and in each experiment It will overwrite the preprocessed files (from the "new folder")?

Thanks

tomerm commented 5 years ago

It is clear that many experiments are being conducted. It is not clear if you refer to end-to-end experiment (when full NLP pipe is executed) or even for just pre-processing phase. So far the discussion was just around preprocessing phase. The way data is passed between stages during preprocessing is because it happens in Java code (being called from higher level Python code). Massive amount of data being processed by Java code are best to be processed in bulks. Thus we leverage File System for the intermediate storage in this context.

Saving experiment results in Excel is up to you. There was no such a request. We definitely continue to make result data available in the form of HTML (almost JS app) for the sake of analysis and comparison between experiments / models etc.

If I understand you, for better performances, it's better that we will save the raw data in specific folder run >preprocessing and save the output in different folder. Load again the data from the new folder and train + >test the models. Is that correct?

I don't think the decision to have input data and output data in different folders is dictated by the performance considerations. Via config file you can make sure that the same folder is used. I think separate folders are used for the sake of reducing clutter and making the maintenance job slightly easier (especially when multiple experiments will be carried out).

Indeed in different experiments, if you don't change configuration file, results of preprocessing will overwrite each other in the same output folder. That is why it is important to run different experiments with different configuration files.

There is a way to avoid temporary storage of preprocessed data on the file system. In this case Python initiates data set files reading, while Java serve requested sent to it by Python. This process takes more time (as compared to case in which Java takes files from folder A, process them and put them in folder B), however, preprocessed data is always stored in RAM (and not persisted to file system).

In my humble opinion the best way to cope with flexibility and complexity of data (along with growing appetite from customer side) is to leverage serious storage system (i.e. COS). Please observe that currently there is no way to perform queries like following ones:

give me all models with accuracy > X which were built between Oct 2 and Nov 23
give me all models associated with data set Y
give me all models which were trained on data sets including file Z with version 1.2.3

tomerm / MLClassification

Tokenization and data loading #24