hipster-philology / pandora

A Tagger-Lemmatizer for Natural Languages
MIT License

Changes in training and prediction commands #9

Closed Jean-Baptiste-Camps closed 7 years ago

Jean-Baptiste-Camps commented 7 years ago

I have made two changes:

  1. added a `--tokenized_input` option to `unseen.py`, and edited the documentation accordingly;
  2. added a `--load` option to `main.py`, to allow loading an existing model and training on top of it (so training can be split over several sessions).

The first one is quite simple, but the second modification is heavier. It works fine for me, though; let me know what you think.
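For context, the `--load` flow boils down to a resume-from-checkpoint pattern: if a saved model exists, restore its weights and epoch counter before training continues. Here is a minimal, self-contained sketch of that pattern; the function names and the JSON-based storage are hypothetical stand-ins, not pandora's actual serialization.

```python
import json
import os

# Hypothetical sketch of the --load flow: restore saved state (if asked)
# and continue counting epochs from where the previous session stopped.

def save_state(model_dir, weights, curr_nb_epochs):
    """Persist weights and the epoch counter to the model directory."""
    with open(os.path.join(model_dir, "model.json"), "w") as f:
        json.dump({"weights": weights, "curr_nb_epochs": curr_nb_epochs}, f)

def load_state(model_dir):
    """Reload a previously saved model state."""
    with open(os.path.join(model_dir, "model.json")) as f:
        return json.load(f)

def train(model_dir, nb_epochs, load=False):
    """Train for nb_epochs, optionally resuming from a saved model."""
    if load:
        state = load_state(model_dir)
        weights, start = state["weights"], state["curr_nb_epochs"]
    else:
        weights, start = [0.0], 0
    for _epoch in range(start, start + nb_epochs):
        weights = [w + 0.1 for w in weights]  # stand-in for a real update
    save_state(model_dir, weights, start + nb_epochs)
    return start + nb_epochs
```

A first session (`train(d, 5)`) ends at epoch 5; a second session with `load=True` resumes and ends at epoch 10, which is why the `curr_nb_epochs` field must be part of the saved configuration.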

PonteIneptique commented 7 years ago

It seems fine to me. I did not try it but I do not see any reason not to merge. Up to @mikekestemont to decide :)

Jean-Baptiste-Camps commented 7 years ago

I have just noticed that saving did not work as intended because of a redundancy in the code of tagger.py. My question now is: do we really need both branches? In my opinion, the second is sufficient, but I might be missing something. Here is the code in question:

    # save config file:
    if self.config_path:
        # make sure that we can reproduce parametrization when reloading:
        if not self.config_path == os.sep.join((self.model_dir, 'config.txt')):
            shutil.copy(self.config_path, os.sep.join((self.model_dir, 'config.txt')))
    else:
        with open(os.sep.join((self.model_dir, 'config.txt')), 'w') as F:
            F.write('# Parameter file\n\n[global]\n')
            F.write('nb_encoding_layers = '+str(self.nb_encoding_layers)+'\n')
            F.write('nb_dense_dims = '+str(self.nb_dense_dims)+'\n')
            F.write('batch_size = '+str(self.batch_size)+'\n')
            F.write('nb_left_tokens = '+str(self.nb_left_tokens)+'\n')
            F.write('nb_right_tokens = '+str(self.nb_right_tokens)+'\n')
            F.write('nb_embedding_dims = '+str(self.nb_embedding_dims)+'\n')
            F.write('model_dir = '+str(self.model_dir)+'\n')
            F.write('postcorrect = '+str(self.postcorrect)+'\n')
            F.write('nb_filters = '+str(self.nb_filters)+'\n')
            F.write('filter_length = '+str(self.filter_length)+'\n')
            F.write('focus_repr = '+str(self.focus_repr)+'\n')
            F.write('dropout_level = '+str(self.dropout_level)+'\n')
            F.write('include_token = '+str(self.include_token)+'\n')
            F.write('include_context = '+str(self.include_context)+'\n')
            F.write('include_lemma = '+str(self.include_lemma)+'\n')
            F.write('include_pos = '+str(self.include_pos)+'\n')
            F.write('include_morph = '+str(self.include_morph)+'\n')
            F.write('include_dev = '+str(self.include_dev)+'\n')
            F.write('include_test = '+str(self.include_test)+'\n')
            F.write('nb_epochs = '+str(self.nb_epochs)+'\n')
            F.write('halve_lr_at = '+str(self.halve_lr_at)+'\n')
            F.write('max_token_len = '+str(self.max_token_len)+'\n')
            F.write('min_token_freq_emb = '+str(self.min_token_freq_emb)+'\n')
            F.write('min_lem_cnt = '+str(self.min_lem_cnt)+'\n')
            F.write('curr_nb_epochs = '+str(self.curr_nb_epochs)+'\n')
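On the redundancy question: the long run of `F.write()` calls could be collapsed into a single loop over the parameter names, e.g. using the standard-library `configparser` so the file stays reparseable on reload. This is only a sketch of that idea, not the project's code; `PARAM_NAMES` lists a subset of the attributes above for brevity, and `save_config` is a hypothetical name.

```python
import configparser

# Subset of the tagger attributes written out above (shortened for brevity).
PARAM_NAMES = (
    "nb_encoding_layers", "nb_dense_dims", "batch_size",
    "nb_left_tokens", "nb_right_tokens", "nb_epochs", "curr_nb_epochs",
)

def save_config(obj, path):
    """Write obj's parameters to an INI-style config file in one loop."""
    cfg = configparser.ConfigParser()
    cfg["global"] = {name: str(getattr(obj, name)) for name in PARAM_NAMES}
    with open(path, "w") as f:
        f.write("# Parameter file\n\n")
        cfg.write(f)
```

Keeping the writer driven by one list of names would also make mistakes like writing `include_context` twice (once under the `include_token` key) impossible.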