niekveldhuis / Digital-Assyriology

Tools and Examples for Computational Text Analysis for Assyriologists.

tf-idf/clustering #20

Open niekveldhuis opened 8 years ago

niekveldhuis commented 8 years ago

Read up on tf-idf and understand the math. Look at the ngrams issue: what difference does it make if you change the value of n? Understand the code in the notebook and change it where necessary. E.g., as Tinsley pointed out, the line labels = etcsl_data_df['text_name'][etcsl_data_df['length'] > 49] is really odd and ugly; that filtering should be taken care of earlier, where the shortest compositions are thrown out. Figure out a way to do that.
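One possible way to do this, as a rough sketch only: filter the DataFrame once, before vectorizing, so the labels come straight from the filtered frame and no second length check is needed. The toy rows, the 'text' column name, the MIN_LENGTH constant, and the use of scikit-learn's TfidfVectorizer are assumptions for illustration; the notebook may be organized differently. The loop also shows how the feature count changes with the n-gram range.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for etcsl_data_df; the rows and the 'text' column are hypothetical.
etcsl_data_df = pd.DataFrame({
    "text_name": ["A", "B", "C"],
    "text": ["lugal e2 gal lugal", "e2 gal dingir", "lugal"],
})
etcsl_data_df["length"] = etcsl_data_df["text"].str.split().str.len()

# The issue's threshold is length > 49; lowered here to fit the toy data.
MIN_LENGTH = 3

# Drop the shortest compositions once, up front, so every later step
# (vectorizing, labelling, clustering) works on the same rows.
corpus_df = etcsl_data_df[etcsl_data_df["length"] >= MIN_LENGTH].reset_index(drop=True)
labels = corpus_df["text_name"]  # no extra length filter needed here

# Compare the number of features for unigrams vs. unigrams plus bigrams.
shapes = {}
for ngram_range in [(1, 1), (1, 2)]:
    # token_pattern=r"\S+" keeps every whitespace-separated token intact.
    vectorizer = TfidfVectorizer(ngram_range=ngram_range, token_pattern=r"\S+")
    tfidf = vectorizer.fit_transform(corpus_df["text"])
    shapes[ngram_range] = tfidf.shape
    print(ngram_range, tfidf.shape)
```

Because corpus_df and labels come from the same filtered frame, the rows of the tf-idf matrix and the labels stay aligned by construction.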

niekveldhuis commented 8 years ago

Continue with tf-idf, and look into LSA (latent semantic analysis) as another potentially useful method of analysis.
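As a starting point for the LSA exploration, here is a minimal sketch that applies a truncated SVD to a tf-idf matrix, which is the usual way LSA is done with scikit-learn. The toy documents and the choice of two components are assumptions for illustration, not values from the notebook.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents standing in for the lemmatized compositions.
docs = [
    "lugal e2 gal",
    "lugal e2 dingir",
    "udu gud ab2",
    "udu gud mash2",
]

# Build the tf-idf matrix (documents x terms).
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
tfidf = vectorizer.fit_transform(docs)

# LSA: project the documents onto a small number of latent dimensions.
# n_components=2 is arbitrary here; a real corpus would need more.
svd = TruncatedSVD(n_components=2, random_state=0)
lsa = svd.fit_transform(tfidf)

print(lsa.shape)
print(svd.explained_variance_ratio_)
```

The reduced matrix lsa can then be passed to the same clustering step that currently receives the raw tf-idf matrix, which makes it easy to compare the two representations.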