Read up on tf-idf and make sure you understand the math. Look into the n-gram issue: what difference does it make if you change the value of n? Understand the code in the notebook and change it where necessary. For example, as Tinsley pointed out, the line
labels = etcsl_data_df['text_name'][etcsl_data_df['length'] > 49]
is really odd and ugly: the filtering should be taken care of earlier, where the shortest compositions are thrown out, so that labels can be read straight from the already-filtered data frame. Figure out a way to do that.
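One possible way to approach both points, as a rough sketch rather than the notebook's actual code: filter the data frame once, early, and then compare how the vocabulary grows when n increases. Everything except `etcsl_data_df`, `text_name`, `length` and `labels` is an assumption made for illustration: the toy data, the `text` column, and the helper names `drop_short`, `ngrams` and `tfidf`. The tf-idf here is the plain textbook form (tf = count / document length, idf = log(N / df)); libraries such as scikit-learn apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter

import pandas as pd

# Toy stand-in for etcsl_data_df; the tokens and the 'text' column are made up.
etcsl_data_df = pd.DataFrame({
    "text_name": ["A", "B", "C", "D"],
    "text": ["lugal e2 gal", "lugal e2 gal lugal kur", "dingir kur gal", "lugal"],
})
etcsl_data_df["length"] = etcsl_data_df["text"].str.split().str.len()


def drop_short(df, min_length=50):
    """Keep compositions with at least min_length words (same as length > 49)."""
    return df[df["length"] >= min_length].reset_index(drop=True)


# Filter once, early. min_length is lowered to 3 only because the toy texts are tiny.
etcsl_data_df = drop_short(etcsl_data_df, min_length=3)
labels = etcsl_data_df["text_name"]  # no second boolean mask needed


def ngrams(tokens, n_max):
    """All word n-grams of length 1..n_max, joined with spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf(docs):
    """Textbook tf-idf per document: (count / doc length) * log(N / doc frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [{term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
             for term, count in Counter(doc).items()}
            for doc in docs]


token_lists = [text.split() for text in etcsl_data_df["text"]]
vocab_sizes = {}
weights_by_n = {}
for n_max in (1, 2):
    docs = [ngrams(tokens, n_max) for tokens in token_lists]
    weights_by_n[n_max] = tfidf(docs)
    vocab_sizes[n_max] = len({term for doc in weights_by_n[n_max] for term in doc})
    print(f"n-grams up to {n_max}: {vocab_sizes[n_max]} distinct terms")
```

On this toy data the vocabulary grows from 5 terms (unigrams only) to 11 (unigrams plus bigrams), and a term that occurs in every remaining text, such as `gal`, gets a weight of 0 because its idf is log(1). With the real corpus the threshold would stay at the default of 50 words.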