Summary Statistics for Cleaned Dataset

losDaniel / Student-Voices

Analyze millions of teacher reviews from every English speaking country using natural language processing

Summary Statistics for Cleaned Dataset #2

Open losDaniel opened 4 years ago

losDaniel commented 4 years ago

For the final cleaned dataset (where reviews shorter than 100 or 150 characters have already been removed), provide the following summary statistics:

Number of Teachers Reviewed
Mean and Std of reviews per teacher
Average Rating per teacher and standard deviation (do teachers get only-positive or only-negative ratings, or what's the degree of mix?)
The number of Male vs. Female Teachers
Ratings of Male vs. Female Teachers
Title 1 schools.
Title 1 school ratings
Title 1 school topics
Offensive or Insulting Words before and After Cleaning
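
A minimal sketch of the per-teacher statistics in the list above, using only the standard library. It assumes the reviews can be reduced to `(teacher_id, rating)` pairs and that a rating at or above `positive_threshold` counts as positive; the function name, the input layout, and the threshold are all hypothetical, not taken from the project code.

```python
from collections import defaultdict
from statistics import mean, stdev


def teacher_summary(reviews, positive_threshold=4):
    """Summarize reviews given as (teacher_id, rating) pairs.

    The pair layout and positive_threshold are assumptions made for
    this sketch, not the project's actual schema.
    """
    ratings_by_teacher = defaultdict(list)
    for teacher_id, rating in reviews:
        ratings_by_teacher[teacher_id].append(rating)

    def safe_std(values):
        # Sample standard deviation needs at least two values.
        return stdev(values) if len(values) > 1 else 0.0

    counts = [len(r) for r in ratings_by_teacher.values()]
    teacher_means = [mean(r) for r in ratings_by_teacher.values()]

    # A teacher is "mixed" when they have at least one positive and
    # at least one non-positive rating.
    mixed = sum(
        1
        for r in ratings_by_teacher.values()
        if any(x >= positive_threshold for x in r)
        and any(x < positive_threshold for x in r)
    )

    return {
        "n_teachers": len(ratings_by_teacher),
        "reviews_per_teacher_mean": mean(counts),
        "reviews_per_teacher_std": safe_std(counts),
        "rating_mean": mean(teacher_means),
        "rating_std": safe_std(teacher_means),
        "share_mixed": mixed / len(ratings_by_teacher),
    }
```

The male vs. female and Title 1 breakdowns would be the same computation run on each subgroup, once those labels exist.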

losDaniel commented 4 years ago

Ok, I need to find the cleaned data and the rated data.

I've found the full data. I also have the rating indices so that I can separate the labels. That will give me the full universe of the bad ratings (body A). I want to find insults and all their synonyms using one of those analogy notebooks I made earlier.
I also need to find the indices for the smaller cleaned corpus though. The cleaned_docs only have the text ready for processing. I haven't found an intermediate version. I think I need to recreate the rules unless there's a variable in the data that can help me.
I hope the rules are simple so I don't have to rerun the cleaning and modify the outputs.

losDaniel commented 4 years ago

Don't even worry about the exact cleaned docs, only apply the 100 (or 150) character limit. We want to find what percentage of the raw reviews were still insulting if they were longer. Let's find that proportion.
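
A rough sketch of that proportion, assuming the raw reviews are plain strings and that an insult lexicon is available as a set of lowercase words. `INSULT_WORDS` here is a two-word placeholder, not the list built from the analogy notebooks, and `insulting_share` is a hypothetical name.

```python
import re

# Placeholder lexicon for illustration only; the real list would come
# from the insult/synonym search described earlier in this thread.
INSULT_WORDS = {"idiot", "stupid"}


def insulting_share(reviews, min_chars=100, lexicon=INSULT_WORDS):
    """Share of reviews with at least min_chars characters that
    contain one or more words from the lexicon."""
    long_reviews = [text for text in reviews if len(text) >= min_chars]
    if not long_reviews:
        return 0.0

    def is_insulting(text):
        # Whole-word match on lowercase tokens.
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        return bool(tokens & lexicon)

    flagged = sum(1 for text in long_reviews if is_insulting(text))
    return flagged / len(long_reviews)
```

Calling it with `min_chars=100` and again with `min_chars=150` gives the comparison between the two cutoffs.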

losDaniel commented 4 years ago

The old versions of the project are not loading properly. I might have to download the whole thing from Dropbox but I doubt that would make a difference. It might just be the checkpoints, I might have to move them from the archive to the main dir.

losDaniel commented 4 years ago

I also need the ids of the cleaned docs for all the rest of the fucking reviews by the way. ARG

losDaniel commented 4 years ago

It doesn't look like we can get gender from the teacher descriptions. It may just have to be from the reviews (he or she and shit like that).
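
One crude way to do this from the review text is to count gendered pronouns across all of a teacher's reviews and only assign a label when one side clearly dominates. This is a heuristic sketch, not a validated method: the pronoun sets, the `min_margin` cutoff, and the function name are assumptions, and pronouns that refer to someone other than the teacher will add noise.

```python
import re
from collections import Counter

MALE_PRONOUNS = {"he", "him", "his"}
FEMALE_PRONOUNS = {"she", "her", "hers"}


def guess_gender(review_texts, min_margin=2):
    """Guess a teacher's gender from pronoun counts in their reviews.

    Returns "male", "female", or "unknown" when neither side leads
    by at least min_margin mentions (an arbitrary cutoff).
    """
    counts = Counter()
    for text in review_texts:
        # Whole-word tokens, so "the" or "this" never match "he"/"his".
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in MALE_PRONOUNS:
                counts["male"] += 1
            elif token in FEMALE_PRONOUNS:
                counts["female"] += 1

    diff = counts["male"] - counts["female"]
    if diff >= min_margin:
        return "male"
    if -diff >= min_margin:
        return "female"
    return "unknown"
```

Teachers labeled "unknown" would have to be left out of the male vs. female comparison or handled separately.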

losDaniel commented 4 years ago

Identified the sample reviews. It was as easy as reducing the sample by the ratings and then keeping only reviews of at least 100 characters.

losDaniel commented 4 years ago