marmarbar003 / TxtMin-Project-SA

Project Update 1 #1

Open hannahhommel opened 1 month ago

hannahhommel commented 1 month ago

This is where we will document Project Update 1.

marmarbar003 commented 1 month ago

After meeting up during the break, we found a new dataset that seems more promising than the others because it contains only reviews of Apple products. This is the new data we decided to pre-process: link.

So far we have written simple pre-processing code. The code in the repository first deletes attributes we do not need, such as the date the review was written. It then simplifies the star-rating attribute from a long sentence (e.g. "3.0 out of 5.0 stars") to a plain integer (3). Next, it converts the helper count to an integer, in case we later want to analyze whether one gender is perceived as more useful in these reviews. Finally, it removes any profile name that is a default name (Amazon Customer/customer) or that has a length smaller than 2, since it is very rare for a real name to be that short.

We are now investigating which corpora to use to classify whether a reviewer is male or female. Since all of the reviewers are Indian, we have found this corpus, which contains around 14K male names. To deal with unisex names, we plan to filter them out using a list of such names, either one that one of us finds or one we create ourselves, which leaves the remaining reviewers, in theory, female.

The next pre-processing step, which will be done by the end of this week, consists of applying the male-name corpus above and either finding a similar corpus of unisex names or creating our own list of Indian unisex names with the help of online generators, in order to single out female names.
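For reference, here is a minimal sketch of the steps described above. It is not the code in the repository: the field names (`profile_name`, `rating`, `helpful`, `review`, `date`), the text format of the helper count, the cutoff constant, and the tiny name sets in the usage note are all assumptions made for illustration.

```python
import re

# Default profile names mentioned above, compared case-insensitively.
DEFAULT_NAMES = {"amazon customer", "customer"}
# Assumed cutoff: names with fewer characters than this are dropped.
MIN_NAME_LENGTH = 2


def parse_rating(text):
    """Turn e.g. '3.0 out of 5.0 stars' into 3; None if no leading number."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text)
    return int(float(match.group(1))) if match else None


def parse_count(text):
    """Take the first number in the text as an int; 0 if it has no digits."""
    match = re.search(r"\d[\d,]*", text)
    return int(match.group(0).replace(",", "")) if match else 0


def keep_name(name):
    """Reject default profile names and names below the length cutoff."""
    cleaned = name.strip()
    return cleaned.lower() not in DEFAULT_NAMES and len(cleaned) >= MIN_NAME_LENGTH


def preprocess(rows):
    """Drop unused fields (such as the date) and normalise the rest."""
    cleaned = []
    for row in rows:
        if not keep_name(row["profile_name"]):
            continue
        cleaned.append({
            "profile_name": row["profile_name"].strip(),
            "rating": parse_rating(row["rating"]),
            "helpful": parse_count(row["helpful"]),
            "review": row["review"],
        })
    return cleaned


def label_gender(name, male_names, unisex_names):
    """Label by first name: None for unisex, 'male' if listed, else 'female'."""
    parts = name.strip().lower().split()
    if not parts or parts[0] in unisex_names:
        return None
    return "male" if parts[0] in male_names else "female"
```

With placeholder sets such as `male_names = {"ravi"}` and `unisex_names = {"kiran"}`, `label_gender("Ravi Kumar", male_names, unisex_names)` gives `"male"`, a name starting with "Kiran" gives `None` (filtered out), and any other first name falls through to `"female"`, which matches the "in theory female" assumption in the plan.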