marmarbar003 / TxtMin-Project-SA


Project Update 2 #2

Open marmarbar003 opened 1 month ago

marmarbar003 commented 1 month ago

New updates: the main focus has been finding a large corpus to determine the gender of the names. We have found one, but it is implemented in C while we are working in Python. To get around this, we can try to access the data directly and search it for the names we are interested in, which would be any Indian names it contains. Another possible solution would be to quickly port our file to C, but I think that if we want to apply SA it is best to stay with Python because of its useful libraries. What do you think?

Apart from that big corpus, we have the male corpus mentioned earlier, which has now been added to GitHub after some issues downloading it. This will serve as our basis for male names. This dictionary will be added tomorrow by midday, since the code that separates the first name from the last name has had some bugs that we have been trying to fix.

On the positive side, we have used regular expressions to map certain profile_names directly to a gender, or simply to filter them. For example, profile_names that contain Mr. or Ms will be assigned male and female respectively. Some users add a Dr. at the beginning, so in that case the prefix will be removed and the corpus analysis will be applied to determine the gender.
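To make the title rule concrete, here is a minimal sketch of how it could look in Python. It is not the code in the repo: the names `TITLE_RE`, `TITLE_GENDER`, `classify_by_title` and `guess_gender` are made up for illustration, and it assumes the male corpus can be loaded as a set of lowercase first names. It only handles the three prefixes mentioned above (Mr, Ms, Dr).

```python
import re

# Hypothetical pattern: an optional-dot title followed by the rest of the name.
# Requiring a dot or whitespace after the title keeps names such as "Mrinal"
# from being read as "Mr" + "inal".
TITLE_RE = re.compile(r"^\s*(mr|ms|dr)(?:\.\s*|\s+)(.+)$", re.IGNORECASE)

# Only Mr and Ms carry gender information; Dr is stripped but stays undecided.
TITLE_GENDER = {"mr": "male", "ms": "female"}


def classify_by_title(profile_name):
    """Return (gender or None, profile name with any title removed)."""
    match = TITLE_RE.match(profile_name)
    if match is None:
        return None, profile_name.strip()
    title = match.group(1).lower()
    rest = match.group(2).strip()
    return TITLE_GENDER.get(title), rest


def guess_gender(profile_name, male_names):
    """Apply the title rule first, then fall back to the male-name set.

    `male_names` is assumed to be a set of lowercase first names taken
    from the male corpus; the first whitespace-separated token is treated
    as the first name, which is a simplification.
    """
    gender, cleaned = classify_by_title(profile_name)
    if gender is not None or not cleaned:
        return gender
    first_name = cleaned.split()[0].lower()
    return "male" if first_name in male_names else None
```

With this shape, something like `guess_gender("Dr. Raj Kumar", male_names)` would drop the Dr. and then look up the first name, returning `None` when neither the title nor the set decides it. Names that come back as `None` could be passed on to the larger corpus once we can read it.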