nirbarazida / Data-mining-project

ITC - Data mining project
0 stars 1 forks source link

Characters encoding in SQL server on AWS server #18

Open nirbarazida opened 4 years ago

nirbarazida commented 4 years ago

when running the code on a local server our Char encoding default is UTF-8mb4 (that's 4 bytes) The AWS Char encoding default is UTF-8 (that's 3 bytes) as shown in the picture. In this case, special characters (Chinese, Russian, figures, etc.) could not be encoded Thus the program crushes. To solve the issue we can choose 2 options:

  1. change the encoding with every connection in the pymysql connection (shown in the picture)
  2. create an init connection value in the AWS MySQL server.

image

nirbarazida commented 4 years ago

The solution is actually to encode the characters at the point of creating the engine as shown in the picture below

code that was added : '?charset=utf8mb4'

final code:

engine = create_engine(f"{config.SQL_EXTENSION}+{config.PYTHON_DBAPI}://{config.USER_NAME}:"
                           f"{config.PASSWORD}@localhost/{config.DB_NAME}?charset=utf8mb4")

image

nirbarazida commented 4 years ago

Although I succeeded one time to insert a user with a special string that requires UTS-8mb4 to the database - it didn't occur again.

good: image bad: image

I have tried multi disciplines to solve the problem, however none have worked. From what I've searched, this is a known problem. I've decided to solve the problem by encoding to UTF-8 all string that Stack Exchange users can freely write.

all changes have been made in user class. all new variables have been created in the json file and config class.

image

nirbarazida commented 4 years ago

for some users, all characters require encoding - Thus getting AttributeError. to solve this problem used Try and except. because the user name isn't necessary for the data Analysis we will change it to the website name + user rank

After conducting an experiment on scraping 500 users the code work fine.

image

nirbarazida commented 4 years ago

location created an issue when queering the database to check if location string excesses in Stack_Exchange_Location Table. Thus, needed to encode the location string right after scraping it.

image

nirbarazida commented 4 years ago

This time the program crashed because of using different types of letters. Added another 'Bandage' just to keep the server running by checking if the string is printable:

image

we must figure how to encode utf-8mb4 on the AWS server. I'm am sure that the solution below will work: image

green - the solution - encoding the database when creating it. yellow - insurance - encoding specific tables and columns.

I think that you've changed the MySQL config to latin1 thus it doesn't work. image