rpasquini / twitter_and_displacement

Predictive analysis of gentrification and displacement on the basis of geo-tagged tweets based metrics.

Reworking Data Upload Functions to Work with Old Data #7

Open Emman-Lopez-Oso opened 3 years ago

Emman-Lopez-Oso commented 3 years ago

San Francisco and New York's raw Twitter data are a couple of years older than the Bogota, Buenos Aires, Hong Kong, and Sydney data. The data that we currently have access to only has the following fields:

The date field is peculiar because it uses the format 'YYYY-MM-DDT00:00:00z'. I believe that this format is what causes the upload process to break.

Solution: Investigate how to adjust the data upload parameters so that this field is converted into a proper datetime format when it is inserted into MongoDB.
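A minimal sketch of one way to handle this format, assuming the trailing `z` means UTC and that the time part always follows the `HH:MM:SS` pattern shown above. The function name `parse_tweet_date` and the sample value are hypothetical and not part of the repository's upload code. The idea is only to turn the string into a timezone-aware `datetime`, which MongoDB drivers such as PyMongo store as a native date.

```python
from datetime import datetime, timezone


def parse_tweet_date(raw: str) -> datetime:
    """Parse 'YYYY-MM-DDTHH:MM:SSz' (upper- or lowercase z) into an aware UTC datetime."""
    normalized = raw.strip().upper()  # 'z' -> 'Z', so a single format string is enough
    parsed = datetime.strptime(normalized, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)  # assumption: the 'z' suffix denotes UTC


# Made-up sample value in the format described above.
print(parse_tweet_date("2013-05-01T00:00:00z").isoformat())
```

A value that does not match the pattern raises `ValueError`, so malformed rows surface immediately instead of being inserted silently.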

jenniferghu commented 3 years ago

I think that the New York data has some null values. When running .apply(lambda row: int(row.timestamp())*1000), I get the error "NaTType does not support timestamp", which makes me believe that some of the data is incomplete.
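This is the error pandas raises when `.timestamp()` is called on `NaT`, its placeholder for a missing or unparseable date. A hedged sketch of one way to guard the conversion, assuming the raw dates are held as strings in a pandas Series; the helper name `to_epoch_ms` and the sample values are made up for illustration:

```python
import pandas as pd


def to_epoch_ms(raw_dates: pd.Series):
    """Return (epoch-millisecond Series for the valid rows, number of rows skipped)."""
    # Normalize a lowercase 'z' suffix, then let unparseable or missing values become NaT.
    parsed = pd.to_datetime(raw_dates.str.upper(), errors="coerce", utc=True)
    valid = parsed.dropna()  # NaT rows are dropped before .timestamp() is called
    epoch_ms = valid.apply(lambda ts: int(ts.timestamp()) * 1000)
    return epoch_ms, len(parsed) - len(valid)


# Made-up sample: one valid date, one null, one malformed string.
sample = pd.Series(["2013-05-01T00:00:00z", None, "not a date"])
epoch_ms, skipped = to_epoch_ms(sample)
print(epoch_ms.tolist(), skipped)
```

Returning the number of skipped rows alongside the converted values keeps the data loss visible rather than hiding it.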

rpasquini commented 3 years ago

@jenniferghu Thanks for pointing that out. I would suggest the following. First, you should be able to confirm whether that is really the problem just by reading the CSV into a dataframe (in chunks if necessary) and searching for NaNs in the date field. I would also suggest implementing a simple function to count the number of such cases, so we are able to diagnose the magnitude of the problem. If the problem is not severe (a low percentage relative to total tweets), we could drop those rows of the dataframe before inserting it into Mongo. But let's diagnose the issue carefully first.
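The counting step could look roughly like the sketch below. It assumes the date column is named `date` (the actual column name is not given in this thread) and that the file is read with pandas; `count_missing_dates` is a hypothetical helper name, and the in-memory CSV is a made-up stand-in for the real file.

```python
import io

import pandas as pd


def count_missing_dates(csv_source, date_column="date", chunksize=100_000):
    """Return (missing, total) row counts for date_column, reading the CSV in chunks."""
    missing = 0
    total = 0
    # usecols keeps memory low by loading only the column being checked.
    for chunk in pd.read_csv(csv_source, usecols=[date_column], chunksize=chunksize):
        missing += int(chunk[date_column].isna().sum())
        total += len(chunk)
    return missing, total


# Made-up sample data; pass the path to the real CSV file instead.
sample_csv = io.StringIO(
    "id,date\n1,2013-05-01T00:00:00z\n2,\n3,2013-05-02T00:00:00z\n4,\n"
)
missing, total = count_missing_dates(sample_csv, chunksize=2)
print(f"{missing} of {total} rows have no date ({100 * missing / total:.1f}%)")
```

This only counts empty cells. Values that are present but malformed would need an additional parsing check.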

jenniferghu commented 3 years ago

After running section 5 for over 24 hours, I got an error that says "cursor id 7112697679657937000 not found". Since I've been running this notebook for SF for about 48 hours, I've just continued to let the rest of the notebook run in the meantime. I'll attach error messages on Slack, but if the issue were to be fixed, would section 5 run from iteration 0 all over again?

rpasquini commented 3 years ago

Hi @jenniferghu, apologies for not answering sooner. Did you manage to complete the task? To clarify: if the process in section 5, which is expected to take a long time, is interrupted for some reason, you should be able to resume it just by running the function again. The function should continue the process from where it was interrupted.
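For illustration only, resume-on-rerun behavior of this kind can be sketched with a small checkpoint file. This is not the notebook's actual section 5 function; `run_resumable`, the checkpoint file, and its `next_index` key are all hypothetical names chosen for this example.

```python
import json
import os


def run_resumable(items, handle, checkpoint_path):
    """Call handle(item) for each item, resuming after the last completed index."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            start = json.load(fh)["next_index"]  # first index not yet processed
    for index in range(start, len(items)):
        handle(items[index])
        # Record progress only after the item has been handled successfully.
        with open(checkpoint_path, "w") as fh:
            json.dump({"next_index": index + 1}, fh)
    return start  # index this run started from
```

Because progress is written only after each item succeeds, rerunning after a crash repeats at most the item that was in flight, rather than starting again from iteration 0.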

jenniferghu commented 3 years ago

No worries, @rpasquini. For the past few days, I have been unable to access Compass. A few days ago, I got the error "connect ECONNREFUSED 3.14.72.122:27017", but now I am getting the error "authentication refused". I have not changed my device or my location, so I am unsure why I cannot connect.

rpasquini commented 3 years ago

Hi @jenniferghu, hope you are well. The server is now up and running. I could not figure out what happened, but I managed to get it working again, so let me know if you face any further problems.