Extracting ratings data and constructing ratings by trip

sebbaehralarcon commented 2 years ago

The ultimate goal of this issue is to build a dataframe containing driver name, driver ID, passenger name, passenger ID, and ratings by trip.

In order to get there the following steps will be taken:

(1) Extract the ratings for each trip, together with each review's details: name, publication date, ID of the user leaving the review, etc.
(2) Extract the passenger names and IDs.

--> Match the rating and passenger on name or ID to get ratings by trip

sebbaehralarcon commented 2 years ago

@rdurante78 Update for today:

All ratings for each trip have been extracted in the form of a list.
Passenger names and passenger IDs have also been extracted.

This means the relevant data needed to complete the matching can be extracted from the JSON files, so there is no problem here.

Currently facing the following issue:

The list of ratings for each trip is expensive to process because some drivers have many ratings attached (mostly up to 40-50), so a run takes a lot of computing power, mostly 10-15 minutes.
Before the matching can be done, the list of ratings has to be split up, but the JSON file structure doesn't allow easy access to the list.

Possible Solution: writing up a function that deconstructs the list manually and makes matching by name possible (working on it right now)

sebbaehralarcon commented 2 years ago

@rdurante78 Update for today:

Besides extracting the last few things necessary to complete the matching, today has been entirely devoted to understanding the structure of the ratings in our JSON data and writing a function to extract them.

The structure of the ratings is the following:

"rating" is a key in each trip dictionary. Its value is always at least a double list, meaning one outer list containing one inner list. However, when the driver has more than 100 ratings, the double list becomes a triple list, meaning one outer list with two inner lists. This is because a new inner list is added for every 100 reviews. For example, if a driver has 152 ratings, then the value of "rating" is an outer list with one inner list of 100 ratings and another inner list of 52 ratings. (Every single rating is itself a dictionary.)

Possible Solutions:

Flattening the lists, meaning creating one bigger list with all the ratings from all sublists and then looping over the dictionaries.
Writing big if-clauses that access the lists depending on how many inner lists "rating" contains.

--> Of these two solutions the first is definitely more straightforward, but the flattening functions I tried also flatten the dictionaries, so the information contained in the individual ratings is lost. That's why I am working on a workaround that does essentially the same thing without losing the data in the dictionaries of the individual ratings.
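A minimal sketch of flattening only the list level, assuming the structure described above (an outer list of inner lists of rating dictionaries). This is not the function from the repo; the "rater_name" and "text" fields and the sample trip are made up for illustration. `itertools.chain.from_iterable` removes exactly one level of nesting, so the dictionaries are passed through untouched.

```python
from itertools import chain


def flatten_ratings(rating_value):
    """Merge the inner lists of a trip's "rating" value into one flat list.

    Only the list nesting is removed; each rating dictionary is kept
    as-is, so none of its fields are lost.
    """
    return list(chain.from_iterable(rating_value))


# Hypothetical trip: three ratings split over two inner lists.
trip = {
    "rating": [
        [
            {"rater_name": "Ana", "text": "Great ride"},
            {"rater_name": "Ben", "text": "On time"},
        ],
        [
            {"rater_name": "Cleo", "text": "Friendly"},
        ],
    ]
}

flat = flatten_ratings(trip["rating"])
for rating in flat:
    print(rating["rater_name"], "-", rating["text"])
```

Generic recursive flatteners descend into every nested container, which is likely why the dictionaries were taken apart as well; limiting the flattening to a single level avoids that.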

Regarding the second solution, I have tried to write it up quick and dirty, but it leads to dubious results. For example, the ratings in the first inner list cannot be accessed, only those in the second. So again information is lost.

I have discussed this issue with Davood (today and yesterday), since it comes from the structure of the JSON file, and I will continue to ask him about it. He has said that he isn't available tomorrow since it is a public holiday, but we will meet on Thursday.

In general I am confident that the data can be extracted without information being lost; it just takes some serious data wrangling right now.

Going forward:

I will continue working on the flattening approach since it seems the most promising and most straightforward path to pursue.

sebbaehralarcon commented 2 years ago

@rdurante78 Here is yesterday's update:

Very good news: the list problem has been solved. In the end I had to split the code up into single functions with single purposes, but the ratings information is now fully extracted.
Additionally, I have started coding the matching section and ran a couple of trials. The matching has not been successful so far, but I am confident that I am close, especially if I talk with Davood today, who will be able to double-check some of the code sections for the matching.

For now I am eyeballing the JSON data again to find the easiest way to fix the matching section

sebbaehralarcon commented 2 years ago

@rdurante78 Lunch-time update:

First successful extraction/matching of ratings by trip. I am now checking that there are no flaws in the code and that the names have been matched correctly.

sebbaehralarcon commented 2 years ago

@rdurante78 End-of-the-day update:

Successful day: ratings by trip have been extracted in two ways (even though not cleaned yet):

(1) matched on passenger_name = rating_name and (2) matched on passenger_uuid = rater_uuid
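The two matching variants can be sketched with one helper that takes the key pair as parameters. This is an illustration only, not the code that was run: the field names are taken from the update above, while `match_ratings` and the sample records are hypothetical.

```python
def match_ratings(passengers, ratings, passenger_key, rating_key):
    """Keep the ratings whose rating_key value equals some passenger's
    passenger_key value. Records with a missing key are ignored."""
    wanted = {p[passenger_key] for p in passengers if p.get(passenger_key)}
    return [r for r in ratings if r.get(rating_key) in wanted]


# Hypothetical data for a single trip.
passengers = [
    {"passenger_name": "Ana", "passenger_uuid": "u-1"},
    {"passenger_name": "Ben", "passenger_uuid": "u-2"},
]
ratings = [
    {"rating_name": "Ana", "rater_uuid": "u-1"},
    {"rating_name": "Ana", "rater_uuid": "u-9"},  # a different person named Ana
    {"rating_name": "Dan", "rater_uuid": "u-4"},
]

by_name = match_ratings(passengers, ratings, "passenger_name", "rating_name")
by_uuid = match_ratings(passengers, ratings, "passenger_uuid", "rater_uuid")

print(len(by_name), "ratings matched by name")
print(len(by_uuid), "ratings matched by UUID")
```

In this toy example the name match also picks up the second "Ana", which is the kind of false positive that non-unique names can produce, while the UUID match keeps only the exact rater.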

Tomorrow the data has to be cleaned and double-checked, since the last thing I ran today was the matching over UUID.

sebbaehralarcon commented 2 years ago

@rdurante78 Lunchtime update:

So far I have inspected and cleaned the data I extracted yesterday and on Wednesday. The issues and possible solutions are described below:

The ratings by trip have been matched using two different approaches, as can be read in this comment. Inspecting the data shows that there are not many trips for which we can extract relevant_ratings when matching by name (roughly 4500). Note, though, that when matching on names (passenger_name = rater_name) the names aren't unique.
The second approach (passenger_uuid = rater_uuid) leads to fewer relevant_ratings, but here I have to discuss with Davood and Aaron whether the UUIDs of the passenger and the rater are actually generated in the same way and not in two different ways.

If they're generated in two different ways, then this could explain the big attrition.
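One quick way to probe this before the discussion is to measure how many passenger UUIDs ever appear among the rater UUIDs. The sketch below is an assumption about how such a check could look; `uuid_overlap` and the sample IDs are hypothetical, and a low overlap is only a hint, not proof, that the two IDs come from different schemes.

```python
def uuid_overlap(passenger_uuids, rater_uuids):
    """Summarise how many distinct passenger UUIDs also occur as rater UUIDs."""
    p = set(passenger_uuids)
    r = set(rater_uuids)
    shared = p & r
    return {
        "passengers": len(p),
        "raters": len(r),
        "shared": len(shared),
        # Fraction of distinct passengers that show up as raters at all.
        "share_of_passengers": len(shared) / len(p) if p else 0.0,
    }


# Hypothetical IDs pooled over all trips.
stats = uuid_overlap(["a1", "b2", "c3", "d4"], ["a1", "zz", "c3"])
print(stats)
```

An overlap near zero across the whole dataset would point to two different ID schemes, whereas a substantial overlap would suggest the attrition has another cause, such as passengers who never left a review.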

linkcharger commented 2 years ago

Sanity checks for @linkcharger :

number of (unique) trip_ids
on the websites: number of trips vs number of ratings -> get average
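These checks could be computed on the extracted data roughly as follows, so the result can be compared with the numbers shown on the website. This is a sketch under assumptions: each trip record is taken to have a "trip_id" field and the nested "rating" list described earlier in this thread; `sanity_summary` and the sample records are hypothetical.

```python
def sanity_summary(trips):
    """Count records, unique trip IDs, and ratings, plus the average
    number of ratings per trip record."""
    unique_ids = {t["trip_id"] for t in trips}
    # "rating" is an outer list of inner lists, so sum the inner lengths.
    n_ratings = sum(len(inner) for t in trips for inner in t.get("rating", []))
    return {
        "records": len(trips),
        "unique_trip_ids": len(unique_ids),
        "ratings": n_ratings,
        "avg_ratings_per_record": n_ratings / len(trips) if trips else 0.0,
    }


# Hypothetical records; "t1" appears twice to mimic a duplicated trip.
trips = [
    {"trip_id": "t1", "rating": [[{"rater_name": "Ana"}, {"rater_name": "Ben"}]]},
    {"trip_id": "t2", "rating": [[{"rater_name": "Cleo"}]]},
    {"trip_id": "t1", "rating": [[{"rater_name": "Ana"}, {"rater_name": "Ben"}]]},
    {"trip_id": "t3", "rating": [[]]},
]

summary = sanity_summary(trips)
print(summary)
```

A gap between "records" and "unique_trip_ids" flags duplicated trips, which would inflate any average taken over records.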
