linkcharger / blablacar


Extracting ratings data and constructing ratings by trip #12

Open sebbaehralarcon opened 2 years ago

sebbaehralarcon commented 2 years ago

The ultimate goal of this issue is to build a dataframe containing driver name, driver ID, passenger name, passenger ID, and ratings by trip.

In order to get there the following steps will be taken:

(1) Extract the ratings for each trip, together with each review's metadata: the name and ID of the user leaving the review, the publication date, etc.

(2) Extract the passenger names and IDs.

--> Match each rating to a passenger on name or ID to get ratings by trip.

sebbaehralarcon commented 2 years ago

@rdurante78 Update for today:

  1. All ratings for each trip have been extracted in the form of a list.
  2. Passenger names and passenger IDs have also been extracted.

This means the data needed to complete the matching can be extracted from the JSON files, so there is no problem here.

Currently facing the following issue:

Possible solution: writing a function that deconstructs the list manually and makes matching by name possible (working on it right now).

sebbaehralarcon commented 2 years ago

@rdurante78 Update for today:

Besides extracting the last few things necessary to complete the matching, today has been entirely devoted to understanding the structure of the ratings in our JSON-data and writing up a function to extract them.

"rating" is a key in each trip-dictionary. The "rating" key has at least a double list as value. However, when the driver has more then 100 ratings, the double list becomes a triple list, meaning one outer list with two inner lists. This is because for each 100 reviews one list item is added. For example, if a driver has 152 ratings, then the value of "rating" is an outer list with one inner list with 100 ratings and another inner list with 52 ratings. (Every single rating is a dictionary itself)

Possible Solutions:

  1. Flattening out the lists, meaning creating one bigger list with all the ratings out of all sublists and then looping over the dictionaries.

  2. Writing up big if-clauses accessing the lists depending on how many inner lists "rating" contains.

--> Of these two solutions, the first is definitely more straightforward. However, the flattening functions I have tried also flatten the dictionaries, so the information contained in the individual ratings is lost. That's why I am working on a workaround that does essentially the same thing without losing the data in the dictionaries of the single ratings.
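One possible way to flatten only the outer level, leaving each rating dictionary untouched, is `itertools.chain.from_iterable`. The following is a minimal sketch, not necessarily the workaround being written here. It assumes "rating" is always an outer list of inner lists of dictionaries, as described above. The trip data is made up, and the field names (`rater_uuid`, `rating_name`, `comment`) and the helper name `flatten_rating_pages` are only illustrative.

```python
from itertools import chain

PAGE_SIZE = 100  # per the description above: one inner list per 100 ratings


def flatten_rating_pages(rating):
    """Concatenate the inner lists of a trip's "rating" value into one list.

    Only the outer list is unpacked, so every rating dictionary stays intact.
    """
    return list(chain.from_iterable(rating))


# Hypothetical trip whose driver has 152 ratings (field names are assumptions).
all_ratings = [
    {"rater_uuid": f"uuid-{i}", "rating_name": f"Rater {i}", "comment": "ok"}
    for i in range(152)
]
trip = {
    "rating": [
        all_ratings[i:i + PAGE_SIZE]
        for i in range(0, len(all_ratings), PAGE_SIZE)
    ]
}

page_sizes = [len(page) for page in trip["rating"]]
flat_ratings = flatten_rating_pages(trip["rating"])

print(page_sizes)         # [100, 52]
print(len(flat_ratings))  # 152
print(flat_ratings[100])  # first rating of the second inner list, still a dict
```

Because the function never looks inside the dictionaries, it should work for any number of inner lists without if-clauses on the list length.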

Regarding the second solution, I have tried a quick-and-dirty version, but it leads to dubious results. For example, the ratings in the second inner list can be accessed, but those in the first cannot. So again information is lost.

I have discussed this issue with Davood (today and yesterday), since it comes from the structure of the JSON file, and I will continue to ask him about it. He said that he isn't available tomorrow since it is a public holiday, but we will meet on Thursday.

In general, I am confident that the data can be extracted without information being lost. It just takes some serious data wrangling right now.

Going forward:

sebbaehralarcon commented 2 years ago

@rdurante78 Here is yesterday's update:

For now I am eyeballing the JSON data again to find the easiest way to fix the matching section.

sebbaehralarcon commented 2 years ago

@rdurante78 Lunch-time update:

sebbaehralarcon commented 2 years ago

@rdurante78 End-of-the-day update:

(1) First matched on passenger_name = rating_name, and (2) then matched on passenger_uuid = rater_uuid.

Tomorrow the data has to be cleaned and double-checked, since the last thing I ran today was the matching over UUID.
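The two matching passes mentioned above can be sketched in plain Python as follows. This is only an illustration under assumptions: the keys `passenger_name`, `rating_name`, `passenger_uuid` and `rater_uuid` come from this thread, while the `grade` field, the sample records and the helper name `match_ratings` are hypothetical.

```python
def match_ratings(passengers, ratings, passenger_key, rating_key):
    """Pair each passenger with every rating whose rating_key value equals
    the passenger's passenger_key value. Records missing the key are skipped."""
    # Index the ratings by the matching key for quick lookup.
    ratings_by_key = {}
    for rating in ratings:
        key = rating.get(rating_key)
        if key is not None:
            ratings_by_key.setdefault(key, []).append(rating)

    matches = []
    for passenger in passengers:
        key = passenger.get(passenger_key)
        if key is None:
            continue
        for rating in ratings_by_key.get(key, []):
            # Merge both records into one row of the final table.
            matches.append({**passenger, **rating})
    return matches


# Made-up sample data for one trip.
passengers = [
    {"passenger_name": "Ana", "passenger_uuid": "a1"},
    {"passenger_name": "Ben", "passenger_uuid": "b2"},
]
ratings = [
    {"rating_name": "Ana", "rater_uuid": "a1", "grade": 5},
    {"rating_name": "Ben", "rater_uuid": "zz", "grade": 4},
]

by_name = match_ratings(passengers, ratings, "passenger_name", "rating_name")
by_uuid = match_ratings(passengers, ratings, "passenger_uuid", "rater_uuid")

print(len(by_name), len(by_uuid))  # 2 1
```

Comparing the two result sizes per trip is one simple way to spot cases where the name match and the UUID match disagree.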

sebbaehralarcon commented 2 years ago

@rdurante78 Lunchtime update:

So far I have inspected and cleaned the data I extracted yesterday and on Wednesday. The issues and possible solutions are described below:

If they're generated in two different ways, then this can explain the big attrition

linkcharger commented 2 years ago

Sanity checks for @linkcharger: