Reverse-engineer David's workflow

linkcharger commented 2 years ago

This page should serve as a record for what we find out while we go through Davood's mess of files and codes. We write down progressively what things we figure out, so that in the end, when someone reads through this, they should understand the work and data flow. For a neat documentation, this should still be restructured, but at least all the insights we had are collected here.

rdurante78 commented 2 years ago

@linkcharger Email Davood to tell him you are going through his mess of files and codes and that you are going to write him to ask clarifications, and that I expect him to reply quickly.

linkcharger commented 2 years ago

Status update:

We think we figured out about the first third of the workflow. We read and understand the first bit of code that runs 5x a day to get the trips posted on the website, including some minimal information about those trips. The respective code and data files we put into the folder '01_scrape_trips'. There is a part of the code that downloads data and saves the raw data as JSON files, and there is a part which cleans this data up and converts it into CSV tables.

The issues with this first step are the following:

the code could be cleaned up and consolidated for better clarity (splitting the downloading and cleaning) , in case we need to work on it in the future
there are two output folders which seemingly contain the same data, namely the cleaned CSV files. We are unsure whether this is indeed a duplicate, a remnant of the development process or data that was processed in some other way.

linkcharger commented 2 years ago

Here is the overleaf document where the final complete workflow will be elaborated.

We will continuously add to this as we go along.

linkcharger commented 2 years ago

Meeting with Emil

missing variables:

ratings
automatic acceptance
driver bio, luggage

variable name issues:

day_count
passenger_total vs passenger_total_obs
norace

rerun whole workflow only if important variables (automatic acceptance, ratings) missing
if more information: information missing about driver (bio, luggage, other control variables)

sebbaehralarcon commented 2 years ago

@linkcharger Small Update about the synching process:

It turns out that the synching process takes even longer for me than though. Dropbox displays that the files were synched today at 6.41 am. Maybe you're approach from yesterday is not only beneficial but also needed.

sebbaehralarcon commented 2 years ago

Small Update: (for Meeting with Emil)

Regarding automatic acceptance named here we can see something in the raw_JSON_dumps txt-files named "approval mode" which can be seen as "MANUAL" or "AUTOMATIC"
Regarding ratings named in the same comment, we can identify "rating" in the JSON dumps which displays the overall ratings as "overall" and the total number of ratings as "total number"
- In more detail about the ratings: As a general structure in the JSON dumps, we have overall rating and total number of ratings as said above, plus all individual ratings given to that driver up to this ride. These individual ratings include the rating measured in stars called "global rating"(from 1 to 5) and the "comment" of the individual giving that rating (Additionally, the individual is categorised with a "role" that can be either "passenger" or "driver". The latter is given when a passenger in this ride is himself a driver in the database - not 100% sure yet.)
the driver bio cannot be identified in the JSON dumps for now (will update this further along the way)

linkcharger commented 2 years ago

num_id is only the numerical part of the trip_id

linkcharger commented 2 years ago

The file currently named 'create_day_drips_ethnicities.py' does the following:

it merges the ethnicity prediction from the NN (one for each person) with the trip and ratings records
thereby associates different people with each other
calculates averages of (accepted) ethnicities for each driver, one based on observed rides, one based on people in ratings
saves it to 'trips_ethnicities/day_trips/ethnicitytrips[date].csv'

linkcharger commented 2 years ago

Does automatic acceptance change over time (much)?

linkcharger commented 2 years ago

@EmilPalikot We just talked with Ruben: Next week we will extract the automatic/manual acceptance variable together with the trip_id and driver_id in a quick and dirty way. Do you want also the average rating and name of the reviewers?

linkcharger commented 2 years ago

@EmilPalikot How long is the list of pictures we were not able to download?

EmilPalikot commented 2 years ago

@linkcharger is the list that I send over the e-mail useful? I think I can match it to ride ids with that helps

linkcharger commented 2 years ago

Firstly of note is that this is a list of missing ethnicities, which thus encapsulates both missing pictures and missing classifications.

Next, it does seem like most of the missing pictures are 'missing' in the sense that they have/had the default profile picture, with a default URL. That makes me believe that even if we are to find the user based on their ID, we could still be missing a significant portion of pictures for the same reason.

Finally, it will take a bit more time until I have the code ready to take only the ID as input and spit out the pictures and ethnicities. You can follow the progress on #3.

sebbaehralarcon commented 2 years ago

Hey guys, I just noticed I haven't posted an Update in here about the Quick-n-Dirty Solution for extracting additional Information. That's why here a full update.

Sitrep

Firstly, I have extracted the majority of the information requested from the raw-JSON-files that we have given (from David's raw data)
Concerning the missing values I have checked the raw-JSON-data and these missing values might be from rides that were deleted on the blablacar page (for reasons like no sign-ups or other things) and was still scraped by David's code
Here I would need to ask @EmilPalikot some questions: Do you have information in the data David sent you on the trips that you see with missing values in my data set or has he deleted them and they're just not in the data set? Because I have checked David's raw JSON files and for these trips there is just the ID but no additional information at all.
When I have extracted all the information needed I will send a version of the data with the missing values and without and then we can see from there

linkcharger commented 2 years ago

As for how Davood created the datasets, here is the first part: for the trips (only basic informatin) that are scraped 5x a day.

he downloads ALL search results for a given search (a city pair) on a given day, 5x a day
at the end of the day, he puts observations of one day in one table
he drops trips which have no (alphanumeric) trip_id
he sorts the remaining trips by the numeric trip_id and the 'day_counter' variable (ascending)
he drops duplicates according to the rows 'num_id', 'DeptNum', 'destination', keeping only the first observation (in the order previously sorted)

This would mean that if a trip is observed multiple times in one day, only the first instance is preserved. This makes sense (giving us the earliest time the trip was seen). So far so good.

This does not yet answer how trips that are seen on multiple days are dealt with. I will get to that shortly.

rdurante78 commented 2 years ago

Thanks. Question:

If I post an ad in the morning and I accept a passenger in the afternoon, the dataset will only contain the first observation so no indication of the passenger?

Ruben Durante Full Professor, Universitat Pompeu Fabra Visiting Professor, PSE & INSEAD BSE, IPEG, CEPR, CESifo and IZA www.rubendurante.net https://us02web.zoom.us/my/rubendurante

On Tue, Sep 27, 2022 at 2:10 PM Aaron @.***> wrote:

As for how Davood created the datasets, here is the first part: for the trips (only basic informatin) that are scraped 5x a day.

he downloads ALL search results for a given search (a city pair) on a given day

he puts obsercations of one day in one table

he drops trips which have no (alphanumeric) trip_id

he sorts the remaining trips by the numeric trip_id and the 'day_counter' variable (ascending)

he drops duplicates according to the rows 'num_id', 'DeptNum', 'destination', keeping only the first observation (in the order previously sorted)

This would mean that if a trip is observed multiple times in one day, only the first instance is preserved. This makes sense (giving us the earliest time the trip was seen). So far so good.

This does not yet answer how trips that are seen on multiple days are dealt with. I will get to that shortly.

— Reply to this email directly, view it on GitHub https://github.com/linkcharger/blablacar/issues/2#issuecomment-1259414065, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADYCTO6VLWGWL5CPBMEWZSLWALP4TANCNFSM6AAAAAAQHYQ7LM . You are receiving this because you were assigned.Message ID: @.***>

linkcharger commented 2 years ago

CORRECTION: the above is the data that is SAVED to files, but not the data that is actually used for further scraping (typical - such bad code structure).

For the data that he keeps using, he is doing the opposite and taking the LAST trip - but it does not matter since he only takes the trip_id anyway to feed to the scraper?

linkcharger commented 2 years ago

Thanks. Question: If I post an ad in the morning and I accept a passenger in the afternoon, the dataset will only contain the first observation so no indication of the passenger?

There will be no indication of passenger, that is correct. This is both because of how Davood actually uses the data and because of the fact that in the 5x daily scrapes there is no information on passengers anyway.

The only reason for the 5x daily scrapes is to get the time of posting the trip. Nothing else.

linkcharger commented 2 years ago

A note on the current setup of Davood's code and what it means for the data we currently have:

In the version of the code I have received, Davood configured the thumbnail parser such that only reviewer thumbnails are downloaded - even though the code has the functionally to automatically download all three kinds: drivers, passengers and reviewers.

It may be totally innocent and maybe he was just scraping drivers first, then separately passengers and then finally reviewers (and I now see the last state of the code where only reviewers are getting scraped). But I thought it might be good to bring this up, to check that some groups arent over or under-represented because of this configuration.

linkcharger commented 2 years ago

A note on the current setup of Davood's code and what it means for the data we currently have:

In the version of the code I have received, Davood configured the thumbnail parser such that only reviewer thumbnails are downloaded - even though the code has the functionally to automatically download all three kinds: drivers, passengers and reviewers.

It may be totally innocent and maybe he was just scraping drivers first, then separately passengers and then finally reviewers (and I now see the last state of the code where only reviewers are getting scraped). But I thought it might be good to bring this up, to check that some groups arent over or under-represented because of this configuration.

Add to this a distinct but maybe related issue: in the folder with all the profile pictures, there are 'only' about 18k - nothing close to the 800k that you mention in your data.

Is this because this is only the 'latest' tranche of profile pictures, or did the rest get lost somewhere?

linkcharger commented 2 years ago

The code-cleanup was just finished and merged (#4 ) into the trunk. I tested the parts that I could test - if we get the proxies back we can test downloading things from the API too.

Up next:

now finish the schematic of the work and dataflow,
write a full documentation
write code to download missing profile pictures and calculate ethnicities
write code to get average ratings for each trip.

linkcharger commented 2 years ago

Some questions that I have, probably specifically for @EmilPalikot:

What is the final-final table that you use for the regressions? Because on our side, stage 07 is the latest, but here we just created three separate files of drivers-passengers-reviewers and their ethnicity compositions, so the unit of analysis is the person. But from what I understood from @rdurante78, the unit of analysis is actually the trip - yet I cannot see such a datafile anywhere. Did you construct it yourself?

Another issue is the number of individuals: the number of profile pictures that we downloaded (all that Davood had) is around 500k. Yet Emil said you have about 800k - do we still have the remaining 300k or are they lost?

EmilPalikot commented 2 years ago

The unit of the analysis is the trip. The problematic part is when we are using the information from reviews as a covariate for a driver that appears multiple times, because we don't know for sure to which trip the reviews correspond. The way I was dealing with this issue is to consider drivers last trip in the regressions, this way we know that the reviews cannot be from future trips.

We should go over the data sizes on the call. I have the following (might not be the last version):

1366582 unique trip ids
199376 unique drivers, with 41K missing ethnicities
205305 unique passengers, with 76K missing ethnicities

linkcharger / blablacar

Reverse-engineer David's workflow #2

Meeting with Emil

rerun whole workflow only if important variables (automatic acceptance, ratings) missing