linkcharger / blablacar


Reverse-engineer David's workflow #2

Open linkcharger opened 2 years ago

linkcharger commented 2 years ago

This page should serve as a record of what we find out while we go through Davood's mess of files and code. We write down progressively what we figure out, so that in the end, whoever reads through this should understand the work and data flow. For neat documentation this should still be restructured, but at least all the insights we had are collected here.

rdurante78 commented 2 years ago

@linkcharger Email Davood to tell him you are going through his mess of files and code, that you are going to write to him to ask for clarifications, and that I expect him to reply quickly.

linkcharger commented 2 years ago

Status update:

We think we have figured out about the first third of the workflow. We have read and understood the first bit of code, which runs 5x a day to get the trips posted on the website, including some minimal information about those trips. We put the respective code and data files into the folder '01_scrape_trips'. One part of the code downloads data and saves the raw data as JSON files; another part cleans this data up and converts it into CSV tables.
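In essence, that first stage is a two-step JSON-to-CSV pipeline. A minimal sketch of the idea (the field names and file name here are made up for illustration, not taken from Davood's code):

```python
import csv
import json

# Hypothetical stage 1 output: each scrape saves the raw search results as a
# JSON list of trip records (field names invented for this example).
raw = '[{"trip_id": "123-abc", "origin": "Paris", "destination": "Lyon", "price": "15.00"}]'
raw_trips = json.loads(raw)

# Hypothetical stage 2: keep only the fields of interest and write a CSV table.
fields = ["trip_id", "origin", "destination", "price"]
with open("trips.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    for trip in raw_trips:
        writer.writerow({k: trip.get(k) for k in fields})
```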

The issues with this first step are the following:

linkcharger commented 2 years ago

Here is the overleaf document where the final complete workflow will be elaborated.

We will continuously add to this as we go along.

linkcharger commented 2 years ago

Meeting with Emil

missing variables:


variable name issues:


sebbaehralarcon commented 2 years ago

@linkcharger Small update about the syncing process:

It turns out that the syncing process takes even longer for me than thought. Dropbox shows that the files were synced today at 6:41 am. Maybe your approach from yesterday is not only beneficial but also needed.

sebbaehralarcon commented 2 years ago

Small update (for meeting with Emil):

linkcharger commented 2 years ago

num_id is only the numerical part of the trip_id
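If that is all num_id is, it should be reproducible with a one-liner (the example trip_id format below is invented; we only know the real ids are alphanumeric):

```python
import re

def to_num_id(trip_id: str) -> str:
    """Keep only the digits of an alphanumeric trip_id (assumed derivation)."""
    return "".join(re.findall(r"\d+", trip_id))
```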

linkcharger commented 2 years ago

The file currently named 'create_day_drips_ethnicities.py' does the following:

linkcharger commented 2 years ago

Does automatic acceptance change over time (much)?

linkcharger commented 2 years ago

@EmilPalikot We just talked with Ruben: Next week we will extract the automatic/manual acceptance variable together with the trip_id and driver_id in a quick and dirty way. Do you also want the average rating and the names of the reviewers?

linkcharger commented 2 years ago

@EmilPalikot How long is the list of pictures we were not able to download?

EmilPalikot commented 2 years ago

@linkcharger Is the list that I sent over e-mail useful? I think I can match it to ride ids, if that helps.

linkcharger commented 2 years ago

First, note that this is a list of missing ethnicities, which thus encapsulates both missing pictures and missing classifications.

Next, it does seem like most of the missing pictures are 'missing' in the sense that they have/had the default profile picture, with a default URL. [image] That makes me believe that even if we were to find the user based on their ID, we could still be missing a significant portion of pictures for the same reason.

Finally, it will take a bit more time until I have the code ready to take only the ID as input and spit out the pictures and ethnicities. You can follow the progress on #3.

sebbaehralarcon commented 2 years ago

Hey guys, I just noticed I haven't posted an update in here about the quick-and-dirty solution for extracting additional information, so here is a full update.

Sitrep

linkcharger commented 2 years ago

As for how Davood created the datasets, here is the first part: for the trips (only basic information) that are scraped 5x a day.

  1. he downloads ALL search results for a given search (a city pair) on a given day, 5x a day
  2. at the end of the day, he puts observations of one day in one table
  3. he drops trips which have no (alphanumeric) trip_id
  4. he sorts the remaining trips by the numeric trip_id and the 'day_counter' variable (ascending)
  5. he drops duplicates according to the columns 'num_id', 'DeptNum', 'destination', keeping only the first observation (in the order previously sorted)

This would mean that if a trip is observed multiple times in one day, only the first instance is preserved. This makes sense (giving us the earliest time the trip was seen). So far so good.

This does not yet answer how trips that are seen on multiple days are dealt with. I will get to that shortly.
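Steps 3-5 above map onto a few pandas calls. A sketch with made-up data (the column names are the ones from the steps above; everything else is illustrative, not Davood's actual code):

```python
import pandas as pd

# Toy version of one day's pooled scrapes (values invented for illustration).
day = pd.DataFrame({
    "trip_id":     ["a1", None, "a1", "b2"],
    "num_id":      [1,    2,    1,    2],
    "DeptNum":     [9,    9,    9,    10],
    "destination": ["Lyon", "Nice", "Lyon", "Nice"],
    "day_counter": [1,    2,    3,    1],   # which of the 5 daily scrapes
})

# 3. drop trips with no trip_id
day = day.dropna(subset=["trip_id"])
# 4. sort ascending by numeric id and scrape-of-the-day counter
day = day.sort_values(["num_id", "day_counter"])
# 5. drop duplicates on the key columns, keeping the first (earliest) sighting
day = day.drop_duplicates(subset=["num_id", "DeptNum", "destination"], keep="first")
```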

rdurante78 commented 2 years ago

Thanks. Question:

If I post an ad in the morning and I accept a passenger in the afternoon, the dataset will only contain the first observation so no indication of the passenger?




linkcharger commented 2 years ago

CORRECTION: the above is the data that is SAVED to files, but not the data that is actually used for further scraping (typical - such bad code structure).

For the data he keeps using, he does the opposite and takes the LAST trip - but presumably it does not matter, since he only takes the trip_id anyway to feed to the scraper?
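If that reading is right, the in-memory version differs from the saved one only by keep='last', and only the id column travels onward. An illustrative pandas sketch (invented data, not Davood's code):

```python
import pandas as pd

# Toy day of scrapes: trip "a1" is seen in scrapes 1 and 3 (values invented).
day = pd.DataFrame({
    "trip_id":     ["a1", "a1", "b2"],
    "num_id":      [1,    1,    2],
    "day_counter": [1,    3,    1],
})

# Opposite of what gets SAVED: keep the LAST sighting of each trip...
latest = (day.sort_values(["num_id", "day_counter"])
             .drop_duplicates(subset=["num_id"], keep="last"))

# ...but since only the ids are fed to the detail scraper, first vs last
# yields the same set of ids anyway.
ids_to_scrape = latest["trip_id"].tolist()
```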

linkcharger commented 2 years ago

Thanks. Question: If I post an ad in the morning and I accept a passenger in the afternoon, the dataset will only contain the first observation so no indication of the passenger?

There will be no indication of the passenger, that is correct. This is both because of how Davood actually uses the data, and because the 5x daily scrapes contain no information on passengers anyway.

The only reason for the 5x daily scrapes is to get the time of posting the trip. Nothing else.

linkcharger commented 2 years ago

A note on the current setup of Davood's code and what it means for the data we currently have:

In the version of the code I received, Davood configured the thumbnail parser such that only reviewer thumbnails are downloaded - even though the code has the functionality to automatically download all three kinds: drivers, passengers and reviewers.

It may be totally innocent: maybe he was just scraping drivers first, then separately passengers, and finally reviewers (and I now see the last state of the code, where only reviewers are scraped). But I thought it might be good to bring this up, to check that some groups aren't over- or under-represented because of this configuration.

linkcharger commented 2 years ago


Add to this a distinct but possibly related issue: in the folder with all the profile pictures, there are 'only' about 18k - nothing close to the 800k that you mention in your data.

Is this because this is only the 'latest' tranche of profile pictures, or did the rest get lost somewhere?

linkcharger commented 2 years ago

The code cleanup was just finished and merged (#4) into the trunk. I tested the parts that I could test - if we get the proxies back, we can test downloading things from the API too.

Up next:

  1. finish the schematic of the work and data flow
  2. write full documentation
  3. write code to download missing profile pictures and calculate ethnicities
  4. write code to get average ratings for each trip

linkcharger commented 2 years ago

Some questions that I have, probably specifically for @EmilPalikot:

What is the final-final table that you use for the regressions? Because on our side, stage 07 is the latest, but there we just created three separate files of drivers, passengers and reviewers and their ethnicity compositions, so the unit of analysis is the person. But from what I understood from @rdurante78, the unit of analysis is actually the trip - yet I cannot see such a data file anywhere. Did you construct it yourself?

Another issue is the number of individuals: we downloaded around 500k profile pictures (all that Davood had), yet Emil said you have about 800k - do we still have the remaining 300k, or are they lost?

EmilPalikot commented 2 years ago

The unit of analysis is the trip. The problematic part is when we use information from reviews as a covariate for a driver who appears multiple times, because we don't know for sure which trip the reviews correspond to. The way I deal with this issue is to use each driver's last trip in the regressions; this way we know that the reviews cannot be from future trips.
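That rule (use each driver's last observed trip, so the reviews cannot come from later trips) could look roughly like this in pandas (column names and data are hypothetical, not from the actual regression table):

```python
import pandas as pd

# Toy trip-level table: driver 10 appears twice (values invented).
trips = pd.DataFrame({
    "driver_id": [10,   10,   11],
    "trip_id":   ["t1", "t2", "t3"],
    "dept_time": pd.to_datetime(["2022-01-01", "2022-03-01", "2022-02-01"]),
})

# For each driver, keep only the chronologically last trip: any reviews
# attached to the driver then cannot stem from trips after the one we keep.
last_trips = (trips.sort_values("dept_time")
                   .drop_duplicates(subset=["driver_id"], keep="last"))
```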

We should go over the data sizes on the call. I have the following (might not be the last version):