linkcharger opened this issue 2 years ago
@linkcharger Email Davood to tell him you are going through his mess of files and code, that you are going to write to him to ask for clarifications, and that I expect him to reply quickly.
Status update:
We think we have figured out about the first third of the workflow. We read and understood the first bit of code, which runs 5x a day to get the trips posted on the website, including some minimal information about those trips. We put the respective code and data files into the folder '01_scrape_trips'. One part of the code downloads data and saves the raw data as JSON files, and another part cleans this data up and converts it into CSV tables.
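To make the structure concrete, here is a minimal sketch of that two-part pattern. This is not Davood's actual code; the endpoint, parameters and field names are placeholders:

```python
# Hypothetical sketch of the scrape-then-clean pattern in '01_scrape_trips'.
# The URL, parameters and field names are placeholders, NOT the real ones.
import csv
import datetime
import json
import pathlib

import requests

RAW_DIR = pathlib.Path("raw_JSON_dumps")
SEARCH_URL = "https://example.invalid/search"  # placeholder for the real endpoint


def dump_search_results(origin: str, destination: str) -> pathlib.Path:
    """Download one search (a city pair) and save the raw response as a JSON dump."""
    RAW_DIR.mkdir(exist_ok=True)
    resp = requests.get(SEARCH_URL, params={"from": origin, "to": destination}, timeout=30)
    resp.raise_for_status()
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M")
    out = RAW_DIR / f"{origin}_{destination}_{stamp}.json"
    out.write_text(json.dumps(resp.json()), encoding="utf-8")
    return out


def raw_dumps_to_csv(csv_path: str) -> None:
    """Flatten all raw JSON dumps into one CSV with a few basic trip columns."""
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["trip_id", "departure", "destination", "price"])
        for dump in sorted(RAW_DIR.glob("*.json")):
            for trip in json.loads(dump.read_text(encoding="utf-8")).get("trips", []):
                writer.writerow([trip.get("id"), trip.get("departure"),
                                 trip.get("destination"), trip.get("price")])
```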
The issues with this first step are the following:
- missing variables:
- variable name issues:

Here is the Overleaf document where the final complete workflow will be elaborated. We will continuously add to this as we go along.
@linkcharger Small update about the syncing process:
It turns out that the syncing process takes even longer for me than thought. Dropbox shows that the files were synced today at 6:41 am. Maybe your approach from yesterday is not only beneficial but also needed.
Small Update: (for Meeting with Emil)
Regarding the automatic acceptance mentioned here: in the raw_JSON_dumps txt-files there is a field named "approval mode", which can be either "MANUAL" or "AUTOMATIC".
Regarding the ratings mentioned in the same comment: we can identify "rating" in the JSON dumps, which gives the overall rating as "overall" and the total number of ratings as "total number". Each individual review contains a "global rating" (from 1 to 5) and the "comment" of the individual giving that rating. (Additionally, that individual is categorised with a "role" that can be either "passenger" or "driver". The latter is given when a passenger on this ride is himself a driver in the database - not 100% sure yet.) The driver bio cannot be identified in the JSON dumps for now (will update this further along the way).
num_id is only the numerical part of the trip_id
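For reference, a minimal parsing sketch using the field names quoted above. The exact key spellings, the nesting, and the name of the list holding the individual reviews are assumptions that still need to be confirmed against the dumps:

```python
# Sketch of pulling the fields described above out of one raw-JSON trip record.
# Key spellings and nesting are assumptions based on the quoted field names.
import re


def parse_trip(record: dict) -> dict:
    rating = record.get("rating") or {}
    return {
        "trip_id": record.get("id"),
        # num_id is only the numerical part of the trip_id
        "num_id": "".join(re.findall(r"\d+", str(record.get("id", "")))),
        "approval_mode": record.get("approval mode"),  # "MANUAL" or "AUTOMATIC"
        "overall_rating": rating.get("overall"),
        "n_ratings": rating.get("total number"),
    }


def parse_reviews(record: dict) -> list[dict]:
    # one row per individual review: "global rating" (1 to 5), free-text "comment",
    # and the reviewer's "role" ("passenger" or "driver")
    return [
        {
            "trip_id": record.get("id"),
            "global_rating": review.get("global rating"),
            "comment": review.get("comment"),
            "role": review.get("role"),
        }
        for review in record.get("reviews", [])
    ]
```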
The file currently named 'create_day_drips_ethnicities.py' does the following:
Does automatic acceptance change over time (much)?
@EmilPalikot We just talked with Ruben: next week we will extract the automatic/manual acceptance variable together with the trip_id and driver_id in a quick and dirty way. Do you also want the average rating and the names of the reviewers?
@EmilPalikot How long is the list of pictures we were not able to download?
@linkcharger is the list that I sent over e-mail useful? I think I can match it to ride ids if that helps.
First, note that this is a list of missing ethnicities, which thus covers both missing pictures and missing classifications.
Next, it does seem like most of the missing pictures are 'missing' in the sense that they have/had the default profile picture, with a default URL. That makes me believe that even if we are to find the user based on their ID, we could still be missing a significant portion of pictures for the same reason.
Finally, it will take a bit more time until I have the code ready to take only the ID as input and spit out the pictures and ethnicities. You can follow the progress on #3.
Hey guys, I just noticed I haven't posted an update in here about the quick-and-dirty solution for extracting additional information, so here is a full update.
Sitrep
Firstly, I have extracted the majority of the information requested from the raw JSON files that we were given (from Davood's raw data).
Concerning the missing values: I have checked the raw JSON data, and these missing values might be from rides that were deleted on the blablacar page (for reasons like no sign-ups or other things) but were still scraped by Davood's code.
Here I would need to ask @EmilPalikot some questions: do you have information on these trips in the data Davood sent you, or has he deleted them so that they are simply not in the data set? I have checked Davood's raw JSON files, and for these trips there is just the ID but no additional information at all.
Once I have extracted all the information needed, I will send one version of the data with the missing values and one without, and then we can see from there.
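As a rough illustration of how I check this, here is a sketch that flags such ID-only records, assuming the dumps parse into a list of trip dicts in which 'id' is the only key that is always filled:

```python
# Sketch: flag trips whose raw JSON record contains nothing but an ID.
# Assumes 'records' is a list of trip dicts loaded from the raw JSON dumps.
def is_id_only(record: dict) -> bool:
    other_values = [v for k, v in record.items() if k != "id"]
    return all(v in (None, "", [], {}) for v in other_values)


def deleted_ride_candidates(records: list[dict]) -> list:
    # these are the trips that show up with missing values in the extracted data
    return [r.get("id") for r in records if is_id_only(r)]
```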
As for how Davood created the datasets, here is the first part: for the trips (only basic information) that are scraped 5x a day.
- he downloads ALL search results for a given search (a city pair) on a given day
- he puts observations of one day in one table
- he drops trips which have no (alphanumeric) trip_id
- he sorts the remaining trips by the numeric trip_id and the 'day_counter' variable (ascending)
- he drops duplicates according to the columns 'num_id', 'DeptNum', 'destination', keeping only the first observation (in the order previously sorted)
This would mean that if a trip is observed multiple times in one day, only the first instance is preserved. This makes sense (giving us the earliest time the trip was seen). So far so good.
This does not yet answer how trips that are seen on multiple days are dealt with. I will get to that shortly.
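Translated into a minimal pandas sketch (not Davood's actual code, just the same logic using the column names from the list above):

```python
import pandas as pd


def dedup_daily_trips(day_table: pd.DataFrame) -> pd.DataFrame:
    """One day's search results -> one row per trip, keeping the earliest sighting."""
    # drop trips which have no (alphanumeric) trip_id
    kept = day_table.dropna(subset=["trip_id"])
    # sort by the numeric trip id and the day_counter, ascending
    kept = kept.sort_values(["num_id", "day_counter"], ascending=True)
    # keep only the first observation per (num_id, DeptNum, destination)
    return kept.drop_duplicates(subset=["num_id", "DeptNum", "destination"], keep="first")
```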
Thanks. Question:
If I post an ad in the morning and I accept a passenger in the afternoon, the dataset will only contain the first observation so no indication of the passenger?
CORRECTION: the above describes the data that is SAVED to files, but not the data that is actually used for further scraping (typical - such bad code structure).
For the data that he keeps using, he does the opposite and takes the LAST trip - but it probably does not matter, since he only takes the trip_id anyway to feed to the scraper.
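If that reading of the code is right, the version used downstream would just flip the keep argument of the earlier sketch, roughly:

```python
import pandas as pd


def trip_ids_used_for_scraping(day_table: pd.DataFrame) -> list:
    """Same deduplication as sketched above, but keeping the LAST observation per trip.
    Only the trip_id column is then fed to the detail scraper (my reading, to be confirmed)."""
    kept = day_table.dropna(subset=["trip_id"])
    kept = kept.sort_values(["num_id", "day_counter"], ascending=True)
    last = kept.drop_duplicates(subset=["num_id", "DeptNum", "destination"], keep="last")
    return last["trip_id"].tolist()
```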
> Thanks. Question: If I post an ad in the morning and I accept a passenger in the afternoon, the dataset will only contain the first observation so no indication of the passenger?
There will be no indication of the passenger, that is correct. This is both because of how Davood actually uses the data and because the 5x daily scrapes contain no information on passengers anyway.
The only reason for the 5x daily scrapes is to get the time of posting the trip. Nothing else.
A note on the current setup of Davood's code and what it means for the data we currently have:
In the version of the code I received, Davood configured the thumbnail parser such that only reviewer thumbnails are downloaded - even though the code has the functionality to automatically download all three kinds: drivers, passengers and reviewers.
It may be totally innocent: maybe he was just scraping drivers first, then separately passengers, and finally reviewers (and I now see the last state of the code, where only reviewers are being scraped). But I thought it might be good to bring this up, to check that some groups aren't over- or under-represented because of this configuration.
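Purely for illustration (the flag names below are made up, not the ones in Davood's code), the kind of configuration being described looks like this:

```python
# Hypothetical flags - not Davood's actual variable names.
# The point is just that the parser supports all three thumbnail kinds,
# but only reviewers were switched on in the version of the code we received.
DOWNLOAD_THUMBNAILS = {
    "drivers": False,
    "passengers": False,
    "reviewers": True,
}


def thumbnail_kinds_to_download() -> list[str]:
    return [kind for kind, enabled in DOWNLOAD_THUMBNAILS.items() if enabled]
```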
Add to this a distinct but maybe related issue: in the folder with all the profile pictures, there are 'only' about 18k - nothing close to the 800k that you mention in your data.
Is this because this is only the 'latest' tranche of profile pictures, or did the rest get lost somewhere?
The code cleanup was just finished and merged (#4) into the trunk. I tested the parts that I could test - if we get the proxies back, we can test downloading things from the API too.
Up next:
Some questions that I have, probably specifically for @EmilPalikot:
What is the final-final table that you use for the regressions? On our side, stage 07 is the latest, but there we just created three separate files of drivers, passengers and reviewers and their ethnicity compositions, so the unit of analysis is the person. But from what I understood from @rdurante78, the unit of analysis is actually the trip - yet I cannot see such a data file anywhere. Did you construct it yourself?
Another issue is the number of individuals: the number of profile pictures that we downloaded (all that Davood had) is around 500k. Yet Emil said you have about 800k - do we still have the remaining 300k or are they lost?
The unit of analysis is the trip. The problematic part is when we use the information from reviews as a covariate for a driver who appears multiple times, because we don't know for sure which trip the reviews correspond to. The way I have been dealing with this issue is to consider the driver's last trip in the regressions; this way we know that the reviews cannot be from future trips.
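As a sketch of that rule (the table and column names here are illustrative assumptions, not the actual files):

```python
import pandas as pd


def last_trip_with_review_covariates(trips: pd.DataFrame,
                                     reviews_by_driver: pd.DataFrame) -> pd.DataFrame:
    """Keep each driver's chronologically last trip, then attach review-based covariates,
    so the reviews cannot come from trips that happen after the trip being analysed."""
    last_trips = (
        trips.sort_values("departure_date")
             .drop_duplicates(subset=["driver_id"], keep="last")
    )
    return last_trips.merge(reviews_by_driver, on="driver_id", how="left")
```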
We should go over the data sizes on the call. I have the following (might not be the last version):
This page should serve as a record of what we find out while we go through Davood's mess of files and code. We progressively write down what we figure out, so that in the end, when someone reads through this, they should understand the work and data flow. For neat documentation this should still be restructured, but at least all the insights we had are collected here.