godfriedmeesters / scraper

As part of DiffScraper, one or more bots can be deployed. Ready-to-use bots are provided that can extract offers from mobile applications, mobile websites and desktop websites.
GNU General Public License v3.0

Add more data points to searches #5

Open bkrumnow opened 3 years ago

bkrumnow commented 3 years ago

Here are some input data points that we would like to add (keep in mind that some of these may lead to prices shown in different currencies).

Booking:

Kayak:

Opodo:

Expedia:

Airfrance:

EuroWings:

godfriedmeesters commented 3 years ago

For booking.com, only hotel reservations are supported, so we need location + check-in date.

godfriedmeesters commented 3 years ago

It seems these data points have not been checked manually. For example, there are no flights available with Eurowings in May from PAR to HAM.

godfriedmeesters commented 3 years ago

With the new corona wave maybe more flights will be canceled. To be safe maybe we need to focus on data >= June.

godfriedmeesters commented 3 years ago

Added comparisons with new datapoints.

If you want to add more, first check if flights are available, and run ts-node cli.ts scrape to make sure it doesn't crash for the given date.

bkrumnow commented 3 years ago

It seems these data points have not been checked manually. For example, there are no flights available with Eurowings in May from PAR to HAM.

Actually, I did... Let's divide these between us and make sure the list provides valid input data. More on that tomorrow.

godfriedmeesters commented 3 years ago

I already added more input data, see pgadmin.

For EuroWings, only flight dates that return one result are allowed.

bkrumnow commented 3 years ago

I wanted to check the data online, but it seems that I don't possess the rights to execute queries or see any databases. Could you check it?

godfriedmeesters commented 3 years ago

Just checked with your username: I can do SELECT on every table, and also updates.

godfriedmeesters commented 3 years ago

Updated the input data for EuroWings; suddenly there were no more flights.

godfriedmeesters commented 3 years ago

It's really terrible, flights keep disappearing randomly: https://scraperbox.be/screenshots/EuroWingsWebScraper-1617692472681.png

bkrumnow commented 3 years ago

Let them expire. We will update them shortly before we start the data collection.

bkrumnow commented 3 years ago

I checked.

I already added more input data, see pgadmin.

For EuroWings, only flight dates that return one result are allowed.

The list online is still pretty incomplete. Let's take 3 different searches for each comparison, e.g. for booking.com, we want to have the same input data for each comparison (1. mobile vs desktop, 2. French vs German site, ...):

Madrid (MAD) - Warsaw (WAW), June 14th, 2021
Bordeaux (BOD) - Rome (FCO), May 6th, 2021
Porto (OPO) - Berlin (BER), May 13th, 2021

That way, we do not rely on one single data point that may be flawed.

bkrumnow commented 3 years ago

Let's add the following connections:

Kayak:
Berlin (BER) - Barcelona (BCN), 13.08.2021
Madrid (MAD) - Rome (FCO), 07.08.2021
{ "origin": "BER", "destination": "BCN", "departureDate": "2021-08-13" }
{ "origin": "MAD", "destination": "FCO", "departureDate": "2021-08-07" }

Booking:
Bordeaux (BOD) - Rome, 13.08.2021
Porto (OPO) - Berlin (BER), 25.08.2021

Opodo:
Cologne (CGN) - Prague (PRG), 23.08.2021
Porto (OPO) - Brussels (BRU), 18.08.2021
{ "origin": "CGN", "destination": "PRG", "departureDate": "2021-08-23" }
{ "origin": "OPO", "destination": "BRU", "departureDate": "2021-08-18" }

Expedia:
Stockholm (ARN) - Amsterdam (AMS), 10.08.2021
Porto (OPO) - Brussels (BRU), 25.08.2021
{ "origin": "AMS", "destination": "ARN", "departureDate": "2021-08-10" }
{ "origin": "OPO", "destination": "BRU", "departureDate": "2021-08-18" }

Airfrance:
Madrid (MAD) - Paris (PAR), 25.08.2021
Vienna (VIE) - Amsterdam (AMS), 09.08.2021
{ "origin": "MAD", "destination": "PAR", "departureDate": "2021-08-25" }
{ "origin": "VIE", "destination": "AMS", "departureDate": "2021-08-09" }

EuroWings:
Cologne (CGN) - London (LON), 12.08.2021
Berlin (BER) - Rome (FCO), 23.08.2021
{ "origin": "CGN", "destination": "LON", "departureDate": "2021-08-12" }
{ "origin": "BER", "destination": "FCO", "departureDate": "2021-08-23" }
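Since these entries are typed by hand, a small sanity check before they go into the database might catch slips such as a date written in DD.MM.YYYY form or a mistyped code. The following is only a sketch: the `FlightSearchInput` interface and the `validateSearchInput` function are hypothetical names chosen for illustration, not part of the scraper, and the check only looks at the shape of the values (three uppercase letters, a real calendar date in YYYY-MM-DD), not at whether flights exist.

```typescript
// Hypothetical shape of one search input, mirroring the JSON entries in this thread.
interface FlightSearchInput {
  origin: string;
  destination: string;
  departureDate: string;
}

const THREE_LETTER_CODE = /^[A-Z]{3}$/;
const ISO_DATE = /^(\d{4})-(\d{2})-(\d{2})$/;

// True only for a real calendar date written as YYYY-MM-DD.
function isValidIsoDate(value: string): boolean {
  const m = ISO_DATE.exec(value);
  if (!m) return false;
  const year = Number(m[1]);
  const month = Number(m[2]);
  const day = Number(m[3]);
  const d = new Date(Date.UTC(year, month - 1, day));
  // Round-trip check rejects values like 2021-02-30.
  return (
    d.getUTCFullYear() === year &&
    d.getUTCMonth() === month - 1 &&
    d.getUTCDate() === day
  );
}

// Returns a list of human-readable problems; an empty list means the entry looks fine.
function validateSearchInput(input: FlightSearchInput): string[] {
  const problems: string[] = [];
  if (!THREE_LETTER_CODE.test(input.origin)) {
    problems.push(`origin "${input.origin}" is not a 3-letter code`);
  }
  if (!THREE_LETTER_CODE.test(input.destination)) {
    problems.push(`destination "${input.destination}" is not a 3-letter code`);
  }
  if (input.origin === input.destination) {
    problems.push("origin and destination are identical");
  }
  if (!isValidIsoDate(input.departureDate)) {
    problems.push(`departureDate "${input.departureDate}" is not YYYY-MM-DD`);
  }
  return problems;
}
```

A check like this would not replace manually confirming that flights are available, but it could run over the whole list in one go before a scrape is started.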

bkrumnow commented 3 years ago

We need a second version of the booking.com scraper that provides flights.

godfriedmeesters commented 3 years ago

See the comparison table for the new additions.

bkrumnow commented 3 years ago

@godfriedmeesters Last things needed:

  1. Flights for booking.com. Is there a scraper somewhere that does this already? I am also happy to do both for booking.
  2. A new dataset so that we can check if the new input data leads to sufficient results. You can just dump a new CSV export of the DB into Skype, I will do the analysis.

godfriedmeesters commented 3 years ago

If you want to do the scraper for Booking flights, I guess it is best to look at BookingWebScraper.ts, which scrapes only hotel rooms.

About the dataset: I corrected a bug where apps returned duplicate offers. Hopefully we will have a good dataset next week.

godfriedmeesters commented 3 years ago

Booking Flights is much more difficult than hotel offers. It is only possible to query by XPath.

07:05 BRU · Jul 01
55 Min. · Direkt
08:00 AMS · Jul 01
KLM, durchgeführt von KLM Cityhopper
223,99 € Insgesamt

The XPath selector works in Chrome but not in Puppeteer. In Chrome, //div[@data-testid='searchresults_card']//*[contains(text(),'€')] gives the prices.

However, this gives wrong prices:

    let elements = await this.page.$x("//div[@data-testid='searchresults_card']//*[contains(text(),'€')]");
    var txts = [];
    for (var elem of elements) {
      await this.page.waitFor(100);
      const price = await this.page.evaluate(el => el.textContent, elem);
      console.log(price);
    }

OUTPUT: 136,07 € 136,07 € 136,07 € [] []
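One possible way around matching individual price elements is to read the full text of each result card once and pull the price out in plain code. The sketch below only covers that parsing step and has not been tried against the live Booking page; `parseEuroPrice` and `extractCardPrices` are hypothetical helper names, and the price format (comma as decimal separator, optional dot as thousands separator, trailing €) is an assumption based on the card text pasted above.

```typescript
// Parses the first German-formatted euro amount in a text,
// e.g. "223,99 €" or "1.223,99 €". Returns null if none is found.
function parseEuroPrice(text: string): number | null {
  const match = /(\d{1,3}(?:\.\d{3})*|\d+),(\d{2})\s*€/.exec(text);
  if (!match) return null;
  const [, whole = "", cents = ""] = match;
  // Drop thousands separators, then use a dot as the decimal separator.
  return Number(`${whole.replace(/\./g, "")}.${cents}`);
}

// Maps the text of each result card to its price (null where no price was found),
// keeping the order of the cards.
function extractCardPrices(cardTexts: string[]): Array<number | null> {
  return cardTexts.map(parseEuroPrice);
}
```

The card texts themselves could presumably be collected in a single call such as `page.$$eval("div[data-testid='searchresults_card']", cards => cards.map(c => c.textContent))`, after waiting for the cards to render. Whether that avoids the repeated 136,07 € values seen above is untested.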

godfriedmeesters commented 3 years ago

I give up on Booking flights, too difficult

godfriedmeesters commented 3 years ago

For example try https://flights.booking.com/flights/BRU-AMS/?type=ONEWAY&adults=1&cabinClass=ECONOMY&children=&from=BRU&to=AMS&fromCountry=BE&toCountry=NL&fromLocationName=Br%C3%BCssel&toLocationName=Flughafen+Schiphol&depart=2021-07-01&sort=BEST&aid=304142&label=gen173nr-1DCAEoggI46AdIM1gEaBWIAQGYAQe4ARfIAQzYAQPoAQGIAgGoAgO4AtvCkIUGwAIB0gIkNzRkZmRhYTUtNTNiOC00YmJmLWEyMzEtZDdmN2U3ZDBlN2M12AIE4AIB&stops=0

In Chrome DevTools, the query //div[@data-testid='searchresults_card']//*[contains(text(),'€')] works well. However, in Puppeteer it seems very tricky.

bkrumnow commented 3 years ago

@godfriedmeesters As discussed, we do not want to delay progress much further. Could you please add two more distinct cities and dates, so that we end up with three comparisons of each kind for booking.com (a. 3x mobile vs. web + b. 3x web vs. web; same input data for a and b)?

godfriedmeesters commented 3 years ago

Added new comparisons with different data.

Also changed the order in the comparisons table, so the same company will not be scraped consecutively
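The thread does not say how the reordering was done. If it ever needs to be automated, one common approach is a greedy interleave: always pick the company with the most comparisons left, except the one that was scraped last. The sketch below illustrates that idea; the `Comparison` interface and `interleaveByCompany` function are hypothetical and do not reflect the actual comparisons table schema.

```typescript
// Hypothetical minimal view of a row in the comparisons table.
interface Comparison {
  id: number;
  company: string;
}

// Reorders comparisons so that the same company is not scraped twice in a row
// whenever that is possible. If one company dominates the list, repeats are unavoidable.
function interleaveByCompany(items: Comparison[]): Comparison[] {
  // Group the comparisons into one queue per company.
  const queues: Record<string, Comparison[]> = {};
  for (const item of items) {
    const queue = queues[item.company] ?? [];
    queue.push(item);
    queues[item.company] = queue;
  }

  const companies = Object.keys(queues);
  const result: Comparison[] = [];
  let last: string | null = null;

  while (result.length < items.length) {
    // Pick the company with the most remaining items, skipping the previous one.
    let chosen: string | null = null;
    for (const company of companies) {
      const remaining = (queues[company] ?? []).length;
      if (remaining === 0 || company === last) continue;
      if (chosen === null || remaining > (queues[chosen] ?? []).length) {
        chosen = company;
      }
    }
    // Only the previous company has items left: a repeat cannot be avoided.
    if (chosen === null) chosen = last;
    if (chosen === null) break;

    const next = (queues[chosen] ?? []).shift();
    if (next === undefined) break;
    result.push(next);
    last = chosen;
  }
  return result;
}
```

Within each company the original order is preserved, so only the interleaving between companies changes.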

bkrumnow commented 3 years ago

Added new comparisons with different data.

Also changed the order in the comparisons table, so the same company will not be scraped consecutively

Good move. I like it