mwpenny / kijiji-scraper

A lightweight node.js module for retrieving and scraping ads from Kijiji
MIT License

Missing very new ads when using the Mobile API #57

Closed MaximumPotato closed 3 years ago

MaximumPotato commented 3 years ago

Issue: Ads aren't retrieved while using the Mobile API scrape method for up to a minute after they are posted.

Details: I have been playing around with this module and have noticed an odd behavior while testing. Let's say we want to write to the console every time a new ad is posted. To do this, we compare the time of our last scrape against the date stored in each Ad object retrieved by the scraper. If any ads were posted after our last scrape time, we notify the user about them.

The issue here is that, at least when using the Mobile API, certain ads won't be picked up for up to a minute after their post date found in the Ad object. If we change the above example around a little bit:

There is a good chance this ad was not actually picked up during the 5:10pm scan. At that point the "last scan" value would have been something earlier, say 5:05pm, and the ad would have been seen as new. Because the ad was not actually picked up during that scrape, its corresponding Ad object was not available to be examined at 5:10pm. Our current scrape at, let's say, 5:15pm picks up the ad, but at this point it is considered old since it was posted before the last time we ran a scan. The user never finds out about this new posting.
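A minimal sketch of the failure mode described above. `findNewAds` is a hypothetical helper (not part of kijiji-scraper), the ad objects are trimmed to a `title` and `date`, and the 5:09:30pm post time is an assumed value chosen to fit the 5:05pm / 5:10pm / 5:15pm scans in the example.

```javascript
// Naive "new ad" check: an ad counts as new only if its date is later
// than the time of the previous scrape.
function findNewAds(ads, lastScrapeTime) {
    return ads.filter((ad) => ad.date > lastScrapeTime);
}

// An ad posted at 5:09:30pm that only shows up in results after 5:10pm.
const lateAd = { title: "Example ad", date: new Date("2021-03-29T17:09:30") };

// 5:10pm scrape (previous scrape at 5:05pm): the ad is not returned yet.
const at510 = findNewAds([], new Date("2021-03-29T17:05:00"));

// 5:15pm scrape (previous scrape at 5:10pm): the ad is returned now,
// but its date is earlier than the previous scrape, so it is filtered out.
const at515 = findNewAds([lateAd], new Date("2021-03-29T17:10:00"));

console.log(at510.length, at515.length); // the ad is reported in neither scan
```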

Notes: I have yet to test this using the HTML scraper; that is next on the list. If the behavior is consistent, I believe there would be two possible culprits:

  1. Ads just don't show up on the website for up to a minute after a user clicks post. This is the "easy" answer as it's out of our hands and just needs to be worked around.
  2. For some reason the module is ignoring or removing posts which are "too new". This feels pretty unlikely.

Of course, if using the HTML scraper fixes the issue, then the problem would be either in the Mobile API itself or in the code specific to its handling within the module.

I will update if I find out more.

mwpenny commented 3 years ago

Interesting, thanks for the detailed explanation.

Some context - the mobile API returns 4 dates for each ad:

  1. Creation date
  2. Modification date
  3. Start date
  4. End date

I just did a little testing with some different ads. For most ads which have not been modified after posting, start date was always within a few seconds of modification date. On ads which have likely been up for a while (posted by companies and refreshed periodically, such as apartment rentals) creation date usually differed significantly from start date, whereas on ads posted by individuals creation date and start date were the same. End date depends on ad category and is usually 60 days after creation date, but on "company" ads I've seen it returned as midnight on 01/01/3000 (i.e., never expire). See https://help.kijiji.ca/helpdesk/basics/how-long-are-ads-active for more information about ad expiry.

This module uses creation date when scraping, so what you could be running into are "refreshed" ads which have been up for a while but were re-posted/moved up in the listings at a later date. Kijiji has a "bump up" feature and it's possible that it creates a gap between creation date and start date ("the Bump Up feature will reset the post date of your ad").

I'm curious - which kind of ads are you running into this problem with? Are they apartment rentals by any chance, and could you provide a few examples so I can look further? Also, please let me know how it goes with the HTML scraper. Another thing you can try is using the actual mobile app. It sounds like I need to switch to using start date but I'm not 100% certain of the exact semantics of the different values and would like some additional data to investigate further to be certain.

MaximumPotato commented 3 years ago

Alright so just to re-iterate what you've said and ensure I understand correctly.

Kijiji Mobile API returns 4 dates for each ad:

This module uses Creation Date as the 'Date' value for an ad.

Bump Up:

So we have 3 dates that indicate something was done to the ad, and a single date we don't really care about telling us when the ad will die. I'd like to see the documentation for the Mobile API if it exists. I did a little looking last week and couldn't find anything.

For your questions, I have been running searches for very common words while working so that I always have new ads coming in. The primary two search terms used are "new" and "phone". I could provide you some examples, but there really isn't any rhyme or reason to the postings that I can see. Mostly just random items that contain the search terms.

I can try setting my search frequency to 60 seconds or so and searching manually on the mobile app every minute til something new comes up, to see if the results match on each platform at any given time. First I am going to try using the HTML scraper like I said since I've got some free time. I'll post the results shortly.

MaximumPotato commented 3 years ago

So perhaps I'm doing something wrong, but when I provide the following: Parameters:

Options:

The search term submitted is "undefined". It only picks up a search term if I define "q" as I would if I were using the API scanning type, which is not used in the HTML Scraper according to the readme.

mwpenny commented 3 years ago

Your understanding is mostly correct based on what I've seen except I'm not 100% sure of the effect of "bump up" on creation date and modification date, in addition to the uncertainty around start date. It's also unclear whether or not non-corporate accounts can change start date.

Keep in mind that my sample size is rather small (basically just me scraping different ads and looking at the API responses). My working theory is that start date is the main date that the site uses, and the date that I should use as well. This would mean that "bump up" simply sets the start date to the current date, but I haven't confirmed this with the examples I've tried. The benefit of you providing examples where you actually saw the problem (and, if possible, the corresponding timestamps when you scraped and failed to retrieve the ad due to timestamps) is that I can then query the API and check the 4 date values to confirm or reject my theory.

As for documentation, there's no official documentation for the mobile API which is part of the problem. I learned what I know by monitoring the communication between the mobile app and Kijiji servers. Here is where I send the HTTP request to retrieve a single ad: https://github.com/mwpenny/kijiji-scraper/blob/09dcb5e1552296d723178969832455ba862def5d/lib/backends/api-scraper.ts#L122

If you're curious about the API, rather than monitoring communication using the app, which is a little involved (I had to rebuild it to get it to read my own self-signed certificate), it should be much easier to just monitor traffic between kijiji-scraper and Kijiji - or you could just use the Node.js debugger. Here is an example API response for this apartment listing: response.txt (GitHub doesn't let me attach XML, so I've made it a text file).

Notice the discrepancy between dates:

<ad:creation-date-time>2020-06-11T03:30:10.000Z</ad:creation-date-time>
<ad:modification-date-time>2021-03-29T13:01:44.000Z</ad:modification-date-time>
<ad:start-date-time>2021-03-29T13:01:44.000Z</ad:start-date-time>
<ad:end-date-time>3000-01-01T00:00:00.000Z</ad:end-date-time>

An example provided by you will help me know if I need to use modification date or start date (in this example they are the same).
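As a rough illustration of comparing the four values, the sketch below pulls the date elements out of the response excerpt shown above. `extractAdDates` is a hypothetical helper, not something the module provides, and the regex is only adequate for this excerpt; a real XML parser would be more robust.

```javascript
// Hypothetical helper: read the four <ad:*-date-time> elements from a raw
// XML string and return them as Date objects (null if an element is missing).
function extractAdDates(xml) {
    const dates = {};
    for (const name of ["creation", "modification", "start", "end"]) {
        const pattern = new RegExp(`<ad:${name}-date-time>([^<]+)</ad:${name}-date-time>`);
        const match = xml.match(pattern);
        dates[name] = match ? new Date(match[1]) : null;
    }
    return dates;
}

// Excerpt copied from the apartment listing response above.
const xml = `
<ad:creation-date-time>2020-06-11T03:30:10.000Z</ad:creation-date-time>
<ad:modification-date-time>2021-03-29T13:01:44.000Z</ad:modification-date-time>
<ad:start-date-time>2021-03-29T13:01:44.000Z</ad:start-date-time>
<ad:end-date-time>3000-01-01T00:00:00.000Z</ad:end-date-time>
`;

const dates = extractAdDates(xml);
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days between creation and start for this listing.
const gapDays = Math.floor((dates.start - dates.creation) / MS_PER_DAY);
// Whether start and modification coincide, as they do in this example.
const startEqualsModification = dates.start.getTime() === dates.modification.getTime();

console.log(gapDays, startEqualsModification);
```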

mwpenny commented 3 years ago

As for your failed search, I was just able to search successfully with the following:

const kijiji = require("kijiji-scraper");

let params = {
    locationId: kijiji.locations.ONTARIO,
    categoryId: kijiji.categories.BUY_AND_SELL,
    sortByName: "dateDesc",
    keywords: "bicycle"
};

let options = {
    minResults: 1,
    maxResults: 10,
    scraperType: kijiji.ScraperType.HTML  // same as "html"
};

kijiji.search(params, options).then(function(ads) {
    console.log(ads.length);
    for (const ad of ads) {
        console.log(ad.title);
    }
}).catch(console.error);

Output:

10
Bike
Girls bike - Nakamura Meyou 20" - Used
4 bicycle bike rack
Girls 14-16 inch bike
Wicked fallout plus mens bike
Trek Domane 56 CM” Di2 Carbon Road Bike
Hybrid devinci St-Tropez 400$
For sale - gently used Nordic Track spin bike
Racing Bike Specialized Roubaix Expert Ultegra
Vintage Sekine men’s bike for sale

Where were you seeing undefined? And using q instead of keywords worked?

MaximumPotato commented 3 years ago

With regards to the failed search, I had my own little logging setup running, which I've realized was just reading me the value of q instead of whatever was actually being sent out as search terms. I think I'll get the Node debugger running.

Gunna go get some ice cream while this thing runs for a bit, it should have some examples when I'm back.

MaximumPotato commented 3 years ago

Scan is run at 9:43.00PM

PICTONIANS IN ARMS by J. M. Cameron – 1969 - 2020-12-02 11:39.09 p.m.
Tablet case 8.9"-10" - 2020-12-04 11:30.12 p.m.
Gummi Crib rail cover - 2020-12-04 5:25.11 a.m.
Bugatti Divo - 2021-03-29 9:38.19 p.m.
Brasil ball cap - 2020-12-04 5:28.26 a.m.
Ladies retro boot - 2020-12-04 5:12.45 a.m.
Lounge Chair - New condition - 2021-03-29 9:25.05 p.m.
Daphne’s Leprechaun Head Cover - 2021-03-29 9:23.00 p.m.
1000 piece puzzle - 2021-03-29 9:21.11 p.m.
Nike DNA pack 2.0 (2 pairs) - 2021-03-29 9:20.12 p.m.

Scan is run at 9:43.32PM

Sony digital frame (Brand new) - 2021-03-29 9:42.24 p.m.
PICTONIANS IN ARMS by J. M. Cameron – 1969 - 2020-12-02 11:39.09 p.m.
Tablet case 8.9"-10" - 2020-12-04 11:30.12 p.m.
Gummi Crib rail cover - 2020-12-04 5:25.11 a.m.
Bugatti Divo - 2021-03-29 9:38.19 p.m.
Brasil ball cap - 2020-12-04 5:28.26 a.m.
Ladies retro boot - 2020-12-04 5:12.45 a.m.
Lounge Chair - New condition - 2021-03-29 9:25.05 p.m.
Daphne’s Leprechaun Head Cover - 2021-03-29 9:23.00 p.m.
1000 piece puzzle - 2021-03-29 9:21.11 p.m.

The ad Sony digital frame (Brand new) was posted at 9:42.24pm today. This is before either scan took place.

Oh, and this was using the HTML scraper. That would lead me to believe the ad wasn't visible at the time of its post date.

mwpenny commented 3 years ago

Glad to hear the searching is working, and thanks for the example. Here are the dates:

<ad:creation-date-time>2021-03-30T00:42:24.000Z</ad:creation-date-time>
<ad:modification-date-time>2021-03-30T00:42:25.000Z</ad:modification-date-time>
<ad:start-date-time>2021-03-30T00:42:24.000Z</ad:start-date-time>
<ad:end-date-time>2021-05-29T00:42:24.000Z</ad:end-date-time>

So the Kijiji website HTML uses start date (modification date is a second ahead of what was scraped and examples above confirm it's definitely not creation date).
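The comparison can be sketched as follows. The `-03:00` offset on the scraped time is an assumption inferred from the values themselves (9:42.24 p.m. local on 2021-03-29 lines up with 00:42:24Z on 2021-03-30); the three API values are copied from the response above.

```javascript
// Date reported by the HTML scraper for "Sony digital frame (Brand new)".
// The UTC-3 offset is assumed, not stated anywhere in the scraped output.
const scraped = new Date("2021-03-29T21:42:24-03:00");

// Dates returned by the mobile API for the same ad.
const apiDates = {
    creation: new Date("2021-03-30T00:42:24.000Z"),
    modification: new Date("2021-03-30T00:42:25.000Z"),
    start: new Date("2021-03-30T00:42:24.000Z")
};

// Names of the API dates that are identical to the scraped date.
const matches = Object.keys(apiDates).filter(
    (name) => apiDates[name].getTime() === scraped.getTime()
);

console.log(matches); // creation and start match; modification is one second later
```

For this ad alone, creation date and start date are indistinguishable. Creation date is ruled out by the earlier apartment listing, where it differs from start date by months.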

I think the multiple dates have been a red herring then - it looks like your first explanation was correct and there is a short delay between when the ad is posted and when it becomes visible, even on the website itself. I'll update the readme to mention this, and also switch to using start date for the API-based scraper.

mwpenny commented 3 years ago

Updated (PR #58), thanks for reporting this. Very nuanced.

As for your project, rather than using the scrape time to determine if an ad is new, one alternative is to save the URL in a set and then compare URLs of potentially new ads against that set. If you want to avoid unbounded memory growth, you'd still need to look at timestamps to purge old ads but it would be more robust. Anyway, there's more than one way to go.
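A minimal sketch of that idea, assuming each ad object exposes a `url` property. `SeenAds` and its methods are hypothetical names, and the one-hour retention in the usage lines is an arbitrary value. This version purges by the time an ad was last returned by a scrape, so an ad that is still showing up in results is never forgotten.

```javascript
// Hypothetical tracker: remembers ad URLs and when each was last returned.
class SeenAds {
    constructor(maxAgeMs) {
        this.maxAgeMs = maxAgeMs;
        this.lastSeen = new Map(); // url -> timestamp (ms) of the latest scrape containing it
    }

    // Returns the ads whose URL has not been seen before and records
    // every ad in the batch as seen at time `now`.
    filterNew(ads, now = Date.now()) {
        const fresh = [];
        for (const ad of ads) {
            if (!this.lastSeen.has(ad.url)) {
                fresh.push(ad);
            }
            this.lastSeen.set(ad.url, now);
        }
        return fresh;
    }

    // Forgets URLs that have not appeared in any scrape for longer than
    // maxAgeMs, which keeps the map from growing without bound.
    purge(now = Date.now()) {
        for (const [url, seenAt] of this.lastSeen) {
            if (now - seenAt > this.maxAgeMs) {
                this.lastSeen.delete(url);
            }
        }
    }
}

// Usage with placeholder URLs and explicit timestamps.
const seen = new SeenAds(60 * 60 * 1000);
const adA = { url: "https://example.invalid/ad-a" };
const adB = { url: "https://example.invalid/ad-b" };

const firstScan = seen.filterNew([adA], 0);
const secondScan = seen.filterNew([adA, adB], 1000);
seen.purge(60 * 60 * 1000 + 2000);
```

In a polling loop, `filterNew` would be called with the result of each search and `purge` once per iteration.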

MaximumPotato commented 3 years ago

Glad we got that nailed down, although I'm not 100% sure what you meant in that second to last post, first paragraph. (Kijiji website HTML uses start date... ...and the examples above confirm it's definitely not start date)?

I appreciate the suggestion; I was just mulling over how to determine what is new when it popped up. Do you have any idea whether URLs change when an ad is updated/bumped?

mwpenny commented 3 years ago

Whoops, typo. I meant to say that the examples above confirm it's definitely not creation date (namely, the apartment listings). I've edited my original comment.

Public-facing URLs contain the ad category, location, and ID. Ad IDs are unique and the location and category can't be changed after posting (see this Kijiji help page). So the URLs shouldn't change. I also doubt Kijiji would do that since anybody could have saved the URL after posting. Though this makes me think that it might be worth making the unique ID easily available on Ad objects regardless. I'll look into it when I get some more free time. Let me know if you run into trouble using URLs in the meantime.

MaximumPotato commented 3 years ago

Thanks for clearing that up, I was pretty confused.

I just did a little test regarding the URLs. I posted an ad titled "New Good Phone" with some bogus information and grabbed the URL, which ended in "new-good-phone/1558341164". I then changed the ad title to "New Bad Phone" and grabbed the URL again. The end had changed to "new-bad-phone/1558341164". Visiting the ad through the original URL still worked, redirecting to the proper URL with the new title. I then got curious and tried to visit "...asdjwjiajsfhaheiuwf/1558341164", and that also redirected me to the ad.

From there I started removing random chunks of the URL and it seems that if you enter https://kijiji.ca/v-/YourAdID you will be redirected to the ad corresponding to that ID. Those are the only parts of the URL that matter. Interestingly you cannot remove or alter the "v-" bit, that must be the first bit of text after kijiji.ca/.
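Based on that observation, the ID can be recovered from the last path segment. The sketch below uses two hypothetical helpers. The category and city segments in the sample URLs are made-up placeholders (only the endings come from the test above), and the short form relies on the redirect behaviour observed here, which is not documented.

```javascript
// Hypothetical helper: return the numeric ad ID from the last path segment
// of an ad URL, or null if the last segment is not all digits.
function adIdFromUrl(adUrl) {
    const segments = new URL(adUrl).pathname.split("/").filter(Boolean);
    const last = segments[segments.length - 1] || "";
    return /^\d+$/.test(last) ? last : null;
}

// Hypothetical helper: build the short URL form that was seen to redirect to the ad.
function shortAdUrl(adId) {
    return `https://kijiji.ca/v-/${adId}`;
}

// Placeholder category/city segments; only the endings come from the test above.
const beforeRename = adIdFromUrl("https://www.kijiji.ca/v-placeholder-category/placeholder-city/new-good-phone/1558341164");
const afterRename = adIdFromUrl("https://www.kijiji.ca/v-placeholder-category/placeholder-city/new-bad-phone/1558341164");

console.log(beforeRename === afterRename); // same ID despite the title change
console.log(shortAdUrl(beforeRename));
```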

All that said I think making the unique ID available directly would be super useful considering it's a vital piece of information.

No trouble using URLs so far, it's far more reliable than the old method.

mwpenny commented 3 years ago

Good find! It makes sense that most of the URL doesn't matter since an ad's ID uniquely identifies it. The rest of the URL is to make it easier for humans to read at a glance. I found that you can also do https://www.kijiji.ca/v-view-details.html?adId=YourAdId. It looks like they're pretty relaxed about URLs.

It also makes sense that they redirect the old URL to the new one. However, I can see this being a problem when searching if the URL is what is used to check whether the ad has been seen before (granted, I don't think ad titles change very often). I'll expose an id property on Ad objects.

mwpenny commented 3 years ago

Exposed ad ID in PR #59. The latest version is on NPM. It should be easier for you to track seen ads now.
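A small sketch of why keying on the ID is more robust than keying on the URL. The ad objects here are trimmed stand-ins with only `id` and `url`; the string type of `id` and the placeholder URL segments are assumptions, and `newByKey` is a hypothetical helper.

```javascript
// Hypothetical helper: return the ads whose key (as computed by keyOf)
// is not yet in the `seen` set, adding each new key as it goes.
function newByKey(ads, seen, keyOf) {
    const fresh = [];
    for (const ad of ads) {
        const key = keyOf(ad);
        if (!seen.has(key)) {
            seen.add(key);
            fresh.push(ad);
        }
    }
    return fresh;
}

// The same ad scraped before and after its title was edited (placeholder URLs).
const beforeEdit = { id: "1558341164", url: "https://www.kijiji.ca/v-placeholder/new-good-phone/1558341164" };
const afterEdit = { id: "1558341164", url: "https://www.kijiji.ca/v-placeholder/new-bad-phone/1558341164" };

const seenUrls = new Set();
const seenIds = new Set();

// First scrape: the ad is new under both schemes.
newByKey([beforeEdit], seenUrls, (ad) => ad.url);
newByKey([beforeEdit], seenIds, (ad) => ad.id);

// Second scrape, after the title change.
const reportedByUrl = newByKey([afterEdit], seenUrls, (ad) => ad.url);
const reportedById = newByKey([afterEdit], seenIds, (ad) => ad.id);

console.log(reportedByUrl.length, reportedById.length); // URL keying reports it again; ID keying does not
```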

MaximumPotato commented 3 years ago

@mwpenny I appreciate the help, thanks!