Closed MaximumPotato closed 3 years ago
Interesting, thanks for the detailed explanation.
Some context - the mobile API returns 4 dates for each ad: creation date, modification date, start date, and end date.
I just did a little testing with some different ads. For most ads which have not been modified after posting, start date was always within a few seconds of modification date. On ads which have likely been up for a while (posted by companies and refreshed periodically, such as apartment rentals) creation date usually differed significantly from start date, whereas on ads posted by individuals creation date and start date were the same. End date depends on ad category and is usually 60 days after creation date, but on "company" ads I've seen it returned as midnight on 01/01/3000 (i.e., never expire). See https://help.kijiji.ca/helpdesk/basics/how-long-are-ads-active for more information about ad expiry.
This module uses creation date when scraping, so what you could be running into are "refreshed" ads which have been up for a while but were re-posted/moved up in the listings at a later date. Kijiji has a "bump up" feature and it's possible that it creates a gap between creation date and start date ("the Bump Up feature will reset the post date of your ad").
I'm curious - which kind of ads are you running into this problem with? Are they apartment rentals by any chance, and could you provide a few examples so I can look further? Also, please let me know how it goes with the HTML scraper. Another thing you can try is using the actual mobile app. It sounds like I need to switch to using start date but I'm not 100% certain of the exact semantics of the different values and would like some additional data to investigate further to be certain.
Alright so just to re-iterate what you've said and ensure I understand correctly.
Kijiji Mobile API returns 4 dates for each ad:
Creation Date - The date the ad was created.
Modification Date - The date the ad was last modified.
Start Date - Always the same as creation date for normal posters, can be "refreshed" to current date for corporate accounts.
End Date - The date the ad will become inactive, either as defined here or at some ridiculous future point for sponsored ads.
This module uses Creation Date as the 'Date' value for an ad.
The "bump up" feature:
Resets Creation Date value to current date.
Effect on Start Date is unknown.
End Date unaffected according to the bump up article.
Modification Date likely unaffected.
So we have 3 dates that indicate something was done to the ad, and a single date we don't really care about telling us when the ad will die. I'd like to see the documentation for the Mobile API if it exists. I did a little looking last week and couldn't find anything.
For your questions, I have been running searches for very common words while working so that I always have new ads coming in. The primary two search terms used are "new" and "phone". I could provide you some examples, but there really isn't any rhyme or reason to the postings that I can see. Mostly just random items that contain the search terms.
I can try setting my search frequency to 60 seconds or so and searching manually on the mobile app every minute til something new comes up, to see if the results match on each platform at any given time. First I am going to try using the HTML scraper like I said since I've got some free time. I'll post the results shortly.
So perhaps I'm doing something wrong, but when I provide the following: Parameters:
Options:
The search term submitted is "undefined". It only picks up a search term if I define "q" as I would if I were using the API scanning type, which is not used in the HTML Scraper according to the readme.
Your understanding is mostly correct based on what I've seen except I'm not 100% sure of the effect of "bump up" on creation date and modification date, in addition to the uncertainty around start date. It's also unclear whether or not non-corporate accounts can change start date.
Keep in mind that my sample size is rather small (basically just me scraping different ads and looking at the API responses). My working theory is that start date is the main date that the site uses, and the date that I should use as well. This would mean that "bump up" simply sets the start date to the current date, but I haven't confirmed this with the examples I've tried. The benefit of you providing examples where you actually saw the problem (and, if possible, the corresponding timestamps when you scraped and failed to retrieve the ad due to timestamps) is that I can then query the API and check the 4 date values to confirm or reject my theory.
As for documentation, there's no official documentation for the mobile API which is part of the problem. I learned what I know by monitoring the communication between the mobile app and Kijiji servers. Here is where I send the HTTP request to retrieve a single ad: https://github.com/mwpenny/kijiji-scraper/blob/09dcb5e1552296d723178969832455ba862def5d/lib/backends/api-scraper.ts#L122
If you're curious about the API, rather than monitoring communication using the app, which is a little involved (I had to rebuild it to get it to read my own self-signed certificate), it should be much easier to just monitor traffic between kijiji-scraper and Kijiji - or you could just use the Node.js debugger. Here is an example API response for this apartment listing: response.txt (GitHub doesn't let me attach XML, so I've made it a text file).
Notice the discrepancy between dates:
<ad:creation-date-time>2020-06-11T03:30:10.000Z</ad:creation-date-time>
<ad:modification-date-time>2021-03-29T13:01:44.000Z</ad:modification-date-time>
<ad:start-date-time>2021-03-29T13:01:44.000Z</ad:start-date-time>
<ad:end-date-time>3000-01-01T00:00:00.000Z</ad:end-date-time>
An example provided by you will help me know if I need to use modification date or start date (in this example they are the same).
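If you want to check these values yourself, here is a minimal sketch that pulls the four date elements out of a response fragment and measures the gap between creation date and start date. The helper names (`extractAdDates`, `daysBetween`) are made up for illustration, and the regex-based extraction is an assumption chosen for brevity; it is not necessarily how kijiji-scraper parses responses.

```javascript
// Fragment copied from the example response above.
const xml = `
<ad:creation-date-time>2020-06-11T03:30:10.000Z</ad:creation-date-time>
<ad:modification-date-time>2021-03-29T13:01:44.000Z</ad:modification-date-time>
<ad:start-date-time>2021-03-29T13:01:44.000Z</ad:start-date-time>
<ad:end-date-time>3000-01-01T00:00:00.000Z</ad:end-date-time>
`;

// Hypothetical helper: extract the four <ad:*-date-time> values as Date objects.
function extractAdDates(xmlText) {
    const dates = {};
    const pattern = /<ad:(creation|modification|start|end)-date-time>([^<]+)<\/ad:\1-date-time>/g;
    for (const [, name, value] of xmlText.matchAll(pattern)) {
        dates[name] = new Date(value);
    }
    return dates;
}

// Hypothetical helper: number of whole days between two dates.
function daysBetween(a, b) {
    const msPerDay = 24 * 60 * 60 * 1000;
    return Math.floor(Math.abs(b - a) / msPerDay);
}

const dates = extractAdDates(xml);
console.log(daysBetween(dates.creation, dates.start)); // 291
console.log(dates.start.getTime() === dates.modification.getTime()); // true
```

For this listing, creation date is roughly ten months older than start date, while start date and modification date are identical.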
As for your failed search, I was just able to search successfully with the following:
const kijiji = require("kijiji-scraper");
let params = {
    locationId: kijiji.locations.ONTARIO,
    categoryId: kijiji.categories.BUY_AND_SELL,
    sortByName: "dateDesc",
    keywords: "bicycle"
};
let options = {
    minResults: 1,
    maxResults: 10,
    scraperType: kijiji.ScraperType.HTML // same as "html"
};
kijiji.search(params, options).then(function(ads) {
    console.log(ads.length);
    for (const ad of ads) {
        console.log(ad.title);
    }
}).catch(console.error);
Output:
10
Bike
Girls bike - Nakamura Meyou 20" - Used
4 bicycle bike rack
Girls 14-16 inch bike
Wicked fallout plus mens bike
Trek Domane 56 CM” Di2 Carbon Road Bike
Hybrid devinci St-Tropez 400$
For sale - gently used Nordic Track spin bike
Racing Bike Specialized Roubaix Expert Ultegra
Vintage Sekine men’s bike for sale
Where were you seeing undefined? And using q instead of keywords worked?
With regards to the failed search, I had my own little logging setup running which, I've realized, was just reading me the value of q instead of whatever was actually being sent out as search terms. I think I'll get the Node debugger running.
Gunna go get some ice cream while this thing runs for a bit, it should have some examples when I'm back.
Scan is run at 9:43.00PM
PICTONIANS IN ARMS by J. M. Cameron – 1969 - 2020-12-02 11:39.09 p.m.
Tablet case 8.9"-10" - 2020-12-04 11:30.12 p.m.
Gummi Crib rail cover - 2020-12-04 5:25.11 a.m.
Bugatti Divo - 2021-03-29 9:38.19 p.m.
Brasil ball cap - 2020-12-04 5:28.26 a.m.
Ladies retro boot - 2020-12-04 5:12.45 a.m.
Lounge Chair - New condition - 2021-03-29 9:25.05 p.m.
Daphne’s Leprechaun Head Cover - 2021-03-29 9:23.00 p.m.
1000 piece puzzle - 2021-03-29 9:21.11 p.m.
Nike DNA pack 2.0 (2 pairs) - 2021-03-29 9:20.12 p.m.
Scan is run at 9:43.32PM
Sony digital frame (Brand new) - 2021-03-29 9:42.24 p.m.
PICTONIANS IN ARMS by J. M. Cameron – 1969 - 2020-12-02 11:39.09 p.m.
Tablet case 8.9"-10" - 2020-12-04 11:30.12 p.m.
Gummi Crib rail cover - 2020-12-04 5:25.11 a.m.
Bugatti Divo - 2021-03-29 9:38.19 p.m.
Brasil ball cap - 2020-12-04 5:28.26 a.m.
Ladies retro boot - 2020-12-04 5:12.45 a.m.
Lounge Chair - New condition - 2021-03-29 9:25.05 p.m.
Daphne’s Leprechaun Head Cover - 2021-03-29 9:23.00 p.m.
1000 piece puzzle - 2021-03-29 9:21.11 p.m.
The ad Sony digital frame (Brand new) was posted at 9:42.24pm today. This is before either scan took place.
Oh, and this was using the HTML scraper. That would lead me to believe the ad wasn't visible at the time of its post date.
Glad to hear the searching is working, and thanks for the example. Here are the dates:
<ad:creation-date-time>2021-03-30T00:42:24.000Z</ad:creation-date-time>
<ad:modification-date-time>2021-03-30T00:42:25.000Z</ad:modification-date-time>
<ad:start-date-time>2021-03-30T00:42:24.000Z</ad:start-date-time>
<ad:end-date-time>2021-05-29T00:42:24.000Z</ad:end-date-time>
So the Kijiji website HTML uses start date (modification date is a second ahead of what was scraped and examples above confirm it's definitely not creation date).
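The comparison behind that conclusion can be sketched as follows. This assumes the scraped time (9:42:24 p.m. on 2021-03-29) was reported in a UTC-3 time zone, which is what makes it line up with the API values; the `matchingDates` helper is hypothetical.

```javascript
// Date values from the API response above (UTC).
const apiDates = {
    creation: "2021-03-30T00:42:24.000Z",
    modification: "2021-03-30T00:42:25.000Z",
    start: "2021-03-30T00:42:24.000Z"
};

// Date reported by the HTML scraper, written with an ASSUMED UTC-3 offset.
const scrapedDate = new Date("2021-03-29T21:42:24-03:00");

// Hypothetical helper: names of the API dates that equal the scraped date.
function matchingDates(scraped, candidates) {
    return Object.keys(candidates).filter(
        (name) => new Date(candidates[name]).getTime() === scraped.getTime()
    );
}

console.log(matchingDates(scrapedDate, apiDates)); // [ 'creation', 'start' ]
```

On its own this ad only rules out modification date, since creation date and start date are identical here. It is the apartment listing, where creation date is months older than start date, that rules out creation date.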
I think the multiple dates have been a red herring then - it looks like your first explanation was correct and there is a short delay between when the ad is posted and when it becomes visible, even on the website itself. I'll update the readme to mention this, and also switch to using start date for the API-based scraper.
Updated (PR #58), thanks for reporting this. Very nuanced.
As for your project, rather than using the scrape time to determine if an ad is new, one alternative is to save the URL in a set and then compare URLs of potentially new ads against that set. If you want to avoid unbounded memory growth, you'd still need to look at timestamps to purge old ads but it would be more robust. Anyway, there's more than one way to go.
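A rough sketch of the set-based approach, in case it helps. The `SeenAds` class and its method names are hypothetical (not part of kijiji-scraper), and it assumes each ad object has a `url` property. It purges by when a URL last appeared in the results rather than by the ad's own date, which is just one possible choice.

```javascript
// Hypothetical tracker: remembers ad URLs and when each last appeared in results.
class SeenAds {
    constructor(maxAgeMs) {
        this.maxAgeMs = maxAgeMs;
        this.lastSeen = new Map(); // url -> last time seen (ms since epoch)
    }

    // Returns the ads whose URL has not been seen before and records every sighting.
    filterNew(ads, now = Date.now()) {
        const fresh = [];
        for (const ad of ads) {
            if (!this.lastSeen.has(ad.url)) {
                fresh.push(ad);
            }
            this.lastSeen.set(ad.url, now);
        }
        return fresh;
    }

    // Forgets URLs that have not appeared in results for more than maxAgeMs.
    purge(now = Date.now()) {
        for (const [url, seenAt] of this.lastSeen) {
            if (now - seenAt > this.maxAgeMs) {
                this.lastSeen.delete(url);
            }
        }
    }
}

// Usage with made-up ads and explicit timestamps (in ms) for clarity.
const tracker = new SeenAds(1000);
console.log(tracker.filterNew([{ url: "a" }, { url: "b" }], 0).length);   // 2
console.log(tracker.filterNew([{ url: "a" }, { url: "c" }], 500).length); // 1
tracker.purge(1400); // "b" was last seen at 0, so it is forgotten
console.log(tracker.lastSeen.has("b")); // false
```

Because an ad counts as new the first time its URL shows up, it no longer matters whether the ad became visible a little after its post date.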
Glad we got that nailed down, although I'm not 100% sure what you meant in that second to last post, first paragraph. (Kijiji website HTML uses start date... ...and the examples above confirm it's definitely not start date)?
I appreciate the suggestion, I was just mulling over how to determine what is new when it popped up. Have you any idea if URLs change when an ad is updated/bumped?
Whoops, typo. I meant to say that the examples above confirm it's definitely not creation date (namely, the apartment listings). I've edited my original comment.
Public-facing URLs contain the ad category, location, and ID. Ad IDs are unique and the location and category can't be changed after posting (see this Kijiji help page). So the URLs shouldn't change. I also doubt Kijiji would do that since anybody could have saved the URL after posting. Though this makes me think that it might be worth making the unique ID easily available on Ad objects regardless. I'll look into it when I get some more free time. Let me know if you run into trouble using URLs in the meantime.
Thanks for clearing that up, I was pretty confused.
I just did a little test regarding the URLs. I posted an ad titled "New Good Phone" with some bogus information and grabbed the URL, which ended in "new-good-phone/1558341164". I then changed the ad title to "New Bad Phone" and grabbed the URL. The end had changed to "new-bad-phone/1558341164". Visiting the ad through the original URL still worked, redirecting you to the proper URL with the new title. I then got curious and tried to visit "...asdjwjiajsfhaheiuwf/1558341164" and that also redirected me to the ad.
From there I started removing random chunks of the URL and it seems that if you enter https://kijiji.ca/v-/YourAdID you will be redirected to the ad corresponding to that ID. Those are the only parts of the URL that matter. Interestingly you cannot remove or alter the "v-" bit, that must be the first bit of text after kijiji.ca/.
All that said I think making the unique ID available directly would be super useful considering it's a vital piece of information.
No trouble using URLs so far, it's far more reliable than the old method.
Good find! It makes sense that most of the URL doesn't matter since an ad's ID uniquely identifies it. The rest of the URL is to make it easier for humans to read at a glance. I found that you can also do https://www.kijiji.ca/v-view-details.html?adId=YourAdId. It looks like they're pretty relaxed about URLs.
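Given that, one way to get a stable key out of a URL is to keep only the ID. Here is a small sketch covering the two URL forms mentioned in this thread. The `extractAdId` helper is hypothetical, and it assumes ad IDs are purely numeric, as in the examples seen here.

```javascript
// Hypothetical helper: pull the numeric ad ID out of either URL form discussed above.
function extractAdId(adUrl) {
    const url = new URL(adUrl);

    // Form 1: https://www.kijiji.ca/v-view-details.html?adId=<id>
    const fromQuery = url.searchParams.get("adId");
    if (fromQuery && /^\d+$/.test(fromQuery)) {
        return fromQuery;
    }

    // Form 2: the ID is the last path segment, e.g. https://kijiji.ca/v-/<id>
    const lastSegment = url.pathname.split("/").filter(Boolean).pop();
    return lastSegment && /^\d+$/.test(lastSegment) ? lastSegment : null;
}

console.log(extractAdId("https://kijiji.ca/v-/1558341164"));
console.log(extractAdId("https://www.kijiji.ca/v-view-details.html?adId=1558341164"));
console.log(extractAdId("https://www.kijiji.ca/")); // no ID in this URL
```

The first two calls return the string "1558341164" and the last returns null. Keying on the ID rather than the full URL means a title change does not make an ad look new.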
It also makes sense that they redirect the old URL to the new one, however I can see this being a problem when searching if the URL is what is used to check if the ad has been seen before (granted, I don't think ad titles change very often). I'll expose an id property on Ad objects.
Exposed ad ID in PR #59. The latest version is on NPM. It should be easier for you to track seen ads now.
@mwpenny I appreciate the help, thanks!
Issue: Ads aren't retrieved while using the Mobile API scrape method for up to a minute after they are posted.
Details: I have been playing around with this module and have noticed an odd behavior while testing. Let's say we want to write to the console every time a new ad is posted. To do this, we compare the time of our last scrape against the value of the date stored in each ad object retrieved by the scraper. If any ads were posted after our last scrape time, notify the user about them.
The issue here is that, at least when using the Mobile API, certain ads won't be picked up for up to a minute after their post date found in the Ad object. If we change the above example around a little bit:
There is a good chance this ad was not actually picked up during the 5:10pm scan. At that point the "last scan" value would have been something earlier, say 5:05pm, and the ad would have been seen as new. As the ad was not actually picked up during that scrape, its corresponding Ad object is not available to be examined at 5:10pm. Our current scrape at, let's say, 5:15pm picks up the ad, but at this point it is considered to be old since it was posted prior to the last time we ran a scan. The user never finds out about this new posting.
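To illustrate, here is a small simulation of that sequence. The times and the ad are made up, and `findNewAds` is a stand-in for the comparison described above, not code from the module.

```javascript
// Naive check described above: an ad is "new" if its date is after the last scan.
function findNewAds(ads, lastScanTime) {
    return ads.filter((ad) => ad.date > lastScanTime);
}

// Hypothetical timeline: the ad is dated 5:09:30pm but only shows up in
// results some time after the 5:10pm scan has already run.
const at = (time) => new Date(`2021-03-29T${time}`);
const ad = { title: "Example ad", date: at("17:09:30") };

// 5:10pm scan: the last scan was at 5:05pm, but the ad is not in the results yet.
const scan1 = findNewAds([], at("17:05:00"));

// 5:15pm scan: the ad is returned now, but it is dated before the 5:10pm scan.
const scan2 = findNewAds([ad], at("17:10:00"));

console.log(scan1.length, scan2.length); // 0 0
```

Neither scan reports the ad, even though it would have counted as new had it been returned at 5:10pm.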
Notes: I have yet to test this using the HTML scraper; that is next on the list. If the behavior is consistent I believe there would be two possible culprits:
Of course if using the HTML scraper fixes the issue, then the problem would either be found in the Mobile API or code specific to its handling within the module.
I will update if I find out more.