mwpenny / kijiji-scraper

A lightweight node.js module for retrieving and scraping ads from Kijiji
MIT License
95 stars, 43 forks

Category Attributes #65

Open MaximumPotato opened 1 year ago

MaximumPotato commented 1 year ago

Many of the category attributes (at least under real estate listings) are returned as numeric values rather than the text you read on the website. For some attributes this is simply a Yes/No answer being translated to 1 or 0, but attributes with more than two options can come back with values 0, 1, 2, etc. In #55 I believe you referred to these as internal values. This was never an issue with the categories I've scraped in the past, but the real estate listings have a lot of attributes. Have you ever done work to translate these internal values to their human-readable counterparts? I wanted to check before I do it myself by hand.

mwpenny commented 1 year ago

Hi there. Yes/no should be converted to true/false. This was broken in the past and may have recently broken again. I'll look into that.

For other numerical values, could you provide some links to ads that trigger the unexpected behavior so I can look at the raw API responses? The human readable information may already be present (it has been for other attributes in the past).

MaximumPotato commented 1 year ago

Sure, here's a quick one: https://www.kijiji.ca/v-apartments-condos/city-of-halifax/condo-halifax-harbour-waterfront-boardwalk/1650025170

And here's the toString for that. About halfway through you'll see that smoking permitted has a value of 2:

`[02/10/2023 @ 18:14] Condo Halifax Harbour - Waterfront Boardwalk https://www.kijiji.ca/v-apartments-condos/city-of-halifax/condo-halifax-harbour-waterfront-boardwalk/1650025170

mwpenny commented 1 year ago

Perfect, thanks. I'll have time to look into this tomorrow.

mwpenny commented 1 year ago

I took a look at what Kijiji is returning for this ad and others like it.

The attributes you are talking about are enums. For example, smokingpermitted has 3 possible values:

* No (0)

* Yes (1)

* Outdoors only (2)

You could create this mapping yourself for every possible enum, but that would be pretty time consuming. Instead, it's possible to download the information about all enums from Kijiji itself and generate helper types as I have done for location and category IDs. I think this is worth doing. Enums and better attribute handling in general would really improve kijiji-scraper.
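As a stopgap, a hand-written lookup for a single enum could look like the sketch below. The numeric values for smokingpermitted are the ones reported in this thread; `SMOKING_PERMITTED_LABELS` and `describeSmokingPermitted` are hypothetical names and not part of kijiji-scraper.

```javascript
// Hand-written lookup for one enum attribute. The internal values are the
// ones reported in this thread; the names are hypothetical helpers and are
// not provided by kijiji-scraper.
const SMOKING_PERMITTED_LABELS = new Map([
    ["0", "No"],
    ["1", "Yes"],
    ["2", "Outdoors only"]
]);

// Returns the human-readable label, or the raw value unchanged when the
// internal value is not in the table (for example, if a new option appears).
function describeSmokingPermitted(value) {
    const key = String(value);
    return SMOKING_PERMITTED_LABELS.has(key)
        ? SMOKING_PERMITTED_LABELS.get(key)
        : value;
}

console.log(describeSmokingPermitted(2));
```

Returning unknown values unchanged keeps the raw data visible instead of hiding it behind a guess.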

However, doing it right will take a bit of work. Ideally, this library would provide calling code with a list of all attributes expected in a response, along with their types. That way users wouldn't need to guess/experiment as much when using the scraper.
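To make the idea concrete, here is one possible shape for such a list. Nothing below exists in kijiji-scraper today: `EXPECTED_ATTRIBUTES`, the descriptor format, and `findUnexpectedAttributes` are all assumptions for illustration. Only the attribute names smokingpermitted and yard come from this thread.

```javascript
// Hypothetical description of the attributes expected for one category,
// along with their types. This format is an assumption, not a library API.
const EXPECTED_ATTRIBUTES = {
    smokingpermitted: { type: "enum", values: ["No", "Yes", "Outdoors only"] },
    yard: { type: "boolean" }
};

// Compares scraped attributes against the expected list and returns a
// description of every attribute that is unknown or has the wrong type.
function findUnexpectedAttributes(attributes, expected) {
    const problems = [];
    for (const [name, value] of Object.entries(attributes)) {
        const known = Object.prototype.hasOwnProperty.call(expected, name);
        if (!known) {
            problems.push(`${name}: not in the expected list`);
            continue;
        }
        const descriptor = expected[name];
        if (descriptor.type === "boolean" && typeof value !== "boolean") {
            problems.push(`${name}: expected boolean, got ${typeof value}`);
        } else if (descriptor.type === "enum" && !descriptor.values.includes(value)) {
            problems.push(`${name}: expected one of ${descriptor.values.join(", ")}`);
        }
    }
    return problems;
}

console.log(findUnexpectedAttributes({ yard: 1, smokingpermitted: "Yes" }, EXPECTED_ATTRIBUTES));
```

With a list like this, calling code could check a response up front rather than discovering odd values by experiment.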

I've created #67 to track this and will work on it when I have more free time (likely not for a few months). For now, the easiest thing to do is compare the internal values yourself (or submit a PR 😃).


Also, some enums like yard will only ever have the values "Yes" and "No". Kijiji probably avoided a boolean here to leave room for more values in the future, but for all intents and purposes attributes like these are booleans. The scraper should already be converting them to true/false. On the latest version of the scraper, I'm only able to reproduce your bad boolean output when using HTML scraping. Have you enabled that? Or maybe you're on an old version before API scraping was added? Updating and/or switching to API-based scraping (default behavior) should result in proper boolean parsing.

Regardless, I've fixed the boolean scraping issue in the HTML scraper. The latest version on NPM has the fix.
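If updating isn't possible right away, the same conversion can be approximated in calling code. This is a rough sketch, not the scraper's actual implementation: `coerceYesNoAttributes` is a hypothetical helper, and the caller has to say which attribute names are yes/no.

```javascript
// Workaround sketch: convert 0/1 (numbers or strings) to false/true for the
// attributes the caller names as yes/no. Hypothetical helper, not part of
// kijiji-scraper.
function coerceYesNoAttributes(attributes, yesNoNames) {
    const result = { ...attributes }; // leave the original object untouched
    for (const name of yesNoNames) {
        const raw = result[name];
        if (raw === 0 || raw === "0") {
            result[name] = false;
        } else if (raw === 1 || raw === "1") {
            result[name] = true;
        }
        // Anything else (already a boolean, missing, or unrecognized) is left as-is.
    }
    return result;
}

console.log(coerceYesNoAttributes({ yard: 1, smokingpermitted: 2 }, ["yard"]));
```

Only the named attributes are touched, so enums such as smokingpermitted keep their internal values.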

r-e-w-m commented 1 year ago

> The attributes you are talking about are enums. For example, smokingpermitted has 3 possible values:
>
> * No (0)
> * Yes (1)
> * Outdoors only (2)

Weirdly enough, there are parameters such as petsallowed (in apartment listings) that return a mix of numeric values and strings. petsallowed in particular will show up as either 0, 1, or limited. For attributes with more than 2 options, Kijiji seems to switch inconsistently between strings (as with petsallowed) and numeric values (as with smokingpermitted).
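A sketch of normalizing that mix in calling code is shown below. It assumes 0 and 1 mean no and yes for petsallowed, which has not been confirmed against Kijiji's own data; `normalizePetsAllowed` is a made-up name.

```javascript
// Normalizes the mixed values seen for petsallowed (0, 1, or "limited")
// into one consistent set of strings. Treating 0 as "no" and 1 as "yes"
// is an assumption; the function name is hypothetical.
function normalizePetsAllowed(value) {
    const key = String(value).trim().toLowerCase();
    if (key === "0") return "no";
    if (key === "1") return "yes";
    if (key === "limited") return "limited";
    return "unknown"; // missing or unrecognized value
}

console.log([0, "1", "limited", undefined].map(normalizePetsAllowed));
```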

> You could create this mapping yourself for every possible enum, but that would be pretty time consuming. Instead, it's possible to download the information about all enums from Kijiji itself and generate helper types as I have done for location and category IDs. I think this is worth doing. Enums and better attribute handling in general would really improve kijiji-scraper.

I would be interested in knowing how you went about generating the helper types and obtaining the enums for the Location and Category IDs. Being able to grab a list of valid attribute types at any given time would be helpful.

Tangentially, earlier today I was reading through the open issue from 2019 regarding search parameter inconsistency and such. This was after running into trouble trying to filter my searches via kijiji-scraper, rather than my own code as I had been doing. Perhaps I was formatting incorrectly, but I couldn't seem to filter posts with attributes like heat/water/hydro included or pets allowed.
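For reference, filtering in calling code after the search returns can be sketched as below. It assumes each ad exposes its scraped attributes as a plain `attributes` object; `filterAdsByAttributes` is a hypothetical helper and the sample ads are made up for illustration.

```javascript
// Client-side filter sketch. `wanted` maps an attribute name to the list of
// raw values that should be accepted. Hypothetical helper, not a library API.
function filterAdsByAttributes(ads, wanted) {
    return ads.filter(ad =>
        Object.entries(wanted).every(([name, accepted]) => {
            const value = ad.attributes ? ad.attributes[name] : undefined;
            return accepted.includes(value);
        })
    );
}

// Made-up sample data standing in for search results.
const sampleAds = [
    { title: "Ad A", attributes: { petsallowed: 1, yard: true } },
    { title: "Ad B", attributes: { petsallowed: 0, yard: true } },
    { title: "Ad C", attributes: { petsallowed: "limited", yard: false } }
];

const matches = filterAdsByAttributes(sampleAds, { petsallowed: [1, "limited"] });
console.log(matches.map(ad => ad.title));
```

Listing the accepted raw values explicitly sidesteps the string-versus-number inconsistency, at the cost of fetching ads that are then discarded.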

> Ideally, this library would provide calling code with a list of all attributes expected in a response, along with their types. That way users wouldn't need to guess/experiment as much when using the scraper.

Again, more than a little rusty here. Are you saying that the library should ping Kijiji for valid attributes? Or just that it should provide a mapping of known attributes, as with Location and Category? (Or perhaps something else entirely, haha.)

> Also, some enums like yard will only ever have the values "Yes" and "No". Kijiji probably avoided a boolean here to leave room for more values in the future, but for all intents and purposes attributes like these are booleans. The scraper should already be converting them to true/false. On the latest version of the scraper, I'm only able to reproduce your bad boolean output when using HTML scraping. Have you enabled that? Or maybe you're on an old version before API scraping was added? Updating and/or switching to API-based scraping (default behavior) should result in proper boolean parsing.

Although my code is a few years old, I checked and my scraper type should be using the default value of API. I peeked at the readme, which mentions the distinction between API and HTML scraping, so the version I have is new enough for API to be the default scraping method. I will try updating and hope nothing breaks. If it is working properly, running toString on an ad object should print true / false for most of those 0 / 1 parameters, correct?