OSRS Wiki Data Scraping Method Review

osrsbox commented 5 years ago

A couple of recent issues (specifically #124 and #125) have exposed some small inconsistencies in how data sourced from the OSRS Wiki is aligned with the item database when adding metadata.

When I first built the item database, the OSRS Wiki did not have the RuneScape:Lookup API search - the same one discussed in the issues stated above. The OSRS Wiki also (AFAIK) did not have any ID numbers on any pages - it was a while ago that I started this project... when the OSRS Wiki was still hosted on Fandom. Instead I performed API queries and extracted all pages from the Category:Items category... then searched using the page name. This required normalizing a lot of the names, and writing specific parsers for determining the version number for each page (where multiple items were present on the same page).

In terms of future development, it seems sensible to leverage either of the following options:

- Use the RuneScape:Lookup API query for every item ID
- Use the id property in the Infobox Item wikitext template

The monsters-api development branch uses the id property to search for valid pages that have been extracted using the current OSRS Wiki extraction tools (based on category). Continuing with this development/data collection method seems sensible for a number of reasons:

- Dramatically reduces the number of API queries (my current category method scrapes 7,300 pages, compared to the 23,000+ Lookup API queries that would be needed)
- Keeps the existing workflow, which would be beneficial (read: easier!)
- Still provides access to the id property in the Infobox Item wikitext template

Going forward: The first thing that needs to be tested is how much coverage the OSRS Wiki has for item IDs. From there, a logical decision can be made. This issue will be updated in due course with additional information.
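A minimal sketch of how that coverage check could be computed, assuming the item IDs from the database and the IDs found on wiki pages are already available as plain collections. The function name `id_coverage` and the toy numbers are illustrative only, not part of the project.

```python
def id_coverage(known_ids, wiki_ids):
    """Compare item IDs we know about against IDs found on wiki pages.

    Returns (covered, missing, ratio), where ratio is the fraction of
    known IDs that also appear on the wiki.
    """
    known = set(known_ids)
    found = known & set(wiki_ids)
    missing = sorted(known - found)
    ratio = len(found) / len(known) if known else 0.0
    return sorted(found), missing, ratio


# Toy data only; real inputs would come from the item database and the
# wiki extraction step.
covered, missing, ratio = id_coverage([2, 4, 6, 8], [4, 8, 10])
print(covered, missing, ratio)
```

IDs that appear on the wiki but not in the known set (10 in the toy data) are ignored here; they could be reported separately if they turn out to matter.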

jakebellotti commented 5 years ago

Based on my own testing (which covered MOST, but not all, items) there were about 200 valid items that didn't have item IDs attached to the page. I will try to give the total number once I end up scraping every page.

jakebellotti commented 5 years ago

Furthermore, there were 131 instances where the correct page was resolved, but there were multiple item variants on those pages, and it did not specify which variant it was.

osrsbox commented 5 years ago

Hey @jakebellotti - thanks for the feedback. My numbers seem much different. Based on the current items in this project, I get the following results:

Missing item IDs: 1,652
Missing item names: 1,082

Some of the more prominent items are: construction items, pets, caskets/clue scrolls and some Unobtainable items. The others are a collection of the weird and strange items in OSRS.

FYI - this is how I gather the data: I didn't want to brute force the OSRS Wiki API with tens of thousands of requests, so I used the Category:Items, Category:Pets, Category:Construction and Category:Furniture data available in this project - basically all pages tagged as items. I then extracted the item IDs from the Infobox Item MediaWiki template. When searching for item IDs, I skipped all noted and placeholder items. I am unsure how the results would vary using the RuneScape:Lookup service.
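For illustration, a rough sketch of pulling the id property out of already-downloaded wikitext with a regular expression. It assumes the template writes the value as `|id = ...`, and that pages with several item versions use numbered parameters such as `id1` and `id2`; both the parameter naming and the sample wikitext are assumptions for this example, not taken from the project's extraction code.

```python
import re

# Matches "|id = 123" as well as numbered forms such as "|id2 = 456".
# The numbered naming (id1, id2, ...) is an assumption about the template.
ID_PARAM = re.compile(r"\|\s*id(\d*)\s*=\s*([^|\n}]*)")


def extract_item_ids(wikitext):
    """Map a version suffix ('' when unversioned) to a list of item IDs."""
    ids = {}
    for version, raw in ID_PARAM.findall(wikitext):
        values = [int(v) for v in re.findall(r"\d+", raw)]
        if values:  # skip parameters that are present but empty
            ids[version] = values
    return ids


# Made-up wikitext for demonstration purposes.
sample = """{{Infobox Item
|version1 = Normal
|version2 = Broken
|name = Example item
|id1 = 100
|id2 = 101, 102
}}"""

print(extract_item_ids(sample))
```

A regular expression keeps the sketch dependency-free; a dedicated wikitext parser would be more robust against nested templates.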

Is the main difference between our results caused by the base set of item IDs that each of us is using? There are still a few Null and empty string items in my database (but I excluded these from testing). Where are you getting your list of valid item IDs from?

I find the redirection to the relevant page interesting. Do you have any examples that I could investigate further? Thanks for the help/feedback, it is very useful comparing results and getting insight on other methods.

jakebellotti commented 5 years ago

I used the RuneScape:Lookup for all my requests. I prepared the data from the cache item definitions, filtering out null, noted and placeholder items. The lookup takes the item ID and name. It first searches for a page with the ID on it; if no such page is found, it falls back to the name and builds the address from that.
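The filter-then-resolve flow described above can be sketched locally as follows. This is a stand-in built on in-memory dictionaries, not the RuneScape:Lookup service itself; the field names (`name`, `noted`, `placeholder`) and the two index dictionaries are hypothetical.

```python
def is_candidate(item):
    """Keep items that are not null, noted, or placeholder entries."""
    name = item.get("name")
    if not name or name.lower() == "null":
        return False
    return not item.get("noted") and not item.get("placeholder")


def resolve_page(item_id, item_name, pages_by_id, pages_by_name):
    """Return (page_title, how), where how is 'id', 'name', or None."""
    page = pages_by_id.get(item_id)
    if page is not None:
        return page, "id"
    # Fallback: build a lookup key from the item name.
    key = item_name.strip().replace(" ", "_").lower()
    page = pages_by_name.get(key)
    if page is not None:
        return page, "name"
    return None, None


# Toy data for demonstration.
items = [
    {"id": 1, "name": "Example sword"},
    {"id": 2, "name": "Example sword", "noted": True},
    {"id": 3, "name": "null"},
    {"id": 4, "name": "Example shield"},
]
pages_by_id = {1: "Example sword"}
pages_by_name = {"example_shield": "Example shield"}

resolved = {
    item["id"]: resolve_page(item["id"], item["name"], pages_by_id, pages_by_name)
    for item in items
    if is_candidate(item)
}
print(resolved)
```

Recording how each page was resolved makes it possible to count name fallbacks separately, which is where ambiguous variant pages are most likely to come from.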

Total number of unique pages: 7269
Total of items that returned a page: 10728

As for the 131 that I said didn't specify the variant, I think those were falling back to using the name to build the URL. I actually can't remember if the numbers included the 'strange' items. I will have to run my tests again. Some of the items that didn't resolve to a page before are now resolving, so the number of items with issues may be even lower.

Also, @osrsbox just wondering, do you have Discord or any other good form of contact? It would be good to be able to chat more, since we are actively working on the same type of project.

jakebellotti commented 5 years ago

"Valid" items need to be more clearly defined. Placeholder, noted and null items obviously are not valid.

The 'strange items' are a bit harder to filter out. The different types include:
- Items that are in the cache, but were from Christmas etc. events, and removed before OSRS came out
- Items used in interfaces (e.g. flatpack items have an item with identical name but different image)
- Unobtainable items defined by the OSRS Wiki themselves
- Strange, duplicate items that have different properties but the same image (e.g. the Abyssal Whip)

jakebellotti commented 5 years ago

As for items (which are not really items, just there for an icon placeholder) such as NPC and construction items, a possible way to filter them out is by reading the client scripts. Script 661.rs2asm is the script that controls the skill guide. An example of the data obtained: For slayer monster items:

LABEL13840:
   iconst 5
   iconst 4133
   sconst "Crawling hands"
   return
   jump LABEL14069
LABEL13845:
   iconst 7
   iconst 4521
   sconst "Cave bugs"
   return
   jump LABEL14069
LABEL13850:
   iconst 10
   iconst 4134
   sconst "Cave crawlers"
   return
   jump LABEL14069
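A small sketch of how entries like these could be pulled out of a dumped script file. It reads each `iconst`/`iconst`/`sconst` triple and treats the first constant as a level requirement and the second as the icon's item ID; that interpretation is an assumption based on the snippet above, and the function name is made up for this example.

```python
import re

# One skill guide entry: two integer constants followed by a string constant.
# \s+ also matches newlines, so one-instruction-per-line input works too.
ENTRY = re.compile(r'iconst\s+(\d+)\s+iconst\s+(\d+)\s+sconst\s+"([^"]*)"')


def parse_skill_guide_entries(asm_text):
    """Return a list of (level, item_id, label) tuples found in the text."""
    return [
        (int(level), int(item_id), label)
        for level, item_id, label in ENTRY.findall(asm_text)
    ]


snippet = '''
LABEL13840:
   iconst 5
   iconst 4133
   sconst "Crawling hands"
   return
   jump LABEL14069
LABEL13845:
   iconst 7
   iconst 4521
   sconst "Cave bugs"
   return
   jump LABEL14069
LABEL13850:
   iconst 10
   iconst 4134
   sconst "Cave crawlers"
   return
   jump LABEL14069
'''

entries = parse_skill_guide_entries(snippet)
icon_item_ids = {item_id for _, item_id, _ in entries}
print(entries)
```

The resulting `icon_item_ids` set is the kind of list that could feed an exclusion filter for items that only exist as interface icons.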

osrsbox commented 5 years ago

@jakebellotti - I like the definition of valid items you have provided. I want to work on removing these from the item database. To save myself some research time... how do you extract the client scripts? And how do you map the client script (ID?) number to what it is used for? What I think would be useful is a selection of (probably static) JSON files that define invalid items that can be loaded and used to skip specific item IDs when processing items.
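As a rough sketch of the static JSON idea, the snippet below loads a set of invalid item IDs and skips them while processing. The JSON layout (an object keyed by item ID with a free-text reason) and the helper names are hypothetical, chosen only to illustrate the approach.

```python
import json


def load_invalid_ids(json_text):
    """Parse a JSON object keyed by item ID and return the IDs as a set of ints."""
    data = json.loads(json_text)
    return {int(item_id) for item_id in data}


def iter_valid(items, invalid_ids):
    """Yield only the items whose ID is not listed as invalid."""
    for item in items:
        if item["id"] not in invalid_ids:
            yield item


# Hypothetical file contents; a real file would be read from disk.
invalid_json = '{"4133": "skill guide icon", "4521": "skill guide icon"}'
invalid_ids = load_invalid_ids(invalid_json)

items = [{"id": 4133}, {"id": 4151}, {"id": 4521}]
kept = [item["id"] for item in iter_valid(items, invalid_ids)]
print(kept)
```

Keeping a reason next to each ID makes the file easier to review by hand when an item is later found to be wrongly excluded.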

After reading your method for extracting data from the OSRS Wiki, I think the primary difference is that I was not using the name in my search to fall back on when the id search failed. I will run it again soon to see the results. I think once the invalid items are removed, the OSRS Wiki lookups will be much more effective, and hopefully fix the current item problems you posted issues on earlier.

jakebellotti commented 5 years ago

@osrsbox I used the RuneLite cache dumper for it. It's very similar to the way you would dump the ItemDefinitions. The script IDs are static; they don't change. The two biggest files when you do the dump are the ones that define what to put on the skill guide interface. But yes, removing the invalid items is crucial. I am at a point now where my scraper is well optimised and everything is pretty much working the way I want it to, but it doesn't feel right because I keep having to put 'hacky' code in to avoid issues when it is fed a page that isn't correct.

Once they are removed, I wonder if there will be any pages that will fall back at all.

I might try to change my scraping method too. Currently I get all item IDs, filter them, and then loop through the IDs, retrieving whatever page is returned, assuming it is the correct page, and scraping it. I was thinking of instead looping through all the unique pages I have and scraping only those that have an ID on the page.

jakebellotti commented 5 years ago

From my latest testing: [nullURL=139,nullSearchResult=217,nullItemProperties=552]
- Null URL is when it returned a URL, but it was not a valid one. For example, some were linking to a search page only.
- Null search result is when it linked to a page with variants, but did not specify which variant it was.
- Null item properties is when it linked to a page, but there was an error parsing the data. Some of those results may be fixed later, but I know that mostly it was due to the page not being a valid item page, and being something like an NPC page instead.
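One way such a breakdown could be tallied is sketched below. The result fields (`url`, `variants`, `selected_variant`, `properties`) and the check for `Special:Search` in the URL are assumptions made for this example; the actual scraper's data model is not shown in this thread.

```python
from collections import Counter


def classify(result):
    """Assign one lookup result to a failure bucket, or 'ok'."""
    url = result.get("url") or ""
    # A missing URL or a link to a search page counts as a null URL.
    if not url or "Special:Search" in url:
        return "nullURL"
    # Several variants on the page but none selected.
    if len(result.get("variants", [])) > 1 and result.get("selected_variant") is None:
        return "nullSearchResult"
    # Page resolved, but nothing could be parsed from it.
    if not result.get("properties"):
        return "nullItemProperties"
    return "ok"


def tally(results):
    """Count how many results fall into each bucket."""
    return Counter(classify(r) for r in results)


# Toy results for demonstration.
results = [
    {"url": "https://example.org/w/Special:Search?search=x"},
    {"url": "https://example.org/w/Page", "variants": ["a", "b"], "selected_variant": None},
    {"url": "https://example.org/w/Npc", "properties": {}},
    {"url": "https://example.org/w/Item", "properties": {"name": "Item"}},
]
counts = tally(results)
print(dict(counts))
```

Each result lands in the first bucket it matches, so the counts never overlap.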

osrsbox commented 5 years ago

Thanks @jakebellotti for all the help on this issue and for the information you provided on your item scraping method, and the results. Also, the client scripts method was amazing, and really helped determine items that are not really items. I ended up using a new invalid-items.json file, which still needs some tweaking, but using it along with the id property from the OSRS Wiki produced much more accurate data. I manually checked some of the items you listed (graceful and quest items) and the new data is much more accurate.

In the future I will continue looking at duplicate items. I still keep an entire database, including duplicates and placeholders/noted items - as I want the database to be 100% complete, but will look at adding another property in the future to help identify items more easily. Thanks again.

osrsbox / osrsbox-db

OSRS Wiki Data Scraping Method Review #126