Closed devinmatte closed 3 years ago
I'd love to help with some of the data entry for this, and honestly if you need help maintaining this, @eevee, I'd love to help with that, too.
How does this typically get updated? How do people actually get a hold of structured data about a new pokemon generation? Is it all manual?
I've been extracting data and have a first version so far: https://gist.github.com/NoelDavies/a7e8e3d959a0619362b291bb10a8fd8d
(It's not in the veekun pokedex format because I found that the data from SwSh doesn't align with the veekun format)
@NoelDavies It looks like nice progress! How are you extracting that data from the Switch games?
Is there any way anyone can help with that as well?
+1, let me know how to help if needed
I found a JSON dump on a gist (https://gist.github.com/NoelDavies/a7e8e3d959a0619362b291bb10a8fd8d) but I'm not sure where it came from. It has a few issues with conflicting national dex numbers that I had to filter out, but it might be a good start.
bulbapedia (and pretty much all other MediaWiki based sites) has an API that you could use to collect data relatively easily. For example, https://bulbapedia.bulbagarden.net/w/api.php?action=parse&page=Grookey+(Pok%C3%A9mon)&prop=wikitext&format=json
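For anyone who wants to try that API, here is a minimal sketch of building the request URL shown above. It only uses the query parameters that already appear in that URL (`action`, `page`, `prop`, `format`); `buildParseUrl` is a hypothetical helper name made up for this example, not part of any library.

```javascript
// Sketch only: builds a MediaWiki "parse" API URL like the one quoted above.
// `buildParseUrl` is a hypothetical helper name, not a library function.
const buildParseUrl = (apiBase, pageTitle) => {
  const url = new URL(apiBase)
  url.searchParams.set('action', 'parse')
  url.searchParams.set('page', pageTitle) // encoding is handled by URLSearchParams
  url.searchParams.set('prop', 'wikitext')
  url.searchParams.set('format', 'json')
  return url
}

const grookeyUrl = buildParseUrl(
  'https://bulbapedia.bulbagarden.net/w/api.php',
  'Grookey (Pokémon)'
)
console.log(grookeyUrl.href)
```

Letting `URLSearchParams` do the encoding avoids hand-escaping titles that contain spaces, parentheses or accented characters.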
We're definitely not going to source our data from Bulbapedia (or any other wiki).
Why? Their info is accurate and actually filled out...
Honestly, without latest gen info, this entire resource becomes rather useless compared to something like Bulbapedia which actually has info. I completely gave up trying to use the pokeapi which uses this as a backend for anything as it's just not worth the time if it doesn't have the latest info.
It's been months since Sword & Shield were released, and there's been zero progress here... would be nice to have something happen.
If it helps anyone, here is json and sprites for the 8th gen Pokemon. The json is just id, name and type names, as that's all I needed. The sprites are also a bit small, as that's all I could source easily.
Also here's the Node script I used to scrape it from Bulbapedia:
    const cheerio = require('cheerio')
    const https = require('https')

    // Fetch a URL over HTTPS and resolve with the response body as a string.
    // Pass `binary = true` to read the body with 'binary' encoding (e.g. for images).
    const get = (url, binary) => new Promise((resolve, reject) => {
      https
        .get(url, resp => {
          if (binary) {
            resp.setEncoding('binary')
          }
          let data = ''
          resp.on('data', (chunk) => {
            data += chunk
          })
          resp.on('end', () => {
            resolve(data)
          })
        })
        // Reject the promise instead of throwing inside the event handler,
        // so callers can catch network errors.
        .on('error', reject)
    })

    // Scrape id, sprite URL, name and types for each Gen 8 pokemon
    // from the Bulbapedia index list.
    const get8thGenPokemon = async () => {
      const html = await get([
        'https://bulbapedia.bulbagarden.net',
        '/wiki/List_of_Pokémon_by_index_number_(Generation_VIII)'
      ].join(''))
      const $ = cheerio.load(html)
      const pokemon = {}
      $('#mw-content-text table.roundy tr').each((i, row) => {
        const cells = $(row).find('td')
        if (!cells.length) {
          // header rows have no <td> cells
          return
        }
        const id = $(cells[1]).text().trim()
        if (Number(id) < 810) {
          // 810 Grookey is the first Gen 8 pokemon
          return
        }
        // TODO get higher res images
        const image = $(cells[2]).find('img').attr('src')
        const name = $(cells[3]).text().trim()
        const types = []
        // every cell after the name column holds a type; skip duplicates
        cells.each((j, cell) => {
          if (j < 4) {
            return
          }
          const type = $(cell).text().toLowerCase().trim()
          if (types.includes(type)) {
            return
          }
          types.push(type)
        })
        pokemon[id] = {
          image,
          name,
          types
        }
      })
      return pokemon
    }

    module.exports = {
      get,
      get8thGenPokemon
    }
I have compiled all the data from the original dumps by @ kwsch and @ Kaphotics: https://github.com/route1rodent/swordshield-data (@eevee @magical, in case you are interested).
It might be easy since the metagame has not changed drastically; we have almost all the tables needed (except for deprecated and Dynamax/G-Max moves).
dumps: https://github.com/route1rodent/swordshield-data/tree/master/data/raw
My repo has a script to convert some of the dumps to JSON, but not all of them (I haven't had enough time). Many of them require different parsing techniques.
@route1rodent so in theory what other work needs to be done to take your stuff and incorporate it into here?
Sorry for the newbie question, but I’ve been trying to figure out how this project is structured for the better part of a month, but I don’t quite understand what format everything needs to be in to get it working. I have the time to do it, but I need some help getting up to speed on what needs to be done.
@route1rodent nice! that'd work for me as a trustworthy source, if someone wanted to figure out how to import the data into veekun.
i recall that gen 7 data was loaded into the database as yaml. i’ve converted quite a lot of the data from @route1rodent’s repository to yaml—would that be of any use? i’m not sure what the gen 7 yamls looked like.
@kgsbowtie There's no set process, really, but generally what we've done is write scripts that read the data (usually from a ROM) and either output sql statements to a file or talk directly to the database to add the data. There's some documentation of the tables at https://veekun.github.io/pokedex/main-tables.html and some simple sample scripts in https://github.com/veekun/pokedex/tree/master/scripts.
@kyeugh The yaml stuff was experimental and unfinished. It was never merged into master; i don't know if it still works. If you want to fiddle around with it, the extraction code (rom -> yaml) is here https://github.com/veekun/pokedex/tree/yaml/pokedex/extract and the importing code (yaml -> db) is here https://github.com/veekun/pokedex/blob/yaml/scripts/sumo-yaml-to-db.py.
As you alluded to, it's the structure of the data that matters, not the format (YAML is a superset(ish) of JSON after all), so a simple mechanical conversion of @route1rodent's JSON to YAML doesn't really help.
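To make the "output sql statements to a file" approach a bit more concrete, here is a rough sketch of turning one extracted record into an `INSERT` statement. This is not how the existing veekun scripts do it; `sqlValue` and `toInsert` are hypothetical helpers, and the example row simply reuses the `pokemon_species_names` columns. A real importer talking to the database would more likely use parameterized queries than string building.

```javascript
// Hypothetical helper: render a JavaScript value as a SQL literal.
// Only null/undefined, numbers and strings are handled in this sketch.
const sqlValue = (value) => {
  if (value === null || value === undefined) return 'NULL'
  if (typeof value === 'number') return String(value)
  // Quote strings, doubling any embedded single quotes.
  return "'" + String(value).replace(/'/g, "''") + "'"
}

// Hypothetical helper: turn one row object into a SQL INSERT statement.
const toInsert = (table, row) => {
  const columns = Object.keys(row)
  const values = columns.map((column) => sqlValue(row[column]))
  return `INSERT INTO ${table} (${columns.join(', ')}) VALUES (${values.join(', ')});`
}

// Example row using the pokemon_species_names columns.
const statement = toInsert('pokemon_species_names', {
  pokemon_species_id: 890,
  local_language_id: 9,
  name: 'Eternatus',
  genus: null
})
console.log(statement)
```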
In my experience, working with YAML can be slow when you have large data sets (e.g. the pokemon_moves table or encounters). I don't know the state of the YAML branch; it has been around for a while and I think it's outdated. IMHO I'd rather stick to the current CSV format or JSON files with one object per line.
That way, with either CSV or that line-oriented JSON format, your code (e.g. a python script to fill tables with data) can read big files line by line, processing each line and moving on to the next, instead of loading a big chunk of data into memory, which is not efficient.
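A small sketch of the "one object per line" idea: the generator below parses records one line at a time from a sequence of text chunks, so only the current partial line has to be held in memory. The function name `jsonLines` and the sample field names are made up for illustration. A real script would feed it chunks read from a file rather than a hard-coded array.

```javascript
// Sketch: yield one parsed object per line from a sequence of text chunks,
// buffering only the trailing partial line between chunks.
function* jsonLines(chunks) {
  let buffer = ''
  for (const chunk of chunks) {
    buffer += chunk
    const lines = buffer.split('\n')
    buffer = lines.pop() // the last piece may be an incomplete line
    for (const line of lines) {
      if (line.trim() !== '') yield JSON.parse(line)
    }
  }
  // Handle a final line that has no trailing newline.
  if (buffer.trim() !== '') yield JSON.parse(buffer)
}

// The chunks deliberately split the second record in the middle,
// as a file stream might. Field names are illustrative only.
const chunks = [
  '{"pokemon_id": 810, "move_id": 33}\n{"pokemon_id": 8',
  '10, "move_id": 45}\n'
]
// Collected into an array only for the demo; normally you would
// process each row inside a for...of loop and discard it.
const rows = [...jsonLines(chunks)]
console.log(rows.length)
```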
These are my sources for the data dumps:
https://pastebin.com/u/Kaphotics https://pastebin.com/u/SciresM
Kaphotics and SciresM (same nicks on Twitter) are well known dataminers. They basically datamined all current and past games of the last gens. I trust their data, but it's sometimes tricky to parse.
They use the pkNX project for the data dumps.
Additional note: It turns out that Pokemon HOME contains TM/TR data of all Pokemon https://twitter.com/SciresM/status/1228085803093413888 :)
@kgsbowtie a script is already done in my project to dump the Pokemon data into JSON. That same script can be reused / modified to dump it to CSV in a way that veekun/pokedex can understand it.
Then we will have to write parsers for all the other data and do the same, because I haven't yet written any script to parse the rest of the raw data.
> Why? Their info is accurate and actually filled out...
> Honestly, without latest gen info, this entire resource becomes rather useless compared to something like Bulbapedia which actually has info. I completely gave up trying to use the pokeapi which uses this as a backend for anything as it's just not worth the time if it doesn't have the latest info.
> It's been months since Sword & Shield were released, and there's been zero progress here... would be nice to have something happen.
Completely agree. Wondering if there's any progress on this, as I (like many, many others) am dependent on PokeAPI, which uses veekun, and most of the 17,000,000 API calls every month are from users who want to use the most recent games. Said games have been out for many months now, and there is no short supply of people willing to help. I'm just wondering what the hold-up is.
I can't and won't speak for the maintainers, but judging from things like this blog post by Eevee I think the holdup is they don't want more data dumps falling from the sky, whether they come from a reliable source like Kaphotics/SciresM or a free-for-all hellfest like Bulbapedia - they want a pipeline where vendored scripts extract the data directly from primary sources, such that anyone who wants to can rerun the extraction process with their own copies of the game and verify that the data is correct.
@ketsuban, @justingolden21
As eevee wrote in her blog post,
> Part of it is that the team has never been very big, and all of us have either drifted away or gotten tied up in other things.
That's the real reason for the delay, more so than any technical hurdle.
> there is no short supply of people willing to help
If y'all are serious about wanting to help, please join irc and start asking questions. It's lonely in here :(
(Also we have a discord now i guess?? Join IRC or message eevee for an invite.)
I'd love to get a code tour if someone had time to bootstrap me. I tried using the repo before but wasn't too successful with the tooling, but the CSVs have been really helpful in the past.
is there anything I can do to help get this feature added?
Gen 8 update should happen IMHO, with or without game data dumps (using Bulbapedia or Serebii as source).
Requiring ROM data dump parsing as a rule keeps people away from contributing to this project, so I'd rather have this project as an open DB where everyone is free to contribute the missing pieces.
That requires of course other people to review the correctness of the data.
Since there is no clear direction for this project anymore, I'd rather go in that direction.
@ctrlaltdylang I'd say people should start updating the CSV files for gen 8 in a Pull Request; otherwise this will be delayed indefinitely.
We can join gen-8 related Pull Requests in a common branch different than master, to do a big release at the end.
@route1rodent I should have some downtime over the next couple weeks, I'll try to get a PR up by the end of the month. I plan on using the pokedex here to support a project I'm working on.
@route1rodent is there a good PR I can reference to make sure I'm adding everything correctly?
I'm just a random yokel, but I think it's very important NOT to source anything on Bulbapedia. Anything that is crowdsourced or informally gathered is unverifiable and shouldn't be codified as truth.
@jnsnow I understand, but it's better than having nothing, and you can always verify using multiple sources. Once the scripts for importing game data directly are ready then we can use those instead.
@ctrlaltdylang I don't know; for an update this large, editing the CSV files directly can be tricky and more error-prone. The easiest way is to load the CSV files into a SQLite DB using the CLI executable, make your changes in the DB with a SQLite client, and dump that back to the CSV files.
I think just editing CSV would probably be the way to go, I don't think anyone's going to dump a PR here where they've edited all of the CSV, I imagine it'll mostly be small(ish) changes, which can be reviewed. Or maybe that can be a rule. Just my two cents here, I'm very pro-edit the CSV directly in the interest of getting Gen 8 here out as quick as possible. Especially when a new DLC is 4 days away and another a couple more months out (new moves and new Pokemon to tackle)
This is the most accurate and up to date thing I've managed to come across for Gen 8. https://gist.github.com/NoelDavies/36be9a000d55e9f0a6e8ec509da2fc90 Hope this helps somebody (I started working on a node script to parse, and format this into a JSON object - just haven't finished it yet) - The issue we have is multiple forms (I've not implemented this into Pokélink yet) but they're all there
I'd be happy to help work on this- not sure where to start though.
Bump. Any updates?
Status update: I spent some time at the end of June seriously looking into this. It's been a while, so some of that time was spent figuring out Switch hacking tools and generally derusting my knowledge, but I got some preliminary data (level-up moves and egg moves) dumped from Sword. No ETA, but Gen 8 data will definitely happen sooner or later.
If you want to assist with the ripping process, feel free to join the veekun discord. I could use some help figuring out how to apply the 1.2.0 update of Sw/Sh. Also, just having other people to talk to who are interested in datamining would help with motivation.
That said, my top priority right now is rescuing the website from complete and total decay (cf https://github.com/veekun/spline-pokedex/issues/115), and ripping Gen 8 data is currently taking a back seat to that.
In the meantime, PokeAPI has created a temporary pokedex fork with the intent of adding data scraped from Bulbapedia and other sources. If you just want the data and can't wait, maybe check that out.
That's awesome to hear. Thanks for all the hard work, I'm sure many people appreciate it : )
@magical - We at Pokelink would love to help out if we can :)
Foreword: new Git user here, by the way, looking for some advice on how to contribute to this issue. I'm aware this repo relies on automated collection from game data first and automated collection from reputable data mining projects second, and, well, whatever I did, not so much. I'm aware I can make my own fork, but I'm interested to know whether any of this can be pulled by this project or the PokeAPI fork. Or perhaps there's some other alternative I'm not thinking of?
I've managed to scrape data (automated via a Node project) from Bulbapedia for some tables for the veekun database. The new data primarily comes from gen 8, but not exclusively (any generation-specific data I added that would conflict with other generations would exist in a table that has an existing gen/version/region column, so any previous-gen data can be filtered out).
The data mostly consists of (Gen8) pokemon species, variety (alternative variations included), form, their names (english), dex values, pokemon base stats, types. No moves, abilities, encounters etc.
This data can of course be dumped into changes for the csv files. If anyone is interested in me making a fork with changes to the csv files, and/or my scraping project code, let me know.
The new gathered data pertains to the following tables and columns in the database:
pokemon_species
Example Records
id | identifier | generation_id | evolves_from_species_id | evolution_chain_id | color_id | shape_id | habitat_id | gender_rate | capture_rate | base_happiness | is_baby | hatch_counter | has_gender_differences | growth_rate_id | forms_switchable | order | conquest_order |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
808 | NULL | 8 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
876 | NULL | 8 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
889 | NULL | 8 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
890 | NULL | 8 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
pokemon_species_names
(Only local_language_id=9 ; english)
Example Records
pokemon_species_id | local_language_id | name | genus |
---|---|---|---|
808 | 9 | Meltan | NULL |
876 | 9 | Indeedee | NULL |
889 | 9 | Zamazenta | NULL |
890 | 9 | Eternatus | NULL |
pokemon
(including alternative variations of same species)
Example Records
id | identifier | species_id | height | weight | base_experience | order | is_default |
---|---|---|---|---|---|---|---|
10166 | NULL | 876 | NULL | NULL | NULL | NULL | 1 |
10167 | NULL | 876 | NULL | NULL | NULL | NULL | 0 |
10181 | NULL | 889 | NULL | NULL | NULL | NULL | 0 |
10184 | NULL | 889 | NULL | NULL | NULL | NULL | 0 |
10214 | NULL | 890 | NULL | NULL | NULL | NULL | 1 |
10215 | NULL | 890 | NULL | NULL | NULL | NULL | 0 |
10264 | NULL | 808 | NULL | NULL | NULL | NULL | 1 |
pokemon_forms
(Exactly one new record for each new `pokemon` record, meaning no alternative forms that are purely aesthetic in nature are present. An earlier-gen example would be Burmy, with 1 record in `pokemon` and 3 corresponding records in `pokemon_forms`.)
Example Records
id | identifier | form_identifier | pokemon_id | introduced_in_version_group_id | is_default | is_battle_only | is_mega | form_order | order |
---|---|---|---|---|---|---|---|---|---|
10325 | NULL | NULL | 10166 | NULL | 1 | NULL | 0 | NULL | NULL |
10326 | NULL | NULL | 10167 | NULL | 1 | NULL | 0 | NULL | NULL |
10340 | NULL | NULL | 10181 | NULL | 1 | NULL | 0 | NULL | NULL |
10343 | NULL | NULL | 10184 | NULL | 1 | NULL | 0 | NULL | NULL |
10373 | NULL | NULL | 10214 | NULL | 1 | NULL | 0 | NULL | NULL |
10374 | NULL | NULL | 10215 | NULL | 1 | NULL | 0 | NULL | NULL |
10423 | NULL | NULL | 10264 | NULL | 1 | NULL | 0 | NULL | NULL |
pokemon_form_names
(Only local_language_id=9 ; english)
Example Records
pokemon_form_id | local_language_id | form_name | pokemon_name |
---|---|---|---|
10325 | 9 | Male | NULL |
10326 | 9 | Female | NULL |
10340 | 9 | Hero Of Many Battles | NULL |
10343 | 9 | Crowned Shield | NULL |
10374 | 9 | Eternamax | NULL |
pokedexes
(Didn't add to `regions`, therefore some NULLs)
ALL added records below
id | region_id | identifier | is_main_series |
---|---|---|---|
10026 | NULL | galar | 1 |
10027 | NULL | isle-of-armor | 1 |
10028 | 1 | kanto-expanded | 1 |
pokemon_dex_numbers
(Additions for pokedex_id = 1(National), 12(Kalos Central; see below), 10026-10028 (Gen8))
Example Records
species_id | pokedex_id | pokedex_number |
---|---|---|
808 | 1 | 808 |
808 | 10028 | 152 |
876 | 1 | 876 |
876 | 10026 | 337 |
889 | 1 | 889 |
889 | 10026 | 399 |
890 | 1 | 890 |
890 | 10026 | 400 |
Missing Kalos Central records, added below
species_id | pokedex_id | pokedex_number |
---|---|---|
719 | 12 | 151 |
720 | 12 | 152 |
721 | 12 | 153 |
version_groups
ALL added records below
id | identifier | generation_id | order |
---|---|---|---|
10001 | sword-shield | 8 | NULL |
10002 | lets-go-pikachu-eevee | 8 | NULL |
pokedex_version_groups
ALL added records below
pokedex_id | version_group_id |
---|---|
10026 | 10001 |
10027 | 10001 |
10028 | 10002 |
pokemon_types
Example Records
pokemon_id | type_id | slot |
---|---|---|
10166 | 14 | 1 |
10166 | 1 | 2 |
10167 | 14 | 1 |
10167 | 1 | 2 |
10181 | 2 | 1 |
10184 | 2 | 1 |
10184 | 9 | 2 |
10214 | 4 | 1 |
10214 | 16 | 2 |
10215 | 4 | 1 |
10215 | 16 | 2 |
10264 | 9 | 1 |
Additionally, I've scraped Bulbapedia data (complete dataset) for some things that have changed across generations, such as pokemon types, base stats, and exp gain, but this kind of per-gen data does not seem to be supported in the current schema. Am I correct in assuming this repo only intends to reflect this kind of info for whichever latest generation it supports, and exclude previous-gen data? If it were to include this info one day, would it follow the *-changelog format? Mine didn't comply with that format, but it could be converted with ease.
Examples:
CUSTOM_pokemon_stats_by_gen
pokemon_id | generation_id | stat_id | base_stat | effort |
---|---|---|---|---|
25 | 1 | 3 | 30 | NULL |
25 | 2 | 3 | 30 | NULL |
25 | 3 | 3 | 30 | 0 |
25 | 4 | 3 | 30 | 0 |
25 | 5 | 3 | 30 | 0 |
25 | 6 | 3 | 40 | 0 |
25 | 7 | 3 | 40 | 0 |
25 | 8 | 3 | 40 | 0 |
CUSTOM_pokemon_type_changes
pokemon_id | generation_id | from_type_id | to_type_id | slot |
---|---|---|---|---|
39 | 6 | NULL | 18 | 2 |
40 | 6 | NULL | 18 | 2 |
81 | 2 | NULL | 9 | 2 |
82 | 2 | NULL | 9 | 2 |
CUSTOM_pokemon_exp_gain_by_gen
pokemon_id | generation_id | exp |
---|---|---|
13 | 3 | 52 |
13 | 4 | 52 |
13 | 5 | 39 |
13 | 6 | 39 |
13 | 7 | 39 |
@RyanVereque I'm not sure exactly what you're asking. A couple of broad comments:
You should check out PokeAPI's fork. We won't be accepting data scraped from Bulbapedia here.
We're kind of trying to move away from changelog tables. For stat/type changes the solution is probably to add a generation column to the existing tables (or maybe a first/last generation range). There's an open issue for this; see https://github.com/veekun/pokedex/issues/107.
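As a rough illustration of the "first/last generation range" idea (not a proposal for the actual schema), the sketch below collapses per-generation rows like the `CUSTOM_pokemon_stats_by_gen` example into ranges. The column names `first_generation_id` and `last_generation_id` and the helper `toGenerationRanges` are assumptions made up for this example, and the `effort` column is left out for brevity.

```javascript
// Sketch: collapse per-generation rows into first/last generation ranges.
// Assumes all rows belong to one (pokemon_id, stat_id) pair and are
// sorted by generation_id. Output column names are hypothetical.
const toGenerationRanges = (rows) => {
  const ranges = []
  for (const row of rows) {
    const last = ranges[ranges.length - 1]
    if (last && last.base_stat === row.base_stat &&
        last.last_generation_id + 1 === row.generation_id) {
      // Same value as the previous generation: extend the current range.
      last.last_generation_id = row.generation_id
    } else {
      ranges.push({
        pokemon_id: row.pokemon_id,
        stat_id: row.stat_id,
        base_stat: row.base_stat,
        first_generation_id: row.generation_id,
        last_generation_id: row.generation_id
      })
    }
  }
  return ranges
}

// pokemon_id 25, stat_id 3, with the base_stat values from the
// CUSTOM_pokemon_stats_by_gen example earlier in the thread.
const perGeneration = [30, 30, 30, 30, 30, 40, 40, 40].map((base_stat, i) => ({
  pokemon_id: 25, stat_id: 3, generation_id: i + 1, base_stat
}))
const ranges = toGenerationRanges(perGeneration)
console.log(ranges)
```

With that input the eight per-generation rows reduce to two: one covering generations 1 to 5 with a base stat of 30, and one covering generations 6 to 8 with 40.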
Pokemon Sword and Shield are out, there are a bunch of new pokemon as a result
810-890 https://serebii.net/pokedex-swsh/