veekun / pokedex

more than you ever wanted to know about Pokémon
MIT License

Gen 8 pokemon #284

Closed devinmatte closed 3 years ago

devinmatte commented 4 years ago

Pokemon Sword and Shield are out, there are a bunch of new pokemon as a result

810-890 https://serebii.net/pokedex-swsh/

thosakwe commented 4 years ago

I'd love to help with some of the data entry for this, and honestly if you need help maintaining this, @eevee, I'd love to help with that, too.

naddeoa commented 4 years ago

How does this typically get updated? How do people actually get a hold of structured data about a new pokemon generation? Is it all manual?

NoelDavies commented 4 years ago

I've been extracting data and have a first version so far: https://gist.github.com/NoelDavies/a7e8e3d959a0619362b291bb10a8fd8d

(It's not in the veekun pokedex format because I found that the data from SwSh doesn't align with the veekun format)

chamander commented 4 years ago

@NoelDavies.

It looks like nice progress! How is it that you're extracting that data from the Switch games?

Is there any way anyone can help with that as well?

arirawr commented 4 years ago

+1, let me know how to help if needed

naddeoa commented 4 years ago

I found a JSON dump on a gist (https://gist.github.com/NoelDavies/a7e8e3d959a0619362b291bb10a8fd8d) but I'm not sure where it came from. It has a few issues with conflicting national dex numbers that I had to filter out, but it might be a good start.

On Thu, Nov 28, 2019 at 2:08 AM Ari V notifications@github.com wrote:

+1, let me know how to help if needed


simoniz0r commented 4 years ago

bulbapedia (and pretty much all other MediaWiki based sites) has an API that you could use to collect data relatively easily. For example, https://bulbapedia.bulbagarden.net/w/api.php?action=parse&page=Grookey+(Pok%C3%A9mon)&prop=wikitext&format=json
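
A minimal sketch of building that request URL, assuming only the query parameters visible in the example link above (`action=parse`, `page`, `prop=wikitext`, `format=json`). The `buildParseUrl` helper name is hypothetical, and the sketch only constructs the URL; it does not send a request.

```javascript
// Hypothetical helper: builds a MediaWiki "parse" API URL for one page.
// The endpoint and parameters are copied from the example link above.
const API_ENDPOINT = 'https://bulbapedia.bulbagarden.net/w/api.php'

const buildParseUrl = (pageTitle) => {
  const url = new URL(API_ENDPOINT)
  // searchParams takes care of escaping spaces, parentheses and "é"
  url.searchParams.set('action', 'parse')
  url.searchParams.set('page', pageTitle)
  url.searchParams.set('prop', 'wikitext')
  url.searchParams.set('format', 'json')
  return url.toString()
}

console.log(buildParseUrl('Grookey (Pokémon)'))
```

Letting `URL` and `searchParams` do the escaping avoids hand-encoding titles that contain non-ASCII characters.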

magical commented 4 years ago

We're definitely not going to source our data from Bulbapedia (or any other wiki).

simoniz0r commented 4 years ago

Why? Their info is accurate and actually filled out...

Honestly, without latest gen info, this entire resource becomes rather useless compared to something like Bulbapedia which actually has info. I completely gave up trying to use the pokeapi which uses this as a backend for anything as it's just not worth the time if it doesn't have the latest info.

It's been months since Sword & Shield were released, and there's been zero progress here... would be nice to have something happen.

sfyfedotcom commented 4 years ago

If it helps anyone, here is json and sprites for the 8th gen Pokemon. The json is just id, name and type names, as that's all I needed. The sprites are also a bit small, as that's all I could source easily.

8th-gen-pokemon.zip

Also here's the Node script I used to scrape it from Bulbapedia:

const cheerio = require('cheerio')
const https = require('https')

// Fetch a URL and resolve with the response body as a string.
// Pass binary = true to read the body with 'binary' (latin1) encoding.
const get = (url, binary) => new Promise((resolve, reject) => {
  https
    .get(url, resp => {
      if (binary) {
        resp.setEncoding('binary')
      }
      let data = ''
      resp.on('data', (chunk) => {
        data += chunk
      })
      resp.on('end', () => {
        resolve(data)
      })
    })
    // Reject so callers can catch the failure; a throw here would escape the promise
    .on('error', reject)
})

// Scrape id, image, name and types for every Gen 8 pokemon from the index list.
const get8thGenPokemon = async () => {
  // encodeURI escapes the "é" in the page title
  const html = await get(encodeURI([
    'https://bulbapedia.bulbagarden.net',
    '/wiki/List_of_Pokémon_by_index_number_(Generation_VIII)'
  ].join('')))
  const $ = cheerio.load(html)
  const pokemon = {}
  $('#mw-content-text table.roundy tr').each((i, row) => {
    const cells = $(row).find('td')
    if (!cells.length) {
      // Header rows have no <td> cells
      return
    }
    const id = $(cells[1]).text().trim()
    // Compare numerically, ignoring any non-digit prefix such as "#"
    const idNumber = Number.parseInt(id.replace(/\D/g, ''), 10)
    if (Number.isNaN(idNumber) || idNumber < 810) {
      // 810 Grookey is the first Gen 8 pokemon
      return
    }
    // TODO get higher res images
    const image = $(cells[2]).find('img').attr('src')
    const name = $(cells[3]).text().trim()
    // Every cell after the name is a type; skip duplicates
    const types = []
    cells.each((j, cell) => {
      if (j < 4) {
        return
      }
      const type = $(cell).text().toLowerCase().trim()
      if (types.includes(type)) {
        return
      }
      types.push(type)
    })
    pokemon[id] = {
      image,
      name,
      types
    }
  })
  return pokemon
}

module.exports = {
  get,
  get8thGenPokemon
}

route1rodent commented 4 years ago

I have compiled all the data from the original dumps by @kwsch and @Kaphotics: https://github.com/route1rodent/swordshield-data @eevee @magical, if you are interested

It might be easy since the metagame has not changed drastically, we almost have all tables needed (excepting deprecated and dyna/g-max moves)

dumps: https://github.com/route1rodent/swordshield-data/tree/master/data/raw

my repo has a script to convert some dumps to JSON, but not all of them (I haven't had enough time). many of them require different parsing techniques.

kgsbowtie commented 4 years ago

@route1rodent so in theory what other work needs to be done to take your stuff and incorporate it into here?

Sorry for the newbie question, but I’ve been trying to figure out how this project is structured for the better part of a month, but I don’t quite understand what format everything needs to be in to get it working. I have the time to do it, but I need some help getting up to speed on what needs to be done.

magical commented 4 years ago

@route1rodent nice! that'd work for me as a trustworthy source, if someone wanted to figure out how to import the data into veekun.

skylar32 commented 4 years ago

i recall that gen 7 data was loaded into the database as yaml. i’ve converted quite a lot of the data from @route1rodent’s repository to yaml—would that be of any use? i’m not sure what the gen 7 yamls looked like.

magical commented 4 years ago

@kgsbowtie There's no set process, really, but generally what we've done is write scripts that read the data (usually from a ROM) and either output sql statements to a file or talk directly to the database to add the data. There's some documentation of the tables at https://veekun.github.io/pokedex/main-tables.html and some simple sample scripts in https://github.com/veekun/pokedex/tree/master/scripts.
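
A minimal sketch of the "output sql statements" approach described above, not the project's actual tooling. The `sqlLiteral` and `toInsert` helpers are hypothetical, and the `pokemon_stats` table name is an assumption; the column names and the sample row are taken from the Pikachu stats example later in this thread.

```javascript
// Quote a value as a SQL literal: numbers as-is, null as NULL,
// strings single-quoted with embedded quotes doubled.
const sqlLiteral = (value) => {
  if (value === null || value === undefined) return 'NULL'
  if (typeof value === 'number') return String(value)
  return `'${String(value).replace(/'/g, "''")}'`
}

// Build one INSERT statement from a row object (hypothetical helper).
const toInsert = (table, row) => {
  const columns = Object.keys(row)
  const values = columns.map((column) => sqlLiteral(row[column]))
  return `INSERT INTO ${table} (${columns.join(', ')}) VALUES (${values.join(', ')});`
}

// Sample row; "pokemon_stats" is an assumed table name.
const rows = [
  { pokemon_id: 25, stat_id: 3, base_stat: 40, effort: 0 }
]

console.log(rows.map((row) => toInsert('pokemon_stats', row)).join('\n'))
```

A real script would write the statements to a file and take its rows from the extracted game data rather than from a literal array.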

magical commented 4 years ago

@kyeugh The yaml stuff was experimental and unfinished. It was never merged into master; i don't know if it still works. If you want to fiddle around with it, the extraction code (rom -> yaml) is here https://github.com/veekun/pokedex/tree/yaml/pokedex/extract and the importing code (yaml -> db) is here https://github.com/veekun/pokedex/blob/yaml/scripts/sumo-yaml-to-db.py.

As you alluded to, it's the structure of the data that matters, not the format (YAML is a superset(ish) of JSON after all), so a simple mechanical conversion of @route1rodent's JSON to YAML doesn't really help.

route1rodent commented 4 years ago

in my experience working with yaml can be slow when you have large data sets (e.g. pokemon_moves table or encounters). I don't know what's the state of the YAML branch, I've seen it for a while already and I think it's outdated. IMHO I'd rather stick to current CSV format or JSON files with one object per line.

That way either with CSV or with that optimized JSON format, your code (e.g. python script to fill tables with data) can read big files line by line, process the line and next, instead of loading a big chunk of data into memory, which is not efficient.
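
A rough sketch of that line-by-line approach in Node, assuming a file with one JSON object per line. The `parseLine` and `processFile` names, and the file name in the usage comment, are hypothetical.

```javascript
const fs = require('fs')
const readline = require('readline')

// Parse one line of a "one JSON object per line" file; blank lines yield null.
const parseLine = (line) => {
  const trimmed = line.trim()
  return trimmed === '' ? null : JSON.parse(trimmed)
}

// Stream the file and hand each record to onRecord one line at a time,
// so the whole file is never held in memory. Resolves with the record count.
const processFile = async (path, onRecord) => {
  const lines = readline.createInterface({
    input: fs.createReadStream(path, { encoding: 'utf8' }),
    crlfDelay: Infinity
  })
  let count = 0
  for await (const line of lines) {
    const record = parseLine(line)
    if (record !== null) {
      onRecord(record)
      count += 1
    }
  }
  return count
}

// Usage (hypothetical file name):
// processFile('pokemon_moves.jsonl', (row) => console.log(row.pokemon_id))
```

The same loop works for CSV if `parseLine` is swapped for a CSV row parser.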

These are my sources for the data dumps:

https://pastebin.com/u/Kaphotics https://pastebin.com/u/SciresM

Kaphotics and SciresM (same nicks on Twitter) are well known dataminers. They basically datamined all current and past games of the last gens. I trust their data, but it's sometimes tricky to parse.

They use the pkNX project for the data dumps.

Additional note: It turns out that Pokemon HOME contains TM/TR data of all Pokemon https://twitter.com/SciresM/status/1228085803093413888 :)

route1rodent commented 4 years ago

@kgsbowtie a script is already done in my project to dump the Pokemon data into JSON. That same script can be reused / modified to dump it to CSV in a way that veekun/pokedex can understand it.

Then we will have to write parsers for all other data and do the same, because I haven't yet written any script to parse the rest of the raw data.

justingolden21 commented 4 years ago

Why? Their info is accurate and actually filled out...

Honestly, without latest gen info, this entire resource becomes rather useless compared to something like Bulbapedia which actually has info. I completely gave up trying to use the pokeapi which uses this as a backend for anything as it's just not worth the time if it doesn't have the latest info.

It's been months since Sword & Shield were released, and there's been zero progress here... would be nice to have something happen.

Completely agree. Wondering if there's any progress on this, as I (like many, many others) am dependent on PokeAPI, which uses veekun, and most of the 17,000,000 API calls every month are from users who want to use the most recent games. Said games have been out for many months now, and there is no short supply of people willing to help. I'm just wondering what the hold-up is.

ketsuban commented 4 years ago

I can't and won't speak for the maintainers, but judging from things like this blog post by Eevee I think the holdup is they don't want more data dumps falling from the sky, whether they come from a reliable source like Kaphotics/SciresM or a free-for-all hellfest like Bulbapedia - they want a pipeline where vendored scripts extract the data directly from primary sources, such that anyone who wants to can rerun the extraction process with their own copies of the game and verify that the data is correct.

magical commented 4 years ago

@ketsuban, @justingolden21

As eevee wrote in her blog post,

Part of it is that the team has never been very big, and all of us have either drifted away or gotten tied up in other things.

That's the real reason for the delay, moreso than any technical hurdle.

magical commented 4 years ago

there is no short supply of people willing to help

If y'all are serious about wanting to help, please join irc and start asking questions. It's lonely in here :(

(Also we have a discord now i guess?? Join IRC or message eevee for an invite.)

jnsnow commented 4 years ago

I'd love to get a code tour if someone had time to bootstrap me. I tried using the repo before but wasn't too successful with the tooling, but the CSVs have been really helpful in the past.

takeshi-codes commented 4 years ago

is there anything I can do to help get this feature added?

route1rodent commented 4 years ago

Gen 8 update should happen IMHO, with or without game data dumps (using Bulbapedia or Serebii as source).

Requiring ROM data dump parsing as a rule keeps people away from contributing to this project, so I'd rather have this project as an open DB where everyone is free to contribute the missing pieces.

That requires of course other people to review the correctness of the data.

Since there is no clear direction for this project anymore, I'd rather go in that direction.

route1rodent commented 4 years ago

@ctrlaltdylang I'd say people should start updating the CSV files for gen8 in a Pull Request, otherwise this will be delayed indefinitely.

We can join gen-8 related Pull Requests in a common branch different than master, to do a big release at the end.

takeshi-codes commented 4 years ago

@route1rodent I should have some downtime over the next couple weeks, I'll try to get a PR up by the end of the month. I plan on using the pokedex here to support a project I'm working on.

takeshi-codes commented 4 years ago

@route1rodent is there a good PR I can reference to make sure I'm adding everything correctly?

jnsnow commented 4 years ago

I'm just a random yokel, but I think it's very important NOT to source anything on Bulbapedia. Anything that is crowdsourced or informally gathered is unverifiable and shouldn't be codified as truth.

route1rodent commented 4 years ago

@jnsnow I understand, but it's better than having nothing, and you can always verify using multiple sources. Once the scripts for importing game data directly are ready then we can use those instead.

route1rodent commented 4 years ago

@ctrlaltdylang I don't know, for this huge update editing CSV directly can be tricky and more prone to errors. easiest way is to dump the CSV into a sqlite DB using the CLI executable, do your changes in the DB with a Sqlite SQL client and dump that back to the CSV files.

MusicDev33 commented 4 years ago

I think just editing CSV would probably be the way to go, I don't think anyone's going to dump a PR here where they've edited all of the CSV, I imagine it'll mostly be small(ish) changes, which can be reviewed. Or maybe that can be a rule. Just my two cents here, I'm very pro-edit the CSV directly in the interest of getting Gen 8 here out as quick as possible. Especially when a new DLC is 4 days away and another a couple more months out (new moves and new Pokemon to tackle)

NoelDavies commented 4 years ago

This is the most accurate and up to date thing I've managed to come across for Gen 8. https://gist.github.com/NoelDavies/36be9a000d55e9f0a6e8ec509da2fc90 Hope this helps somebody (I started working on a node script to parse, and format this into a JSON object - just haven't finished it yet) - The issue we have is multiple forms (I've not implemented this into Pokélink yet) but they're all there

oliver-ni commented 4 years ago

I'd be happy to help work on this- not sure where to start though.

allowthisfam commented 4 years ago

Bump. Any updates?

magical commented 3 years ago

Status update: I spent some time at the end of June seriously looking into this. It's been a while, so some of that time was spent figuring out Switch hacking tools and generally derusting my knowledge, but I got some preliminary data (level-up moves and egg moves) dumped from Sword. No ETA, but Gen 8 data will definitely happen sooner or later.

If you want to assist with the ripping process, feel free to join the veekun discord. I could use some help figuring out how to apply the 1.2.0 update of Sw/Sh. Also, just having other people to talk to who are interested in datamining would help with motivation.

That said, my top priority right now is rescuing the website from complete and total decay (cf https://github.com/veekun/spline-pokedex/issues/115), and ripping Gen 8 data is currently taking a second seat to that.

In the meantime, PokeAPI has created a temporary pokedex fork with the intent of adding data scraped from Bulbapedia and other sources. If you just want the data and can't wait, maybe check that out.

justingolden21 commented 3 years ago

That's awesome to hear. Thanks for all the hard work, I'm sure many people appreciate it : )

NoelDavies commented 3 years ago

@magical - We at Pokelink would love to help out if we can :)

RyanVereque commented 3 years ago

Foreword: New Git user here by the way, looking for some advice on how to contribute to this issue. I'm aware this repo relies on automated collection from game data first, automated collection from reputable data mining projects second, and, well, whatever I did not so much. I'm aware I can make my own fork, but I'm interested to know whether any of this can be pulled by this project, or the PokeAPI fork. Or perhaps some other alternative I'm not thinking of?

I've managed to scrape data (automated via a Node project) from Bulbapedia for some tables for the veekun database. The new data primarily comes from gen 8, but not exclusively (any generation-specific data I added that would conflict with other generations would exist in a table that has an existing gen/version/region column, so any previous-gen data can be filtered out).

The data mostly consists of (Gen8) pokemon species, variety (alternative variations included), form, their names (english), dex values, pokemon base stats, types. No moves, abilities, encounters etc.

This data can of course be dumped into changes for the csv files. If anyone is interested in me making a fork with changes to the csv files, and/or my scraping project code, let me know.

The new gathered data pertains to the following tables and columns in the database:



Additionally, I've scraped Bulbapedia data (complete dataset) for some things that have changed across generations, such as pokemon types, base stats, and exp gain, but this kind of per-gen data does not seem to be supported in the current schema. Am I correct in assuming this repo only intends to reflect this kind of info for whichever latest generation it supports, and exclude previous-gen data? If it were to include this info one day, would it follow the *-changelog format? Mine didn't comply with that format, but it could be converted with ease.

Examples:

CUSTOM_pokemon_stats_by_gen

pokemon_id  generation_id  stat_id  base_stat  effort
25          1              3        30         NULL
25          2              3        30         NULL
25          3              3        30         0
25          4              3        30         0
25          5              3        30         0
25          6              3        40         0
25          7              3        40         0
25          8              3        40         0

CUSTOM_pokemon_type_changes

pokemon_id  generation_id  from_type_id  to_type_id  slot
39          6              NULL          18          2
40          6              NULL          18          2
81          2              NULL          9           2
82          2              NULL          9           2

CUSTOM_pokemon_exp_gain_by_gen

pokemon_id  generation_id  exp
13          3              52
13          4              52
13          5              39
13          6              39
13          7              39

magical commented 3 years ago

@RyanVereque I'm not sure exactly what you're asking. A couple of broad comments:

  1. You should check out PokeAPI's fork. We won't be accepting data scraped from Bulbapedia here.

  2. We're kind of trying to move away from changelog tables. For stat/type changes the solution is probably to add a generation column to the existing tables (or maybe a first/last generation range). There's an open issue for this; see https://github.com/veekun/pokedex/issues/107.
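
A minimal sketch of the first/last generation range idea, assuming hypothetical `first_generation_id` and `last_generation_id` columns; nothing here reflects an actual schema decision. The input rows are the Pikachu values from the CUSTOM_pokemon_stats_by_gen example earlier in the thread, with the `effort` column left out for brevity.

```javascript
// Per-generation rows for pokemon 25, stat 3: 30 in gens 1-5, 40 in gens 6-8.
const perGenRows = [30, 30, 30, 30, 30, 40, 40, 40].map((base_stat, i) => ({
  pokemon_id: 25,
  generation_id: i + 1,
  stat_id: 3,
  base_stat
}))

// Collapse consecutive generations with the same value into one range row.
const toRanges = (rows) => {
  const sorted = [...rows].sort((a, b) =>
    a.pokemon_id - b.pokemon_id ||
    a.stat_id - b.stat_id ||
    a.generation_id - b.generation_id)
  const ranges = []
  for (const row of sorted) {
    const last = ranges[ranges.length - 1]
    const extendsLast = last !== undefined &&
      last.pokemon_id === row.pokemon_id &&
      last.stat_id === row.stat_id &&
      last.base_stat === row.base_stat &&
      last.last_generation_id + 1 === row.generation_id
    if (extendsLast) {
      last.last_generation_id = row.generation_id
    } else {
      ranges.push({
        pokemon_id: row.pokemon_id,
        stat_id: row.stat_id,
        base_stat: row.base_stat,
        first_generation_id: row.generation_id,
        last_generation_id: row.generation_id
      })
    }
  }
  return ranges
}

// Look up the stat value that applies in a given generation.
const statIn = (ranges, pokemonId, statId, generationId) => {
  const match = ranges.find((r) =>
    r.pokemon_id === pokemonId &&
    r.stat_id === statId &&
    r.first_generation_id <= generationId &&
    generationId <= r.last_generation_id)
  return match ? match.base_stat : undefined
}

const ranges = toRanges(perGenRows)
console.log(ranges)
console.log(statIn(ranges, 25, 3, 5), statIn(ranges, 25, 3, 6))
```

With ranges, the eight per-generation rows shrink to two, and a lookup for any generation is a single range match rather than a replay of a changelog.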