RaenonX closed this issue 1 year ago.
I suppose CI has some unique way to persist data between runs? Also, what is `result`? Do we need to compute RP on the same data set, or on some other data set that is useful for the DB?
```python
data, pokedex = get_pokemon_data()
uniqueId = digest(data)
fit = try_get_from_storage(uniqueId)
if fit is None:
    fit = fit_rp_model(data)
    try_push_to_storage(uniqueId, fit)
result = run_rp_model(fit, data)
return result
```
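The pseudocode above doesn't define `digest`, so this is just a sketch of one way it could produce the cache key: hash a canonical serialization of the data, so identical data between CI runs yields the same key and therefore a cache hit.

```python
import hashlib
import json

def digest(data) -> str:
    """Hypothetical content hash used as the cache key for a fit.

    Assumes `data` is JSON-serializable; sort_keys makes the hash
    independent of dict key ordering, so the key stays stable
    between CI runs as long as the data itself is unchanged.
    """
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Any stable serialization works here; JSON is only chosen because the Pokemon data is row-oriented.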
CI can persist data, but it's more like caching data. This is the sample piece that caches pip dependencies:
```yaml
- task: Cache@2
  displayName: 'Cache pip'
  inputs:
    key: 'pip | "$(Agent.OS)" | requirements.txt'
    restoreKeys: |
      pip | "$(Agent.OS)"
    path: $(PIP_CACHE_DIR)
```
So I guess in this case, `key` will be a constant value here, and `path` is the absolute path on the CI that stores the pickled file. `$(PIP_CACHE_DIR)` was `$(Pipeline.Workspace)/.pip`, so I'll have to investigate what `path` should be.
I am against storing in the repo root since the scraper will link this repo as a submodule, so it wouldn't be ideal for path control. Something like an environment variable or a parameter to pass in might be better.
I think the best would be dependency inversion.
```python
def get_rp_model_results(storage, options):
    data, pokedex = get_pokemon_data(options)
    uniqueId = digest(data)
    fit = storage.try_get(uniqueId)
    if fit is None:
        fit = fit_rp_model(data)
        storage.put(uniqueId, fit)
    result = run_rp_model(fit, data)
    return result
```
Agree, if `storage` here really just does file operations and passing in a directory path is easy.
### Desired Behavior

#### Global

#### Fitting

#### Running Model

Returns a `dict` or `DataFrame` with the following fields (this is what the website currently stores directly; it doesn't need to be exactly the same, but the values are needed):

- `dataCount`: Count of data in the Pokemon data. Sample value: `17`
- `ingredientSplit`: Ingredient rate. Sample value: `0.3384`
- `skillValue`: Skill value (skill % × skill value). Sample value: `58.94`
- `confidence`: A Python `dict` with the keys `ingredient` and `skill`. Sample value: `0.1022` (`conf (xxx)*` in the bootstrap one)
- `pokemonId`: Pokemon ID. Sample value: `25` (Pikachu)
- `skillPercent`: Only needed if there's a field for the model to consume skill value and calculate skill %. Sample value: `6.512`
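Assembling the sample values above into one record, the "Running Model" output for a single Pokemon could look like this; the surrounding `dict` shape is a sketch, not a fixed schema, and only the field names and sample values come from the list above:

```python
# One result record built from the sample values above; whether the API
# returns a list of these dicts or a DataFrame is still an open question.
result_record = {
    "pokemonId": 25,            # Pikachu
    "dataCount": 17,            # count of data points for this Pokemon
    "ingredientSplit": 0.3384,  # ingredient rate
    "skillValue": 58.94,        # skill % x skill value
    # The spec gives one sample value (0.1022); it is reused for both
    # keys here purely for illustration.
    "confidence": {"ingredient": 0.1022, "skill": 0.1022},
    "skillPercent": 6.512,      # only if the model derives skill % itself
}
```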
### API (Prototype)

```python
get_rp_model_results(
    pokemon_data_file_id: str,
    pokemon_data_sheet_id_mapping: dict[Literal["data", "pokedex"], str],
) -> DataFrame | dict
```

Output type is either `DataFrame` or `dict`; it shouldn't be both.

### Pseudo code draft
### (For context) Scraper running routine

Pokemon data gets entered almost every day, especially when a new Pokemon is released. The CI pipeline that has the scraper on it will run at these times:

- Manual trigger: likely run whenever a new mon gets released and enough data has been collected.

The API will be used in https://github.com/RaenonX-PokemonSleep/pokemon-sleep-scraper. Currently, the scraper will use the copy-pasted code from https://github.com/jeancroy/RP-fit/commit/78ff158e297496e2d7562250c45537eada9390cd until the API is done.