jeancroy / RP-fit

MIT License

Pokemon Info Website Scraper API #1

Closed RaenonX closed 10 months ago

RaenonX commented 10 months ago

Desired Behavior

Global

The manual trigger will likely be run whenever a new mon is released and enough data has been collected.


The API will be used in https://github.com/RaenonX-PokemonSleep/pokemon-sleep-scraper. Until the API is done, the scraper will use code copy-pasted from https://github.com/jeancroy/RP-fit/commit/78ff158e297496e2d7562250c45537eada9390cd.

jeancroy commented 10 months ago

I suppose CI has some way to persist data between runs? Also, what is the result: do we need to compute RP on the same set as data, or on some other dataset that is useful for the DB?

def get_rp_model_results():
    data, pokedex = get_pokemon_data()
    uniqueId = digest(data)

    # Reuse a stored fit when the data has not changed since the last run
    fit = try_get_from_storage(uniqueId)

    if fit is None:
        fit = fit_rp_model(data)
        try_push_to_storage(uniqueId, fit)

    result = run_rp_model(fit, data)
    return result

RaenonX commented 10 months ago

I suppose CI has some way to persist data between runs?

def get_rp_model_results():
    data, pokedex = get_pokemon_data()
    uniqueId = digest(data)

    # Reuse a stored fit when the data has not changed since the last run
    fit = try_get_from_storage(uniqueId)

    if fit is None:
        fit = fit_rp_model(data)
        try_push_to_storage(uniqueId, fit)

    result = run_rp_model(fit, data)
    return result

CI can persist data, but it's more like... caching data. Here is a sample snippet that caches pip dependencies:

- task: Cache@2
  displayName: 'Cache pip'
  inputs:
    key: 'pip | "$(Agent.OS)" | requirements.txt'
    restoreKeys: |
      pip | "$(Agent.OS)"
    path: $(PIP_CACHE_DIR)

So I guess in this case key will be a constant value, and path is an absolute path on the CI agent where the pickled file is stored. $(PIP_CACHE_DIR) was $(Pipeline.Workspace)/.pip, so I'll have to investigate what path should be.

I'm against storing it in the repo root: the scraper will link this repo as a submodule, so that wouldn't be ideal for path control. Something like an environment variable or a parameter to pass in might be better.
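
To make the "environment variable or a param" idea concrete, here is a minimal sketch of how the storage directory could be resolved. The function name `resolve_cache_dir` and the variable name `RP_FIT_CACHE_DIR` are hypothetical, chosen only for illustration; nothing in this thread fixes them.

```python
import os
from pathlib import Path
from typing import Optional, Union

# Hypothetical environment variable name, used here for illustration only.
CACHE_DIR_ENV_VAR = "RP_FIT_CACHE_DIR"


def resolve_cache_dir(cache_dir: Optional[Union[str, Path]] = None) -> Path:
    """Pick the directory for cached fits: explicit argument first, then the environment."""
    if cache_dir is None:
        cache_dir = os.environ.get(CACHE_DIR_ENV_VAR)
    if not cache_dir:
        raise ValueError(
            f"No cache directory given; pass one in or set {CACHE_DIR_ENV_VAR}."
        )
    # Resolve to an absolute path so it does not depend on the working directory
    path = Path(cache_dir).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    return path
```

With something like this, the scraper could pass its own path explicitly, while a CI job could set the variable to whatever directory its cache step restores.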

jeancroy commented 10 months ago

I think the best approach would be dependency inversion.

def get_rp_model_results(storage, options):
    data, pokedex = get_pokemon_data(options)
    uniqueId = digest(data)

    # The caller decides where and how fits are persisted
    fit = storage.try_get(uniqueId)

    if fit is None:
        fit = fit_rp_model(data)
        storage.put(uniqueId, fit)

    result = run_rp_model(fit, data)
    return result

RaenonX commented 10 months ago

I think the best approach would be dependency inversion.

def get_rp_model_results(storage, options):
    data, pokedex = get_pokemon_data(options)
    uniqueId = digest(data)

    # The caller decides where and how fits are persisted
    fit = storage.try_get(uniqueId)

    if fit is None:
        fit = fit_rp_model(data)
        storage.put(uniqueId, fit)

    result = run_rp_model(fit, data)
    return result

Agreed, as long as storage here just does file operations and passing in a directory path is easy.
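
For reference, a file-backed `storage` that satisfies the `try_get` / `put` interface above could look roughly like the sketch below. The class name `FileStorage`, the `.pickle` file naming, and this `digest` implementation are assumptions for illustration, not code from the repo; in particular, `digest` here assumes the data can be serialized to JSON, which may not hold for the real data object.

```python
import hashlib
import json
import pickle
from pathlib import Path
from typing import Any, Optional


def digest(data: Any) -> str:
    """Content hash of the data (assumes `data` is JSON-serializable)."""
    payload = json.dumps(data, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class FileStorage:
    """Hypothetical storage that keeps each fit as a pickled file in one directory."""

    def __init__(self, directory):
        self._directory = Path(directory)
        self._directory.mkdir(parents=True, exist_ok=True)

    def _path_for(self, unique_id: str) -> Path:
        return self._directory / f"{unique_id}.pickle"

    def try_get(self, unique_id: str) -> Optional[Any]:
        """Return the stored fit, or None if nothing is stored under this id."""
        path = self._path_for(unique_id)
        if not path.is_file():
            return None
        with path.open("rb") as f:
            return pickle.load(f)

    def put(self, unique_id: str, fit: Any) -> None:
        """Store the fit, writing to a temporary file first and then renaming it."""
        path = self._path_for(unique_id)
        tmp_path = path.with_suffix(".tmp")
        with tmp_path.open("wb") as f:
            pickle.dump(fit, f)
        tmp_path.replace(path)
```

The caller would then construct `FileStorage(some_directory)` and pass it as `storage`, so the directory stays entirely under the caller's control. Since `pickle.load` can execute arbitrary code, this only makes sense for files the pipeline itself wrote.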