RaenonX closed this issue 1 year ago.
I suppose CI has some unique way to persist data between runs? Also, what is `result`? Do we need to compute RP on the same data set, or on some other data set that is useful for the DB?
```python
data, pokedex = get_pokemon_data()
uniqueId = digest(data)
fit = try_get_from_storage(uniqueId)
if fit is None:
    fit = fit_rp_model(data)
    try_push_to_storage(uniqueId, fit)
result = run_rp_model(fit, data)
return result
```
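The pseudocode above doesn't define `digest`, so this is just a sketch of one way it could produce the cache key: hash a canonical serialization of the data, so identical data between CI runs yields the same key and therefore a cache hit.

```python
import hashlib
import json

def digest(data) -> str:
    """Hypothetical content hash used as the cache key for a fit.

    Assumes `data` is JSON-serializable; sort_keys makes the hash
    independent of dict key ordering, so the key stays stable
    between CI runs as long as the data itself is unchanged.
    """
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Any stable serialization works here; JSON is only chosen because the Pokemon data is row-oriented.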
CI can persist data, but it's more like caching data. This is the sample piece that caches pip dependencies:
```yaml
- task: Cache@2
  displayName: 'Cache pip'
  inputs:
    key: 'pip | "$(Agent.OS)" | requirements.txt'
    restoreKeys: |
      pip | "$(Agent.OS)"
    path: $(PIP_CACHE_DIR)
```
So I guess in this case, `key` will be a constant value here, and `path` is the absolute path on the CI that stores the pickled file. `$(PIP_CACHE_DIR)` was `$(Pipeline.Workspace)/.pip`, so I'll have to investigate what `path` should be.
I am against storing in the repo root since the scraper will link this repo as a submodule, so it wouldn't be ideal for path control. Something like an environment variable or a parameter to pass in might be better.
I think the best would be dependency inversion.
```python
def get_rp_model_results(storage, options):
    data, pokedex = get_pokemon_data(options)
    uniqueId = digest(data)
    fit = storage.try_get(uniqueId)
    if fit is None:
        fit = fit_rp_model(data)
        storage.put(uniqueId, fit)
    result = run_rp_model(fit, data)
    return result
```
Agree, if `storage` here really just does file operations and passing in a directory path is easy.
### Desired Behavior

#### Global

#### Fitting

#### Running Model

Returns a `dict` or `DataFrame` with the following fields (this is what the website currently stores directly; it doesn't need to be exactly the same, but the values are needed):

- `dataCount`: Count of data in the Pokemon data. Sample value: `17`
- `ingredientSplit`: Ingredient rate. Sample value: `0.3384`
- `skillValue`: Skill value (skill % × skill value). Sample value: `58.94`
- `confidence`: A Python `dict` with the keys `ingredient` and `skill`. Sample value: `0.1022` (`conf (xxx)*` in the bootstrap one)
- `pokemonId`: Pokemon ID. Sample value: `25` (Pikachu)
- `skillPercent`: Only needed if there's a field for the model to consume skill value and calculate skill %. Sample value: `6.512`
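Assembling the sample values above into one record, the "Running Model" output for a single Pokemon could look like this; the surrounding `dict` shape is a sketch, not a fixed schema, and only the field names and sample values come from the list above:

```python
# One result record built from the sample values above; whether the API
# returns a list of these dicts or a DataFrame is still an open question.
result_record = {
    "pokemonId": 25,            # Pikachu
    "dataCount": 17,            # count of data points for this Pokemon
    "ingredientSplit": 0.3384,  # ingredient rate
    "skillValue": 58.94,        # skill % x skill value
    # The spec gives one sample value (0.1022); it is reused for both
    # keys here purely for illustration.
    "confidence": {"ingredient": 0.1022, "skill": 0.1022},
    "skillPercent": 6.512,      # only if the model derives skill % itself
}
```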
### API (Prototype)

```python
get_rp_model_results(
    pokemon_data_file_id: str,
    pokemon_data_sheet_id_mapping: dict[Literal["data", "pokedex"], str],
) -> DataFrame | dict
```

Output type is either `DataFrame` or `dict`; it shouldn't be both.

### Pseudo code draft
### (For context) Scraper running routine

Pokemon data gets entered almost every day, especially when a new Pokemon is released. The CI pipeline that has the scraper on it will run at these times:

- Manual trigger: likely run whenever a new mon gets released and enough data has been collected.

The API will be used in https://github.com/RaenonX-PokemonSleep/pokemon-sleep-scraper. Currently, the scraper will use the copy-pasted code from https://github.com/jeancroy/RP-fit/commit/78ff158e297496e2d7562250c45537eada9390cd until the API is done.