RaenonX opened this issue 1 year ago
My first pass idea about something like this was in https://github.com/jeancroy/RP-fit/blob/main/rp_model/utils/files.py#L36
Basically, use the fact that they all follow the same naming pattern except for the hash part, then use filesystem metadata to order them and keep the top K.
If everything is copied to a worker for CI, I can see filesystem metadata not being the best.
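For reference, a minimal sketch of that keep-top-K idea (the `results/*.pickle` pattern and the K value are illustrative, not necessarily what files.py actually does):

```python
import os
from glob import glob


def keep_top_k(pattern: str = "results/*.pickle", k: int = 3) -> None:
    """Keep only the K most recently modified files matching `pattern`, delete the rest."""
    files = sorted(glob(pattern), key=os.path.getmtime, reverse=True)
    for stale in files[k:]:
        os.remove(stale)
```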
An alternative design is a single file, always with the same name, but the pickled data is wrapped with metadata about when the result is valid (i.e., a hash of the data and maybe the date of the last fit).
The date in the filename tells us which are the old results to delete, but not whether the new one is still valid. The whole idea behind download_sheet was that working with the gsheet API and its auth was a bit of a pain.
> My first pass idea about something like this was in https://github.com/jeancroy/RP-fit/blob/main/rp_model/utils/files.py#L36
> Basically, use the fact that they all follow the same naming pattern except for the hash part, then use filesystem metadata to order them and keep the top K.
> If everything is copied to a worker for CI, I can see filesystem metadata not being the best.
> An alternative design is a single file, always with the same name, but the pickled data is wrapped with metadata about when the result is valid (i.e., a hash of the data and maybe the date of the last fit).
Everything likely gets copied, and I am not sure if the file metadata gets copied over with it. I would go against using file metadata, as it tends to be more error-prone.
If it's a single file with the same name but wrapped with metadata, that should do the job (and I actually thought about that).
> The date in the filename tells us which are the old results to delete, but not whether the new one is still valid. The whole idea behind download_sheet was that working with the gsheet API and its auth was a bit of a pain.
Could you elaborate on the auth thing? At least from what I can see, it only uses a certain URL to export the data as CSV; I'm not seeing the auth part. Also, I'm not really getting the "but not if the new is still valid" part, do you have an example?
I think I missed some words in that sentence, sorry about that.
Basically, last time I looked at the Google Sheets API, it needed a per-document setup of a dev key, and I didn't have access to the sheet. So I just used the dev console (F12) and checked the network data when you download the sheet as CSV.
The goal of the hash is to invalidate the previous fit if the source data changed. But sometimes the source data will not change, and I wanted to not redo the fit. And it was hard to detect changes in the source data without a hash.
For example, I don't really see a last-modified date here: https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/sheets#SheetProperties
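For context, a minimal sketch of the export-as-CSV trick (assuming the sheet is viewable by link; `sheet_id` and `gid` are placeholders, and this is not necessarily how download_sheet is implemented in the repo):

```python
import pandas as pd


def download_sheet_as_csv(sheet_id: str, gid: str) -> pd.DataFrame:
    """Download one tab of a Google Sheet via the CSV export URL, no API key or OAuth needed."""
    url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid={gid}"
    return pd.read_csv(url)
```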
> If it's a single file with the same name but wrapped with metadata, that should do the job (and I actually thought about that).
Yeah, I should probably go with some kind of `ConditionallyValidData` or `HashValidatedData` abstraction with its own save and load. Then there's no difference between the pickle not existing and the pickle existing with invalid data.
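A minimal sketch of what such a `HashValidatedData` wrapper could look like (field and method names here are assumptions, not what was pushed to dev):

```python
import pickle
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional


@dataclass
class HashValidatedData:
    """Pickled payload wrapped with the source-data hash it was computed from."""
    data_hash: str
    fitted_at: datetime
    payload: Any

    def save(self, path: str) -> None:
        with open(path, "wb") as handle:
            pickle.dump(self, handle)

    @staticmethod
    def load_if_valid(path: str, current_hash: str) -> Optional["HashValidatedData"]:
        """Return the cached result only if it was fitted against `current_hash`."""
        try:
            with open(path, "rb") as handle:
                cached = pickle.load(handle)
        except FileNotFoundError:
            return None  # "pickle doesn't exist" and "pickle has invalid data" behave the same
        return cached if cached.data_hash == current_hash else None
```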
I kinda like the parallel with memoization, and in that context multiple saved results are not bad. But that's not really how we use those now.
> I think I missed some words in that sentence, sorry about that.
> Basically, last time I looked at the Google Sheets API, it needed a per-document setup of a dev key, and I didn't have access to the sheet. So I just used the dev console (F12) and checked the network data when you download the sheet as CSV.
> The goal of the hash is to invalidate the previous fit if the source data changed. But sometimes the source data will not change, and I wanted to not redo the fit. And it was hard to detect changes in the source data without a hash.
> For example, I don't really see a last-modified date here: https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/sheets#SheetProperties
Seems like it needs the Drive API. Personally I feel hashing should be enough; we can just include the timestamps in whatever we're looking at.
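For example, hashing the raw downloaded CSV bytes would be enough to detect any change in the source data (a sketch; sha256 is an arbitrary choice here):

```python
import hashlib


def source_data_hash(csv_bytes: bytes) -> str:
    """Any change in the downloaded sheet content changes this digest and invalidates the cache."""
    return hashlib.sha256(csv_bytes).hexdigest()
```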
> I kinda like the parallel with memoization, and in that context multiple saved results are not bad. But that's not really how we use those now.
I'd say given the folder structure below:
```
files/
  1700068475-deadbeef/
    pokemon.pickle
  1699988888-abcdefaa/
    pokemon.pickle
```
With the Python pseudo code below:
```python
import os
from dataclasses import dataclass
from datetime import datetime

# TODO: Implement custom sorting on `DirectoryMeta`, maybe start with `sort_index`
# https://blog.logrocket.com/understanding-python-dataclasses/
@dataclass
class DirectoryMeta:
    created_at: datetime
    hash_: str

    @staticmethod
    def parse_from_name(name: str) -> "DirectoryMeta":
        # Directory names follow the `<unix timestamp>-<hash>` pattern, e.g. `1700068475-deadbeef`
        created_at_str, hash_ = name.split("-", 1)
        return DirectoryMeta(created_at=datetime.fromtimestamp(int(created_at_str)), hash_=hash_)

# Newest directories first; `hash_of_the_current` is the hash of the current source data
sorted_directory_meta = sorted(
    (DirectoryMeta.parse_from_name(directory) for directory in os.listdir("./files")),
    key=lambda meta: meta.created_at,
    reverse=True,
)

for meta in sorted_directory_meta:
    if meta.hash_ == hash_of_the_current:
        break  # Terminates because the current hash matches
    ...  # Do whatever is needed if the current hash doesn't match
```
Both memoization and parallel results are achieved. With some tweaks, wrapping the above implementation in a function and having an options class could also allow overriding the result we're looking for, if needed; a rough sketch of that is below.
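This reuses `DirectoryMeta` from the pseudocode above; `CacheLookupOptions` and its fields are hypothetical names, not an existing API:

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheLookupOptions:
    root: str = "./files"
    override_hash: Optional[str] = None  # When set, look for this hash instead of the current one


def find_cached_result(current_hash: str, options: Optional[CacheLookupOptions] = None) -> Optional[DirectoryMeta]:
    """Return the newest cached entry whose hash matches, or None if the fit needs to be redone."""
    options = options or CacheLookupOptions()
    wanted = options.override_hash or current_hash
    metas = sorted(
        (DirectoryMeta.parse_from_name(name) for name in os.listdir(options.root)),
        key=lambda meta: meta.created_at,
        reverse=True,
    )
    return next((meta for meta in metas if meta.hash_ == wanted), None)
```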
Ok, I pushed something to dev that explores the wrapper idea. You can tell me what you think of it.
I agree there could be something like making the settings part of the path, and then for a given set of settings, data changes are handled by the store. But for now I don't feel we use parallel results enough to warrant that effort.
Btw, as-is I can't run api.py with the `if __name__ == "__main__"` way. I get `ImportError: attempted relative import with no known parent package`. I can remove the `.` in front of `.rp_model` and that fixes it for me, but I'm sure it breaks it for you, so I've put it back.
Tested and I'm getting this; I think the path passed in has to be converted to an absolute path somehow.
```
RP model file path: C:\Users\RaenonX\Documents\Projects\PokemonSleep\pokemon-sleep-scraper\data\rp_model
RP model pickle hash mismatch, generating new file...
   Iteration     Total nfev        Cost      Cost reduction    Step norm     Optimality
       0              1         1.5649e+04                                    1.18e+07
       1              3         8.3648e+03      7.28e+03       2.17e+00       1.61e+06
       2              7         6.2849e+03      2.08e+03       3.39e-02       1.34e+06
       3              8         5.5675e+03      7.17e+02       3.39e-02       5.03e+05
       4              9         5.3717e+03      1.96e+02       3.39e-02       3.15e+03
       5             10         5.1534e+03      2.18e+02       6.77e-02       3.14e+03
       6             11         4.7244e+03      4.29e+02       1.35e-01       3.14e+03
       7             12         3.8694e+03      8.55e+02       2.71e-01       3.14e+03
       8             13         2.1612e+03      1.71e+03       5.42e-01       3.15e+03
       9             15         1.9522e+03      2.09e+02       2.71e-01       4.31e+03
      10             16         1.7362e+03      2.16e+02       6.77e-02       3.15e+03
      11             18         1.6786e+03      5.76e+01       3.39e-02       1.98e+03
      12             19         1.6756e+03      3.04e+00       3.39e-02       7.37e+01
      13             20         1.6708e+03      4.78e+00       6.77e-02       6.63e+01
      14             21         1.6628e+03      8.05e+00       1.35e-01       2.49e+02
      15             22         1.6525e+03      1.02e+01       2.71e-01       9.86e+02
      16             24         1.6504e+03      2.17e+00       1.35e-01       2.46e+02
      17             26         1.6500e+03      3.46e-01       6.77e-02       6.08e+01
      18             29         1.6500e+03      8.61e-03       8.47e-03       8.63e-01
      19             31         1.6500e+03      1.42e-03       4.23e-03       1.78e-01
      20             33         1.6500e+03      5.45e-05       2.12e-03       6.50e-03
      21             38         1.6500e+03      1.30e-08       1.65e-05       1.46e-03
`ftol` termination condition is satisfied.
Function evaluations 38, initial cost 1.5649e+04, final cost 1.6500e+03, first-order optimality 1.46e-03.
Traceback (most recent call last):
  File "C:\Users\RaenonX\Documents\Projects\PokemonSleep\pokemon-sleep-scraper\run_rp_model.py", line 10, in <module>
    main()
  File "C:\Users\RaenonX\Documents\Projects\PokemonSleep\pokemon-sleep-scraper\run_rp_model.py", line 6, in main
    get_rp_model_df()
  File "C:\Users\RaenonX\Documents\Projects\PokemonSleep\pokemon-sleep-scraper\calc\entry.py", line 5, in get_rp_model_df
    return get_rp_model_result("results/{hash_id}.pickle")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\RaenonX\Documents\Projects\PokemonSleep\pokemon-sleep-scraper\calc\rp_model\api.py", line 89, in get_rp_model_result
    return update_fit_cached()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\RaenonX\Documents\Projects\PokemonSleep\pokemon-sleep-scraper\calc\rp_model\api.py", line 27, in update_fit_cached
    .save_to(FitOptions.result_file)
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\RaenonX\Documents\Projects\PokemonSleep\pokemon-sleep-scraper\calc\rp_model\rp_model\utils\store.py", line 34, in save_to
    save(path, self)
  File "C:\Users\RaenonX\Documents\Projects\PokemonSleep\pokemon-sleep-scraper\calc\rp_model\rp_model\utils\files.py", line 19, in save
    with open(filepath, "wb") as handle:
         ^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'results/{hash_id}.pickle'
```
```python
from calc.rp_model import get_rp_model_result


def get_rp_model_df():
    return get_rp_model_result("results/{hash_id}.pickle")
```
Ok, I now convert the incoming path of get_rp_model_result to an absolute path based on the file's directory.
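Something along these lines, assuming pathlib (the function name is illustrative):

```python
from pathlib import Path


def resolve_against_module_dir(path_like: str) -> Path:
    """Interpret a relative path as relative to this file's directory rather than the CWD."""
    path = Path(path_like)
    return path if path.is_absolute() else (Path(__file__).parent / path).resolve()
```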
> Ok, I now convert the incoming path of get_rp_model_result to an absolute path based on the file's directory.
Had to remove L6 of the root `__init__.py`; everything else looks fine.
Solutions that might work:
Instead of hashing, we can just use either of the ones mentioned above to grab the "latest" cached file, and if the data count or the file hash mismatches, remove the cached file and reload the Pokemon data.
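A rough sketch of that flow, assuming the hash is embedded in the cached filename (the pattern and names are placeholders):

```python
import os
from glob import glob
from typing import Optional


def load_latest_if_valid(current_hash: str, pattern: str = "results/*.pickle") -> Optional[str]:
    """Return the newest cached file if its hash matches the current data; otherwise drop it."""
    files = sorted(glob(pattern), key=os.path.getmtime, reverse=True)
    if not files:
        return None
    latest = files[0]
    # The filename is assumed to look like `results/<hash>.pickle`
    if os.path.basename(latest).split(".")[0] != current_hash:
        os.remove(latest)  # Stale: the source data changed, so reload the Pokemon data and refit
        return None
    return latest
```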