jeancroy / RP-fit


Remove old Pokemon data file automatically #3

Open · RaenonX opened this issue 1 year ago

RaenonX commented 1 year ago

Solutions that might work:

Instead of hashing, we can just use either of the options mentioned above to grab the "latest" cached file. If the data count or the file hash mismatches, remove the cached file and reload the Pokemon data.
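
A minimal sketch of that check-and-invalidate flow (the function name and the pickle layout here are assumptions for illustration, not the repo's actual API):

import hashlib
import os
import pickle

def cache_is_fresh(cache_path: str, source_bytes: bytes) -> bool:
    """Reuse the cached file only if it was built from the same source data."""
    expected_hash = hashlib.sha256(source_bytes).hexdigest()
    if not os.path.exists(cache_path):
        return False
    with open(cache_path, "rb") as handle:
        cached = pickle.load(handle)  # assumed layout: {"hash": ..., "data": ...}
    if cached.get("hash") != expected_hash:
        os.remove(cache_path)  # stale cache: drop it so the caller reloads
        return False
    return True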

jeancroy commented 1 year ago

My first pass idea about something like this was in https://github.com/jeancroy/RP-fit/blob/main/rp_model/utils/files.py#L36

Basically, it uses the fact that the filenames all follow the same pattern except for the hash part. Then it uses filesystem metadata to order them and keeps the top K.

If everything is copied to a CI worker, I can see filesystem metadata not being the most reliable.

An alternative design is a single file, always with the same name, but where the pickled data is wrapped with metadata about when the result is valid (i.e. the hash of the data and maybe the date of the last fit).
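
A rough sketch of that keep-top-K idea (the glob pattern is hypothetical; it just stands in for the shared filename pattern):

import os
from glob import glob

def keep_top_k(pattern: str, k: int) -> None:
    """Delete all but the k most recently modified files matching `pattern`."""
    # Order by filesystem mtime, newest first; this is exactly the metadata
    # that may not survive a copy to a CI worker.
    files = sorted(glob(pattern), key=os.path.getmtime, reverse=True)
    for stale in files[k:]:
        os.remove(stale)

# e.g. keep_top_k("results/result-*.pickle", k=3)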

jeancroy commented 1 year ago

Date in the filename tells us which are the old results to delete, but not whether the new one is still valid. The whole idea behind download_sheet was that working with the gsheet API and its auth was a bit of a pain.

RaenonX commented 1 year ago

> My first pass idea about something like this was in https://github.com/jeancroy/RP-fit/blob/main/rp_model/utils/files.py#L36
>
> Basically, it uses the fact that the filenames all follow the same pattern except for the hash part. Then it uses filesystem metadata to order them and keeps the top K.
>
> If everything is copied to a CI worker, I can see filesystem metadata not being the most reliable.
>
> An alternative design is a single file, always with the same name, but where the pickled data is wrapped with metadata about when the result is valid (i.e. the hash of the data and maybe the date of the last fit).

That's likely what happens, and I am not sure whether the file metadata gets copied over. I would advise against relying on file metadata, as it tends to be error-prone.

A single file with the same name plus a metadata wrapper should do the trick (I actually thought about that too).

RaenonX commented 1 year ago

> Date in the filename tells us which are the old results to delete, but not whether the new one is still valid. The whole idea behind download_sheet was that working with the gsheet API and its auth was a bit of a pain.

Could you elaborate on the auth thing? From what I can see, it only uses a certain URL to export the data as CSV; I'm not seeing the auth part. I'm also not really getting the "but not whether the new one is still valid" part, do you have an example?

jeancroy commented 1 year ago

I think I missed some words in that sentence, sorry about that.

Basically, the last time I looked at the Google Sheets API, it needed a per-document dev key setup, and I didn't have access to the sheet. So I just used the dev console (F12) and checked the network data when you download the sheet as CSV.

The goal of the hash is to invalidate the previous fit if the source data changed. But sometimes the source data will not change, and I wanted to avoid redoing the fit. Without a hash, it was hard to detect changes in the source data.

For example, I don't really see a last-modified date here: https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/sheets#SheetProperties
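
For reference, that network-tab approach boils down to hitting the same export endpoint the browser uses for File > Download > CSV, which needs no API key for sheets shared via link (the IDs below are placeholders):

import urllib.request

SHEET_ID = "your-spreadsheet-id"  # placeholder
GID = "0"  # tab id, visible in the sheet's URL after `gid=`

url = f"https://docs.google.com/spreadsheets/d/{SHEET_ID}/export?format=csv&gid={GID}"
with urllib.request.urlopen(url) as response:
    csv_bytes = response.read()  # raw CSV of that single tab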

jeancroy commented 1 year ago

> A single file with the same name plus a metadata wrapper should do the trick (I actually thought about that too).

Yeah, I should probably go with some kind of ConditionallyValidData or HashValidatedData abstraction with its own save and load. Then there's no difference between the pickle not existing and the pickle existing with invalid data.

I kinda like the parallel with memoization, and in that context multiple saved results are not bad. But that's not really how we use those now.
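
A minimal sketch of what such a wrapper might look like (the class name follows the comment above, but everything else is an assumption, not the code that was later pushed):

import pickle
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class HashValidatedData:
    data_hash: str  # hash of the source data the payload was computed from
    payload: Any    # e.g. the fit result

    def save_to(self, path: str) -> None:
        with open(path, "wb") as handle:
            pickle.dump(self, handle)

    @staticmethod
    def load_if_valid(path: str, current_hash: str) -> Optional[Any]:
        # A missing pickle and a pickle with a stale hash behave the same:
        # both return None, so the caller simply redoes the fit.
        try:
            with open(path, "rb") as handle:
                wrapped = pickle.load(handle)
        except FileNotFoundError:
            return None
        return wrapped.payload if wrapped.data_hash == current_hash else None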

RaenonX commented 1 year ago

> I think I missed some words in that sentence, sorry about that.
>
> Basically, the last time I looked at the Google Sheets API, it needed a per-document dev key setup, and I didn't have access to the sheet. So I just used the dev console (F12) and checked the network data when you download the sheet as CSV.
>
> The goal of the hash is to invalidate the previous fit if the source data changed. But sometimes the source data will not change, and I wanted to avoid redoing the fit. Without a hash, it was hard to detect changes in the source data.
>
> For example, I don't really see a last-modified date here: https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/sheets#SheetProperties

Seems like that needs the Drive API. Personally I feel hashing should be enough; we can just include the timestamps in whatever we're looking at.

> I kinda like the parallel with memoization, and in that context multiple saved results are not bad. But that's not really how we use those now.

I'd say given the folder structure below:

files/
  1700068475-deadbeef/
    pokemon.pickle
  1699988888-abcdefaa/
    pokemon.pickle

With the Python pseudo code below:

from __future__ import annotations

import os
from dataclasses import dataclass
from datetime import datetime

# `order=True` implements the custom sorting the TODO asked for: instances
# compare field by field, so `created_at` decides the order.
# https://blog.logrocket.com/understanding-python-dataclasses/
@dataclass(order=True)
class DirectoryMeta:
    created_at: datetime
    hash_: str

    @staticmethod
    def parse_from_name(name: str) -> DirectoryMeta:
        # Directory names look like `1700068475-deadbeef`:
        # a unix timestamp, a dash, then the data hash.
        created_at_str, hash_ = name.split("-", 1)
        return DirectoryMeta(
            created_at=datetime.fromtimestamp(int(created_at_str)),
            hash_=hash_,
        )

def find_cached(hash_of_the_current: str) -> None:
    # Newest directories first
    sorted_directory_meta = sorted(
        (DirectoryMeta.parse_from_name(d) for d in os.listdir("./files")),
        reverse=True,
    )
    for meta in sorted_directory_meta:
        if meta.hash_ == hash_of_the_current:
            return  # Terminates because the current hash matches
        ...  # Do whatever is needed if the current hash doesn't match

Both memoization and parallel results are achieved. With some tweaks, such as adding an options class, we could also allow overriding which result we're looking for, if needed.

jeancroy commented 1 year ago

Ok, I pushed something to dev that explores the wrapper idea. You can tell me what you think of it.

I agree there could be something like making the settings part of the path, so that for given settings, data changes are handled by the store. But for now I don't feel we use parallel results enough to warrant that effort.

jeancroy commented 1 year ago

Btw, as-is I can't run api.py the `if __name__ == "__main__"` way; I get ImportError: attempted relative import with no known parent package. I can remove the . in front of .rp_model and that fixes it for me, but I'm sure it breaks things for you, so I've put it back.
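
For context, that error is standard Python behavior rather than anything specific to this repo; a sketch of the trade-off (the module path is illustrative, taken from the traceback further down):

# calc/rp_model/api.py  (illustrative layout)
from .rp_model import utils  # relative import: needs a parent package

# `python api.py` runs the file as a top-level script, so there is no
# parent package and the relative import raises:
#   ImportError: attempted relative import with no known parent package
# `python -m calc.rp_model.api` (from the project root) executes it as a
# package module instead, so the leading dot resolves and can stay.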

RaenonX commented 1 year ago

Tested it and I'm getting this; I think the path passed in has to be converted to an absolute path somehow.

Error Log

RP model file path: C:\Users\RaenonX\Documents\Projects\PokemonSleep\pokemon-sleep-scraper\data\rp_model
RP model pickle hash mismatch, generating new file...
   Iteration     Total nfev        Cost      Cost reduction    Step norm     Optimality   
       0              1         1.5649e+04                                    1.18e+07    
       1              3         8.3648e+03      7.28e+03       2.17e+00       1.61e+06    
       2              7         6.2849e+03      2.08e+03       3.39e-02       1.34e+06    
       3              8         5.5675e+03      7.17e+02       3.39e-02       5.03e+05    
       4              9         5.3717e+03      1.96e+02       3.39e-02       3.15e+03    
       5             10         5.1534e+03      2.18e+02       6.77e-02       3.14e+03    
       6             11         4.7244e+03      4.29e+02       1.35e-01       3.14e+03    
       7             12         3.8694e+03      8.55e+02       2.71e-01       3.14e+03    
       8             13         2.1612e+03      1.71e+03       5.42e-01       3.15e+03    
       9             15         1.9522e+03      2.09e+02       2.71e-01       4.31e+03    
      10             16         1.7362e+03      2.16e+02       6.77e-02       3.15e+03    
      11             18         1.6786e+03      5.76e+01       3.39e-02       1.98e+03    
      12             19         1.6756e+03      3.04e+00       3.39e-02       7.37e+01    
      13             20         1.6708e+03      4.78e+00       6.77e-02       6.63e+01    
      14             21         1.6628e+03      8.05e+00       1.35e-01       2.49e+02    
      15             22         1.6525e+03      1.02e+01       2.71e-01       9.86e+02    
      16             24         1.6504e+03      2.17e+00       1.35e-01       2.46e+02    
      17             26         1.6500e+03      3.46e-01       6.77e-02       6.08e+01    
      18             29         1.6500e+03      8.61e-03       8.47e-03       8.63e-01    
      19             31         1.6500e+03      1.42e-03       4.23e-03       1.78e-01    
      20             33         1.6500e+03      5.45e-05       2.12e-03       6.50e-03    
      21             38         1.6500e+03      1.30e-08       1.65e-05       1.46e-03    
`ftol` termination condition is satisfied.
Function evaluations 38, initial cost 1.5649e+04, final cost 1.6500e+03, first-order optimality 1.46e-03.
Traceback (most recent call last):
  File "C:\Users\RaenonX\Documents\Projects\PokemonSleep\pokemon-sleep-scraper\run_rp_model.py", line 10, in <module>
    main()
  File "C:\Users\RaenonX\Documents\Projects\PokemonSleep\pokemon-sleep-scraper\run_rp_model.py", line 6, in main
    get_rp_model_df()
  File "C:\Users\RaenonX\Documents\Projects\PokemonSleep\pokemon-sleep-scraper\calc\entry.py", line 5, in get_rp_model_df
    return get_rp_model_result("results/{hash_id}.pickle")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\RaenonX\Documents\Projects\PokemonSleep\pokemon-sleep-scraper\calc\rp_model\api.py", line 89, in get_rp_model_result
    return update_fit_cached()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\RaenonX\Documents\Projects\PokemonSleep\pokemon-sleep-scraper\calc\rp_model\api.py", line 27, in update_fit_cached
    .save_to(FitOptions.result_file)
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\RaenonX\Documents\Projects\PokemonSleep\pokemon-sleep-scraper\calc\rp_model\rp_model\utils\store.py", line 34, in save_to
    save(path, self)
  File "C:\Users\RaenonX\Documents\Projects\PokemonSleep\pokemon-sleep-scraper\calc\rp_model\rp_model\utils\files.py", line 19, in save
    with open(filepath, "wb") as handle:
         ^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'results/{hash_id}.pickle'

Script

from calc.rp_model import get_rp_model_result

def get_rp_model_df():
    return get_rp_model_result("results/{hash_id}.pickle")

jeancroy commented 1 year ago

Ok, I converted the incoming path of get_rp_model_result to an absolute path based on the file's directory.
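
Presumably something along these lines (a sketch of the idea, not the actual commit; the helper name is made up):

from pathlib import Path

def _resolve_from_file_dir(path_like: str) -> Path:
    # Anchor relative paths at this module's directory instead of the
    # caller's working directory, so "results/{hash_id}.pickle" lands
    # next to the rp_model code regardless of where the script runs.
    path = Path(path_like)
    return path if path.is_absolute() else Path(__file__).parent / path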

RaenonX commented 1 year ago

> Ok, I converted the incoming path of get_rp_model_result to an absolute path based on the file's directory.

I had to remove L6 of the root __init__.py; everything else looks fine.