w0rd-driven / beatseek

Seek the music you may be missing by analyzing your collection
https://beatseek.fly.dev/
MIT License

Oban workers #8

Closed w0rd-driven closed 1 year ago

w0rd-driven commented 1 year ago

I don't know if scanning or checking needs a GenServer. The UI needs to block full scans, as well as full or individual checks, while they are currently happening. We can also run queries like "get me the artists that were checked over 7 days ago".

We could also reschedule artist checks as the jobs complete. It may be ideal to keep the UI separate from jobs: jobs could handle the periodic automatic checks, with the UI handling one-offs. If the UI updates a timestamp, a job can do its calculations first and schedule itself for a later time.

I don't want to spend too much time on edge cases, though, so the UI may not be a thing to cover at all.

I want Oban for the retry parts, but I could make sure the UI only creates jobs if one doesn't already exist. That may be the play here.
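
Oban's unique option covers this directly. A minimal sketch, assuming hypothetical Beatseek module and queue names:

```elixir
defmodule Beatseek.Workers.ScanWorker do
  # `unique` makes Oban.insert/2 a no-op (it returns the existing job)
  # while a matching job is already available, scheduled, or executing.
  use Oban.Worker,
    queue: :scan,
    unique: [period: :infinity, states: [:available, :scheduled, :executing]]

  @impl Oban.Worker
  def perform(%Oban.Job{}) do
    # ... run the full scan here ...
    :ok
  end
end
```

With that in place, the UI could call `Beatseek.Workers.ScanWorker.new(%{}) |> Oban.insert()` unconditionally and let Oban dedupe.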

w0rd-driven commented 1 year ago

This is a use case for doing a single full scan on a schedule.

https://hexdocs.pm/oban/reliable-scheduling.html
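
The guide's core pattern, sketched for a hypothetical full-scan worker, with a unique option added so duplicate schedules can't pile up: the job re-enqueues itself before doing the work, so the cadence persists even if a run fails.

```elixir
defmodule Beatseek.Workers.ScheduledScanWorker do
  # Uniqueness over waiting states only, so the re-enqueue below isn't
  # blocked by the currently executing job.
  use Oban.Worker,
    queue: :scan,
    unique: [period: :infinity, states: [:available, :scheduled]]

  @seven_days 7 * 24 * 60 * 60

  @impl Oban.Worker
  def perform(%Oban.Job{args: args}) do
    # Enqueue the next run first so the 7-day schedule survives a crash below.
    args
    |> new(schedule_in: @seven_days)
    |> Oban.insert!()

    # ... run the full scan here ...
    :ok
  end
end
```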

This is a use case for checking the API, as it's technically a backfill.

https://hexdocs.pm/oban/recursive-jobs.html
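
And the recursive pattern for the API check, again a sketch, with hypothetical context functions standing in for the real ones:

```elixir
defmodule Beatseek.Workers.VerifyArtistWorker do
  use Oban.Worker, queue: :verification

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"artist_id" => artist_id}}) do
    Beatseek.Artists.verify(artist_id)

    case Beatseek.Artists.next_artist_id(artist_id) do
      # No more artists: the backfill is complete.
      nil -> :ok
      # Otherwise, enqueue a job for the next artist and finish this one.
      next_id -> %{artist_id: next_id} |> new() |> Oban.insert()
    end
  end
end
```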

w0rd-driven commented 1 year ago

The full scan interval should be once every 7 days, and it should send a notification on completion. The album_id would be null here. If we trigger a scan via the UI, it doesn't matter whether the scheduled run will happen soon or not. We don't want the two to overlap, though upserts should catch some race conditions. If the scheduled scan detects a UI-triggered scan, it shouldn't bother scanning and should just reschedule for 7 days from now.

The UI lock should be a GenServer, as it can store the associated state. Oban could use a client call function that errors the same way logic in a perform() would. This is likely better than a cache or database because we don't care to persist execution, and we don't have to revert a state change on error. What I don't know offhand is whether we would need to chain events to persist state internally. The first call would set state and trigger another event that did the work and unset state. That seems like overkill.
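
A minimal sketch of that lock idea, assuming a hypothetical Beatseek.ScanLock process:

```elixir
defmodule Beatseek.ScanLock do
  # Holds a single boolean so UI triggers and jobs can refuse to start a
  # second scan. Nothing is persisted, so the error path only needs
  # `release/0` rather than reverting a database row.
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, false, name: __MODULE__)

  def acquire, do: GenServer.call(__MODULE__, :acquire)
  def release, do: GenServer.cast(__MODULE__, :release)

  @impl GenServer
  def init(locked?), do: {:ok, locked?}

  @impl GenServer
  def handle_call(:acquire, _from, true), do: {:reply, {:error, :already_running}, true}
  def handle_call(:acquire, _from, false), do: {:reply, :ok, true}

  @impl GenServer
  def handle_cast(:release, _state), do: {:noreply, false}
end
```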

This is where automated testing really shines, though. I think UI triggers would be an ideal first step. The schedules are nice-to-haves.

w0rd-driven commented 1 year ago

The API check can always be a recursive job. Querying where checked_at is nil would backfill data. It's tempting to also include artists checked more than 7 days ago on a full-system check.
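
As an Ecto sketch (schema and field names assumed):

```elixir
import Ecto.Query

# Artists never checked; optionally also those last checked over 7 days ago.
backfill_query =
  from a in Beatseek.Artists.Artist,
    where: is_nil(a.checked_at) or a.checked_at < ago(7, "day")
```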

For an individual artist check, we wouldn't care in either case.

We could likely use individual workers for each case, or use pattern matching on unique arguments. My guess is that the tests that implemented each piece of functionality could be repurposed to verify our complex rules.
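
The pattern-matching route could look like this, one worker with two clauses rather than separate modules (names are hypothetical):

```elixir
defmodule Beatseek.Workers.CheckWorker do
  use Oban.Worker, queue: :verification

  @impl Oban.Worker
  # An individual artist check when an id is given.
  def perform(%Oban.Job{args: %{"artist_id" => artist_id}}) do
    Beatseek.Artists.verify(artist_id)
  end

  # A full check when no artist_id is present in the args.
  def perform(%Oban.Job{args: %{}}) do
    Beatseek.Artists.verify_all()
  end
end
```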

There could still be race conditions if you schedule an individual check while a full check is working through that particular artist. The date condition could work, but it won't be updated until completion.

I'm spending so much brain power on this, if I'm honest. Way too much for a capstone/MVP.

A full scan and a full check on demand, triggered only via the UI, seems like the play. This can still be problematic even if the GenServer schedules one job per artist. On the other hand, a completely blocking operation would negate the need for Oban entirely. The API needs retries and exponential backoff too, which Oban provides. I think the play is to just implement a naive approach and race conditions be damned.

What I don't quite know in either case is how to schedule that first run. Do you check Oban on startup? There are functions in the docs on testing that might be what to use here. For a full scan there should only ever be one job, period. For a check, that's dicey, as each artist is the unit of work.

w0rd-driven commented 1 year ago

There's no good place in the UI currently for either the full scan or the check. It's also important to remember where this will be deployed. If it's on Fly.io, then the system scan is pointless; the API check is all that matters there. Fly.io can still be, and likely should be, "demo mode".

Scheduling a full scan on startup, only if no other job exists, seems like enough. The first run should schedule 7 days out and then execute. On Fly.io the directories won't exist, so this will effectively do nothing. We don't technically need UI unless something goes wrong.

Notifications could help, or could trigger jobs, but that more likely describes the Action type I brainstormed. Actions are notifications that typically require some user input to complete. Would a "this was scheduled on startup, you don't need to do anything" message make sense as an action when no action is needed? See? Weeds. I'm in a forest of weeds right now and I need KISS and YAGNI badly.

w0rd-driven commented 1 year ago

So, to negate some of the other comments for now: UI triggers are king. I've been seeding artist data by running the function in iex. That has to stop.

API calls desperately need retry and backoff, even if triggered via the UI. Let's take the naive approach to both with a GenServer and leave Oban for a Next or Later implementation. I think it absolutely fits, but a good application needs to give users the ability to correct invalid state.

w0rd-driven commented 1 year ago

After partial sleep for the night, I think I'm making a mountain out of a molehill. I may also be able to quickly push out Actions if I'm lucky.

Oban will always be a queue. If you use it, I believe there will never be a race condition. With Laravel you can have multiple worker processes that could theoretically pick up 2 queued jobs; the 2nd could execute while the first is long-running. I believe this shouldn't happen with Oban. If I queue a full scan and then queue another one, the second should wait for the first. Yeah, I'm executing 2 scans and wasting resources, but the race condition is what I'm worried about: one process starting to write records while another one is running. Even if that happens, upserts should work in theory, with both processes sending PubSub.
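
One caveat worth noting: Oban runs jobs within a queue concurrently up to that queue's limit, so the "second scan waits" behavior only holds when the limit is 1. A sketch of that config, with app and queue names assumed:

```elixir
# config/config.exs
config :beatseek, Oban,
  repo: Beatseek.Repo,
  # A limit of 1 serializes each queue: a second scan job sits in the
  # queue until the first finishes, which sidesteps the overlap worry.
  queues: [scan: 1, verification: 1]
```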

I believe the worst that happens is you run it multiple times. This does pose a problem with the UI, but a full scan could be gated behind a GenServer.

We could also never set disabled. Why do we need to disable an action if it'll always be queued?

Actions would be useful, or notifications saying something like "this is already scheduled", or you could just see that you've scheduled a job. Alternatively, we could create a job display that tracks execution in real time using PubSub. Changing state as a lock is rough, but we could also have a timer on the GenServer that checks for jobs and clears the disabled state when retries are exhausted and the job still errors. Your completed event doesn't trigger if the job never moves beyond your erroneous function; a 60s timer could release the lock, so to speak.

The full API check could disable state for each drop-down as it executes? I think I'm creating a testing nightmare, though, and still being too complicated.

A responsive UI would show scan-started and scan-completed messages even if the menu isn't disabled. The same could be said about individual artist checks, but having to clear 100+ notifications would be a pain. A status bar could help, but that's in my Next bucket. I can't even get a sticky footer working.

w0rd-driven commented 1 year ago

Knowing something is executing, and having an expectation of completion, would be helpful. Users click buttons multiple times when they get no feedback that the first click was successful; we disable a button on the web to prevent this sort of thing.

We could also use Oban as overkill. The batch artist API could be a job that has no retries. Nope, that doesn't work, because we're still tracking execution time for scheduling further jobs, not their actual execution.

I'm thinking about how I can detect if a full check is complete. Users could prevent the poll from showing it's complete by continually scheduling artist jobs. Actually, that's fine with me.

I think I also need to ditch the GenServer until I need it. If the menu components mount as disabled because they're looking for jobs, the poll would handle disabling them as easily as if it were triggered from an event. This makes testing much easier.

For a full scan, there will always be a job present, so a mount check works 100% as well as an event. For a full API check, we could check whether any artist job exists on mount. If we use the same logic as the poll, I think this works in both cases.
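
A sketch of that mount/poll check, querying Oban's own job table (the helper and worker names are assumptions):

```elixir
defmodule Beatseek.Jobs do
  import Ecto.Query

  # True while a scan job is waiting to run, running, or awaiting a retry.
  def scan_running? do
    Oban.Job
    |> where([j], j.worker == "Beatseek.Workers.ScanWorker")
    |> where([j], j.state in ["available", "scheduled", "executing", "retryable"])
    |> Beatseek.Repo.exists?()
  end
end
```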

I believe LiveBeats also sets component ids from record ids, so even though there are 3 sidebar entries, they'd also work for the case where there are 300. Artist 4 would need to disable its component if there's a job scheduled with its id. This also likely means only disabling in handle_params, or whenever the show LiveView has an id.

I don't know if this is all easy to follow at this point, as I need more sleep before a server migration in 2.5 hours. I first need to build the drop-downs before I can work all of this out. Writing tests may be easier if functionality isn't gated in the UI, though. A scan doesn't schedule itself, but an API check likely would have the Oban dependency. We can have web tests, but they feel like integration tests and less like unit tests.

I just need to brainstorm this tomorrow, honestly. I need a second opinion on a reasonably complex topic that won't be easy to explain or point to. Would anyone want to parse all of these comments? I barely want to. I'm hoping I'll have a magical final comment that actually lays all of this out.

I haven't implemented the check API at all, and that's also why I'm stuck here. Oban would simplify it conceptually, but adding a GenServer layer just for testing outside of a LiveView, which is itself a GenServer, is the hang-up. I'll try to read this and write up a summary with questions I can point someone to. I'm pretty sure I'm complicating it greatly.

w0rd-driven commented 1 year ago

I didn't want to change the verified_at field, but it's trivial to unset it for the whole table and walk through the recursive function again. It would still be worth attempting polling, though, as it takes a good 5-10 minutes to churn through everything. I also need to tweak things to create separate queues and drop the retries. I thought the default was 10, but it looks like max_attempts was always 20.
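
The tweak itself is just worker options. A sketch, reusing the hypothetical verification worker from earlier:

```elixir
defmodule Beatseek.Workers.VerifyArtistWorker do
  # max_attempts defaults to 20; 1 disables retries entirely, and a
  # dedicated queue keeps API churn separate from scan jobs.
  use Oban.Worker, queue: :verification, max_attempts: 1

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"artist_id" => artist_id}}) do
    Beatseek.Artists.verify(artist_id)
  end
end
```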

w0rd-driven commented 1 year ago

The scan will need a timestamp, as well as a verify timestamp, in the settings table whenever that gets built; it needs something to anchor when to run. If we randomize artist checking, then it could always look for artists checked more than 7 days ago instead of batching them on the same day/time. Either is fine, I think. It works surprisingly well to churn through fewer than 200 artists, but it does take quite a few minutes even at a really fast clip. I think Oban is adding useful delays here, and we're not even talking more than a second or two max. I could also be way off and it works through multiple calls per second.

w0rd-driven commented 1 year ago

I think Oban can detect completion for both methods.

For a scan, the function will iterate through all directories sequentially. Each directory should have its own scanned_at timestamp. It could blindly schedule 7 days out or always run on a Monday. It's hard to think through how to make sure jobs get scheduled automatically, because nothing kicks off the first schedule. I'd rely on the UI for both as they currently sit, but this task really needs to get done, so the first iteration should be UI triggers. Both could be cron tasks. Actually, they both could run every few days and be fine.

It's also important to talk through how the recursive verification works. next_id always looks for id > current, so it relies on integers. This prevents my erroneous artists from becoming an infinite loop, as Spotify returns no results for them, period. This could change if I got the Albums function working instead of Search, though. If next_id is nil, then it should be done: no more artists to check. I'm sure the SQL is valid now, as this was always returning nil at first. If this holds, then we can set settings.verified_at to a timestamp to use for further scheduling.
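
A sketch of that lookup, the next_artist_id/1 the recursive worker relies on (schema and repo names assumed):

```elixir
defmodule Beatseek.Artists do
  import Ecto.Query

  # Smallest artist id greater than the current one; nil ends the recursion.
  def next_artist_id(current_id) do
    Beatseek.Artists.Artist
    |> where([a], a.id > ^current_id)
    |> order_by(asc: :id)
    |> limit(1)
    |> select([a], a.id)
    |> Beatseek.Repo.one()
  end
end
```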

I think a random interval works better than a cron, and something should be scheduled at all times. If we can detect true completion, then we can schedule the next iteration x hours out, like every 3 days. We're fine with time drift; we want a full scan and verify every 72 hours. Using a date per directory makes sense, but if we run the scan and it omits a directory because that directory hasn't crossed the threshold, that's problematic. We could use the same backfill behavior, though, because we can define next_id queries.

w0rd-driven commented 1 year ago

It also took me quite a bit to realize that the app section where the children and the supervision tree live is the application startup. Livebook uses this to run functions first, and we could also perform checks there to see if the Oban workers are on schedule. I feel like that part is tricky enough to leave until later, as long as there is UI to trigger a full scan and the full or individual verify.
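
A sketch of that startup check in application.ex, assuming the unique option on the scheduled worker from earlier so the insert is a no-op whenever a scan job already exists:

```elixir
defmodule Beatseek.Application do
  use Application

  @impl Application
  def start(_type, _args) do
    children = [
      Beatseek.Repo,
      {Oban, Application.fetch_env!(:beatseek, Oban)},
      BeatseekWeb.Endpoint
    ]

    with {:ok, pid} <-
           Supervisor.start_link(children, strategy: :one_for_one, name: Beatseek.Supervisor) do
      # Hypothetical worker; `unique` makes this safe to run on every boot.
      Beatseek.Workers.ScheduledScanWorker.new(%{}) |> Oban.insert()
      {:ok, pid}
    end
  end
end
```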

A scan is always contiguous too. I could break it down so each directory is its own unit of work, as a pool would potentially speed up scans, but that isn't a silver bullet for what already takes only a few seconds on my collection. If we included the duration calculation, which is a much longer scan, then yes, we should look into techniques like this. The worker could verify the completion of the relational record(s). This also makes sense once you incorporate tracks or other aspects. Because I'm shaving things down to 1 file per directory, the scan is as tight as possible; reading the ID3 tag of every track is going to take significantly longer. If this were a proper music app, I would look at distributing the workload. I still think YAGNI, though, for now at least.
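
If per-directory jobs ever become worth it, the fan-out itself is small. A sketch, where the worker name and :music_root config key are assumptions:

```elixir
# One job per directory; Oban.insert_all/1 takes a list of job changesets.
root = Application.fetch_env!(:beatseek, :music_root)

root
|> File.ls!()
|> Enum.map(&Beatseek.Workers.ScanDirectoryWorker.new(%{directory: Path.join(root, &1)}))
|> Oban.insert_all()
```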

w0rd-driven commented 1 year ago

I want to close this in favor of a separate issue to wire up the frontend. I need more contained issues; this should've been a couple of issues, possibly splitting the work. I tackled what made sense first, as I could trigger a full scan or verify single artists more easily.