numberscope / backscope

Numberscope's back end: responsible for fetching sequences and other data from the On-Line Encyclopedia of Integer Sequences, pre-processing them (factoring, etc.), and storing them.
MIT License

Backend can duplicate work #33

Open gwhitney opened 2 years ago

gwhitney commented 2 years ago

If you make the same time-intensive request in quick succession, so the second request comes in before the backend is done pulling all the info and putting it in the database, is the backend smart enough not to start a long series of requests redundantly twice? Will this cause trouble?

_Originally posted by @katestange in https://github.com/numberscope/backscope/pull/32#discussion_r863250245_

In fact, the backend is currently not smart enough to avoid this, although Postgres should ensure that the database ends up ok. Nevertheless, backscope should be enhanced to know when a sequence is in progress of being grabbed and not start another job doing the same thing.
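One standard way to avoid the duplicated fetch is to keep a registry of in-flight requests, so a second request for the same sequence waits on the first rather than starting over. A minimal sketch (the names `fetch_from_oeis`, `get_sequence`, and the registry itself are invented for illustration; this is not backscope's actual code):

```python
import threading
from concurrent.futures import Future

_in_flight = {}            # OEIS id -> Future for a fetch already running
_lock = threading.Lock()

def fetch_from_oeis(oeis_id):
    # stand-in for the real (slow) OEIS download and database write
    return f"data for {oeis_id}"

def get_sequence(oeis_id):
    """Return sequence data, reusing any fetch already in progress."""
    with _lock:
        fut = _in_flight.get(oeis_id)
        if fut is None:
            fut = Future()
            _in_flight[oeis_id] = fut
            owner = True
        else:
            owner = False
    if owner:
        try:
            fut.set_result(fetch_from_oeis(oeis_id))
        except Exception as exc:
            fut.set_exception(exc)
        finally:
            with _lock:
                del _in_flight[oeis_id]
    # duplicate callers block here until the first fetch finishes
    return fut.result()
```

With this pattern, the second of two near-simultaneous requests never touches the OEIS or the database; it simply shares the first request's result.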

katestange commented 2 years ago

The same issue will apply to backend computations, such as factoring. These may be wildly intensive, so we'll need to be careful there. (Fortunately, the time taken to factor is relatively predictable as a function of the number of digits.)

gwhitney commented 2 years ago

I agree. One question is whether requesting a sequence will initiate the factoring job, or whether a periodic sweep will check for sequences that need factoring. The latter approach won't suffer from replicated work; in the former approach we definitely need to check for duplicate factoring jobs right from the get-go -- we can't just leave things the way they are now and launch into factoring every new sequence that comes along, because some of those jobs might be duplicates. Most likely, given that factoring is expensive, we should have a queue of jobs (rather than running them in parallel), which makes it easy to check whether a sequence is already in the queue -- and then we just have to make sure that a sequence isn't removed from the queue until the database update with its factorization is complete.
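The queue-with-membership-check idea above can be sketched as follows. Everything here is hypothetical (the names `enqueue_factoring`, `factor_sequence`, and `save_factors` are made up); the key point is that the "pending" marker is only dropped *after* the database write, so a re-request in the meantime cannot start a second job:

```python
import queue
import threading

_jobs = queue.Queue()
_pending = set()           # sequences queued or currently being factored
_pending_lock = threading.Lock()

def factor_sequence(oeis_id):
    """Stand-in for the real, expensive factoring step."""
    return {}

def save_factors(oeis_id, factors):
    """Stand-in for the database update."""
    pass

def enqueue_factoring(oeis_id):
    """Queue a factoring job unless one is already pending for this sequence."""
    with _pending_lock:
        if oeis_id in _pending:
            return False   # duplicate: a job is already queued or running
        _pending.add(oeis_id)
    _jobs.put(oeis_id)
    return True

def worker():
    """Single worker loop, so factoring jobs run one at a time."""
    while True:
        oeis_id = _jobs.get()
        try:
            factors = factor_sequence(oeis_id)   # the expensive step
            save_factors(oeis_id, factors)       # database update
        finally:
            # only now is it safe to accept a new job for this sequence
            with _pending_lock:
                _pending.discard(oeis_id)
            _jobs.task_done()
```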

As all this is a bit of work, I think the motivation needs to be a worthwhile visualizer that uses the factorizations, rather than building the factorizer and then looking for an application. When there's a visualizer that needs it, it will be worth building the factorizer. To avoid the chicken & egg problem, the visualizer can be developed on a few "canned" factorizations.

katestange commented 2 years ago

Agreed. Maybe this should motivate me to finish the factorization visualizer I had started. (Another interesting question is whether we should cache factorizations separately from sequences. I would bet that if you did a histogram of how often each integer appears in the OEIS, it's very very far from uniform. So I can imagine there are some very big numbers that would have factorization requests repeatedly, and we could store them in some type of sparse array. What I'm saying here is now veering fairly far from the original issue, but we can revisit these ideas someday.)

gwhitney commented 2 years ago

Oh, I see: you want a separate table of factorizations of individual integers, rather than factoring (the entries of) sequences and storing the factorizations as a column on the sequence entries. There is some sense to that, since the same integer may appear in multiple sequences, so why factor it twice? The counterpoint is that you generally grab factorizations by sequence, and the numbers large enough that refactoring them actually makes a difference likely appear in an expected number of sequences only very slightly over one...

So anyhow, how to factor and store the results does need to be considered; another possibility is a hybrid in which we have one table of factorizations of individual integers, and we "in the background" pre-assemble them into factorizations of sequences for ease/speed of access.
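The hybrid layout described above might look like the following sketch, with plain dicts standing in for the two database tables and naive trial division standing in for a real factoring routine (all names here are invented for illustration):

```python
integer_factors = {}       # "table" caching n -> {prime: exponent}

def trial_factor(n):
    """Naive trial division; a real backend would use a serious library."""
    factors = {}
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors[d] = factors.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

def factor_cached(n):
    """Factor n, reusing the cached result if this integer was seen before."""
    if n not in integer_factors:
        integer_factors[n] = trial_factor(n)
    return integer_factors[n]

def assemble_sequence_factorizations(terms):
    """Pre-assemble the per-sequence factorization "column" from the cache."""
    return [factor_cached(t) for t in terms]
```

A repeated integer (within one sequence or across sequences) is factored only once; the per-sequence list is cheap to assemble from the cache afterward.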

katestange commented 2 years ago

Right, it's this hybrid approach I was imagining. Not sure whether it makes sense or not, because it depends on the overhead vs. the duplicate work saved. Might be a situation in which it is worth collecting some data on usage before deciding whether to implement something like that.

gwhitney commented 2 years ago

The only overhead in the hybrid approach is coding: the only computationally nontrivial activity in the whole chain of events is the actual factoring. So if we knew that a "large" number appearing in a sequence appears in 1.00001 sequences on average, we wouldn't bother with the coding; if we knew the average was 2 sequences, we'd definitely bother. Somewhere in between, it gets grey. Since the downloadable database only includes the first 50-ish terms of each sequence, there's no way to know without systematically downloading a sizable fraction of all b-files, which would violate the license. (Well, maybe a smaller sample would suffice to estimate this expectation, but I'm not a good enough statistician to figure out the estimators, so that would be more work than just coding the hybrid approach.) So frankly, I think the path of least resistance is to factor by sequence, and then in some future semester when numberscope has gotten a lot of use, give an undergrad the project of checking how often we factor some number more than once -- and if it's appreciable, file an issue to switch to the hybrid approach.
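That future "how often do we factor a number twice?" check is a one-liner over usage data. Assuming a log of the integers whose factorizations were actually requested (a hypothetical input; backscope does not currently record this), the wasted-work fraction under per-sequence storage could be estimated like this:

```python
from collections import Counter

def duplicate_factor_rate(requested_integers):
    """Fraction of factoring work that repeats an integer already factored."""
    counts = Counter(requested_integers)
    total = sum(counts.values())
    # every request beyond the first for a given integer is duplicated work
    wasted = total - len(counts)
    return wasted / total if total else 0.0
```

If this comes out appreciably above zero on real usage, that would be the signal to file the issue for the hybrid approach.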