numberscope / backscope

Numberscope's back end: responsible for getting sequences and other data from the On-Line Encyclopedia of Integer Sequences, pre-processing it (factoring etc), and storing it.
MIT License

Switch over to GitHub interface to access some or all of OEIS data #130

Open gwhitney opened 5 months ago

gwhitney commented 5 months ago

See https://github.com/oeis/oeisdata

This could be a much less limited channel than hitting the OEIS web API.

gwhitney commented 1 week ago

Because of #161 and because the new search format actually seems to be harder for us to deal with, we are planning to proceed as described in this issue sooner rather than later, at which point #161 will become moot because we will be doing our own searches as opposed to connecting to the OEIS server. At that point it seems that we will only connect to the OEIS server to download b-files as needed.

However, there are still several implementation choices.

  1. For information that is easily extractable from the OEIS data files, such as the name of a sequence, are we going to just extract that information each time we need it, or are we going to parse it out either (a) once and for all for every sequence or (b) for a given sequence when asked, and store that information in our postgres database? The latter takes up more space (the information is both in the OEIS data files and in our special database) but is likely more responsive.

  2. The code for backscope will have to know where the OEIS data files reside so it can do the various necessary data extractions (whenever we decide it should do them, per item 1). So should those OEIS data files, i.e. the clone of the OEIS data git repository, be:
     a. A git submodule that points to a specific commit of the OEIS data repository?
     b. A git subtree that incorporates a particular state of the OEIS data repository as a collection of files in backscope's repository?
     c. An untracked subdirectory of the backscope repository that is expected to contain some clone of the OEIS data? (This option would likely include tools in the backscope repository to update that subdirectory to OEIS's latest version.)
     d. A sibling directory of the backscope repository, again expected to contain some clone of the OEIS data?
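To make question 1 concrete, here is a minimal sketch of variant (b): parse a name the first time it is requested and reuse the stored value afterwards. It assumes the data files use the OEIS internal format, in which the name sits on a line beginning with `%N`; `extract_name`, `NameCache`, and the `load_text` callback are hypothetical names, and a plain dict stands in for the postgres table.

```python
from typing import Callable, Dict, Optional


def extract_name(seq_text: str) -> Optional[str]:
    """Return the sequence name from OEIS internal-format text, or None."""
    for line in seq_text.splitlines():
        # Assumed line shape: "%N A000045 Fibonacci numbers: ..."
        if line.startswith("%N "):
            parts = line.split(" ", 2)
            if len(parts) == 3:
                return parts[2].strip()
    return None


class NameCache:
    """Parse-on-first-request cache; the dict is a stand-in for a database table."""

    def __init__(self, load_text: Callable[[str], str]) -> None:
        self._load_text = load_text  # hypothetical: reads one sequence's data file
        self._names: Dict[str, Optional[str]] = {}
        self.parse_count = 0  # only here to show how often we actually parse

    def name(self, a_number: str) -> Optional[str]:
        if a_number not in self._names:
            self.parse_count += 1
            self._names[a_number] = extract_name(self._load_text(a_number))
        return self._names[a_number]
```

Variant (a) would amount to calling `name` for every A-number up front, while "extract each time" would skip the cache and call `extract_name` directly, so the parsing code is the same under all three choices.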

Personally, I think 2b is just out because it would make our backscope repository huge. I am not a fan of 2a, because it means that we will need to make a commit to backscope to update the OEIS data to a new version, whereas I think we would prefer to just pull the OEIS data once a week (say in a cron job) without having to touch backscope if nothing else has changed. So I think I am leaning toward 2c. I don't have strong feelings in any direction on question 1.
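As a rough illustration of what the update tool for 2c might look like, the sketch below clones the OEIS data into a directory on first use and fast-forwards it afterwards, so a weekly cron job could call it without touching backscope's own history. The function names are hypothetical, and the clone URL is assumed to be the repository linked above with `.git` appended.

```python
import subprocess
from pathlib import Path
from typing import List

# Assumption: the repository linked in this issue, with ".git" appended.
OEIS_DATA_URL = "https://github.com/oeis/oeisdata.git"


def sync_command(data_dir: Path, url: str = OEIS_DATA_URL) -> List[str]:
    """Build the git command that brings data_dir up to date."""
    if (data_dir / ".git").exists():
        # Existing clone: fast-forward only, so local edits are never merged over.
        return ["git", "-C", str(data_dir), "pull", "--ff-only"]
    # No clone yet: create one at data_dir.
    return ["git", "clone", url, str(data_dir)]


def sync(data_dir: Path) -> None:
    """Run the update; raises CalledProcessError if git fails."""
    subprocess.run(sync_command(data_dir), check=True)
```

Because the data directory would be untracked (for example via `.gitignore`), running `sync` changes nothing that backscope itself commits.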

However, the bottom line is that I don't think implementation on this should commence until @katestange @Vectornaut and I have reached consensus on this design issue, perhaps at an upcoming meeting.

katestange commented 1 week ago

Yes, sounds like a meeting discussion. A few thoughts while I have them. For 2, I agree that a cron job is probably ideal. It raises another point though: how will we run local copies of backscope for testing? Will we need to have the OEIS data on our own machines? For 1, another consideration is whether it may be faster and easier to implement something that doesn't change database behaviour much -- i.e. just replace HTTP calls with local lookups but otherwise have the same database structure etc.

gwhitney commented 1 week ago

> Will we need to have the OEIS data on our own machines?

Personally I don't see a way around that, if the code is running in a regime in which it expects to have the data on the local drive. I think it is only a few gigabytes.
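For local testing, one possible arrangement is to let each developer point backscope at wherever their clone lives. The sketch below is only an illustration: the `OEIS_DATA_DIR` environment variable, the default `oeisdata` subdirectory, and both function names are hypothetical, and the `seq/A000/A000045.seq` layout is an assumption about how the data repository is organized.

```python
import os
from pathlib import Path


def oeis_data_dir(repo_root: Path) -> Path:
    """Locate the OEIS data clone, preferring an explicit override."""
    override = os.environ.get("OEIS_DATA_DIR")  # hypothetical variable name
    if override:
        return Path(override).expanduser()
    # Default: an untracked subdirectory of the backscope checkout.
    return repo_root / "oeisdata"


def seq_file(data_dir: Path, a_number: str) -> Path:
    """Path to one sequence's data file, under the assumed layout."""
    # Assumed layout: seq/<first four characters>/<A-number>.seq
    return data_dir / "seq" / a_number[:4] / f"{a_number}.seq"
```

With the variable unset this behaves like an untracked subdirectory; setting it to a path outside the checkout gives the sibling-directory arrangement without any code change.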

> it may be faster and easier to implement something that doesn't change database behaviour much

Yes, I agree that is the implementation path of least resistance, which is exactly why I raised the issue of what we want in the long run, so that we don't necessarily just follow the path of easiest implementation.