ryanhugh / searchneu

Search over Classes, Professors and Employees at NEU!
https://searchneu.com
GNU Affero General Public License v3.0

Optimization of how we process courses #54

Closed edward-shen closed 6 years ago

edward-shen commented 6 years ago

Currently, if my understanding is correct, we parse the data in the following manner.

  1. Collect and store the data.
  2. Parse the data into our JSON format.
  3. Run processors over our stored data, injecting and modifying data when necessary.

The issue with this is that every time we add a new feature to our data, we have to reprocess the entire database. I suggest implementing a single processor that accepts other processor functions and calls those functions as it processes each class, kind of like an Observer pattern. This way, adding a new processor does not add another O(n) pass over the data.
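
A minimal sketch of the suggested design, written in Python for brevity rather than in the project's own language. Every name here (`ProcessorRunner`, `add_display_name`, `mark_has_prereqs`, and the record fields) is hypothetical and not taken from the searchneu codebase:

```python
from typing import Callable, Dict, List

# Hypothetical shape: each class is a plain dict, like a parsed JSON record.
ClassRecord = Dict[str, object]
Processor = Callable[[ClassRecord], None]


class ProcessorRunner:
    """Runs every registered processor on each class in a single pass."""

    def __init__(self) -> None:
        self._processors: List[Processor] = []

    def register(self, processor: Processor) -> None:
        self._processors.append(processor)

    def run(self, classes: List[ClassRecord]) -> None:
        for cls in classes:  # the only loop over the data
            for processor in self._processors:
                processor(cls)


def add_display_name(cls: ClassRecord) -> None:
    # Hypothetical processor: derive a field from existing ones.
    cls["displayName"] = f"{cls['subject']} {cls['classId']}"


def mark_has_prereqs(cls: ClassRecord) -> None:
    # Hypothetical processor: inject a boolean flag.
    cls["hasPrereqs"] = bool(cls.get("prereqs"))


classes = [
    {"subject": "CS", "classId": "2500", "prereqs": []},
    {"subject": "CS", "classId": "2510", "prereqs": ["CS 2500"]},
]

runner = ProcessorRunner()
runner.register(add_display_name)
runner.register(mark_has_prereqs)
runner.run(classes)
```

Adding a feature would then mean registering one more function rather than writing another loop over the whole data set.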

ryanhugh commented 6 years ago

I can definitely see where you are coming from, but right now all of the processors combined only take around 500-1500ms to run on the entire set of data. Each one takes ~200ms to run. Compared to the scraping (~30min), this is pretty quick and is a small percentage of the total time spent fetching all the data.

From what it sounds like, you are suggesting that we "factor out" the for loop over the data from each one of the processors. So there would be one loop over the data and then, during the loop, each processor would be called. However, different processors loop over different pieces of data (classes, terms, sections, etc.) and combine pieces from different parts of the data set in different ways. It would be hard to abstract away all the various ways to loop over these and keep track of everything that each processor wants to keep track of.

I think this change would probably speed up the processors a little bit, but it wouldn't actually reduce the time complexity, because O(5*n) = O(n).
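
To illustrate the complexity point, here is a small sketch (Python, with hypothetical no-op processors and a made-up data size) that counts processor invocations both ways. The single-loop version makes exactly as many calls as the loop-per-processor version, so it can only save loop overhead:

```python
def run_separately(processors, items):
    """One full loop over the data per processor."""
    calls = 0
    for processor in processors:
        for item in items:
            processor(item)
            calls += 1
    return calls


def run_fused(processors, items):
    """One loop over the data, calling every processor per item."""
    calls = 0
    for item in items:
        for processor in processors:
            processor(item)
            calls += 1
    return calls


def noop(item):
    # Stand-in for a real processor; does no work.
    return None


processors = [noop] * 5        # assumed: five processors
items = list(range(1000))      # assumed: 1000 records

separate_calls = run_separately(processors, items)
fused_calls = run_fused(processors, items)
```

Both counts come out to 5 * 1000, which is why the total work stays linear in the number of records either way.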

Also, the processors should be capable of being re-run on data that they have already processed, and in that case they shouldn't modify it at all. They are also much faster when they are run on data they have already processed (~200-300 ms for all processors).
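
As a rough illustration of that re-run property (not the actual processor code), a processor can check whether its output is already present and skip the write. The function name and record fields below are hypothetical:

```python
import copy


def add_display_name(cls):
    """Hypothetical idempotent processor: returns True only if it changed the record."""
    expected = f"{cls['subject']} {cls['classId']}"
    if cls.get("displayName") == expected:
        return False  # already processed, leave the record untouched
    cls["displayName"] = expected
    return True


record = {"subject": "CS", "classId": "2500"}

changed_first = add_display_name(record)   # first run fills in the field
snapshot = copy.deepcopy(record)
changed_second = add_display_name(record)  # second run finds nothing to do
```

After the second run the record is identical to the snapshot taken after the first.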

TLDR: The processors are pretty fast, and I don't think it's worth putting a lot of effort into optimizing them even further. :)

ryanhugh commented 6 years ago

Going to close this for now - feel free to re-open :)