osuosl / orvsd_central


make siteinfo data gathering smarter #21

Closed justinnoah closed 10 years ago

justinnoah commented 10 years ago

Because we don't have a comprehensive list of all the school sites that exist, the siteinfo gathering script has to guess whether siteinfo belongs to an existing or new school. If it can't find an existing school with the same base_url, it assumes a new school.

The main problem is that when a new school is made, the district for that school is not known - we should add some logic to help it guess what district the new school belongs in.

The best method is probably a comparison of the incoming data's base_url with existing schools - other schools with the same root domain are probably related, so we can guess they belong to the same district.

For instance, a new site 'Tillamook School District' has a base_url of 'tillamook.orvsd.org'. There are several schools already existing in the Tillamook district, but their domain fields are all subdomains: eastelem.tillamook.orvsd.org, tillamookhigh.tillamook.orvsd.org, etc. So this is a legitimate new site, (probably) belonging to a legitimate new school, but we can guess that it should belong to the same district as schools with similar root domains.

This logic should be simple enough to implement in the gather_siteinfo method. If no guess can be made, the default "Unknown" district should be assigned; we will eventually implement a UI to assign these orphans to the correct district.
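
A rough sketch of the guessing step, not the actual gather_siteinfo code. The function names, the `(domain, district_name)` tuple layout, and the suffix-matching rule are all assumptions made for illustration; the real models in orvsd_central may look different.

```python
from collections import Counter

# Fallback district name, as described above.
UNKNOWN_DISTRICT = "Unknown"


def normalize_host(base_url):
    """Lowercase the value and strip any scheme, path, or trailing dot."""
    host = base_url.strip().lower()
    if "://" in host:
        host = host.split("://", 1)[1]
    return host.split("/", 1)[0].rstrip(".")


def is_related(host_a, host_b):
    """True if the hosts are equal or one is a subdomain of the other."""
    return (
        host_a == host_b
        or host_a.endswith("." + host_b)
        or host_b.endswith("." + host_a)
    )


def guess_district(base_url, existing_schools):
    """Guess a district name for a new site from related school domains.

    existing_schools is a hypothetical iterable of (domain, district_name)
    pairs. The district that appears most often among related domains wins;
    if nothing is related, the "Unknown" district is returned.
    """
    new_host = normalize_host(base_url)
    votes = Counter(
        district
        for domain, district in existing_schools
        if is_related(normalize_host(domain), new_host)
    )
    if not votes:
        return UNKNOWN_DISTRICT
    return votes.most_common(1)[0][0]
```

With the Tillamook example, `guess_district("tillamook.orvsd.org", schools)` would pick the district of eastelem.tillamook.orvsd.org and tillamookhigh.tillamook.orvsd.org. Two caveats of this sketch: sibling subdomains are not matched to each other, and an existing school whose domain is a bare shared root (such as orvsd.org) would match everything beneath it, so such rows would need to be excluded.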

dean commented 10 years ago

This is partially (or maybe even wholly) implemented in develop. Check the gather_siteinfo method for code that does what this is asking. IIRC it was merged in without being tested, but I did eventually test it on prod and it worked fine. There was nothing to compare the results against, however, so I'm unsure whether the sorting actually improved.

justinnoah commented 10 years ago

@dean With the way we import data (once merged, see #14), we don't have a list of base_urls to compare to, so:

Because we don't have a comprehensive list of all the school sites that exist, the
siteinfo gathering script has to guess whether siteinfo belongs to an existing or
new school. If it can't find an existing school with the same base_url, it assumes a
new school.

is not a possible way of discovery. School / Site name comparisons will have to be made in order to match schools and sites.
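
One possible shape for the name comparison, using fuzzy matching from Python's standard `difflib` module. This is only a sketch: the function name, the plain list of school names, and the 0.8 cutoff are assumptions, not anything already in the repo.

```python
import difflib


def match_school_by_name(site_name, school_names, cutoff=0.8):
    """Return the existing school name closest to site_name, or None.

    Comparison is case-insensitive. cutoff is the minimum similarity
    ratio (0.0 to 1.0) that difflib must report for a match to count.
    """
    # Map lowercased names back to their original spelling.
    lookup = {name.lower(): name for name in school_names}
    hits = difflib.get_close_matches(
        site_name.lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None
```

A site that comes back as `None` would be treated as a new school. The cutoff would need tuning against real site names, since a low value risks merging distinct schools with similar names.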