#86 Decoupling the scraper from the backend and editing the scraper to make it more versatile with different quarters

CTrando commented 5 years ago

Make sure to work on a separate branch when you start doing the golang development. Don't worry about the frontend for now, if the request doesn't have a quarter default it to the latest one.

snowme34 commented 5 years ago

Make sure to work on a separate branch when you start doing the golang development. Don't worry about the frontend for now, if the request doesn't have a quarter default it to the latest one.

For the "latest one", since the backend and scraper do not share the same config currently, I just created another config entry for the backend. Don't know if that works

snowme34 commented 5 years ago

For the current error:

sdschedule-scraper     | Traceback (most recent call last):
sdschedule-scraper     |   File "./webreg_scrape_upload.py", line 51, in <module>
sdschedule-scraper     |     main()
sdschedule-scraper     |   File "./webreg_scrape_upload.py", line 23, in main
sdschedule-scraper     |     department_scraper = DepartmentScraper()
sdschedule-scraper     |   File "/app/scraper_impl/department_scraper.py", line 20, in __init__
sdschedule-scraper     |     self.browser = webdriver.Chrome(chrome_options=options, executable_path=DRIVER_PATH)
sdschedule-scraper     |   File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/chrome/webdriver.py", line 75, in __init__
sdschedule-scraper     |     desired_capabilities=desired_capabilities)
sdschedule-scraper     |   File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 154, in __init__
sdschedule-scraper     |     self.start_session(desired_capabilities, browser_profile)
sdschedule-scraper     |   File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 243, in start_session
sdschedule-scraper     |     response = self.execute(Command.NEW_SESSION, parameters)
sdschedule-scraper     |   File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 311, in execute
sdschedule-scraper     |     self.error_handler.check_response(response)
sdschedule-scraper     |   File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 237, in check_response
sdschedule-scraper     |     raise exception_class(message, screen, stacktrace)
sdschedule-scraper     | selenium.common.exceptions.SessionNotCreatedException: Message: session not created: Chrome version must be >= 69.0.3497.0
sdschedule-scraper     |   (Driver info: chromedriver=72.0.3626.69 (3c16f8a135abc0d4da2dff33804db79b849a7c38),platform=Linux 4.9.0-8-amd64 x86_64)
sdschedule-scraper     |

I did some search and realized that this is a problem with versions of chrome and chrome-driver.

The driver in the driver directory seems to mismatch with the version needed by the container.

And I don't quite understand the reason for using this image.

snowme34 commented 5 years ago

We might need to prune the requirements.txt a little bit. I'm trying to figure it out

snowme34 commented 5 years ago

Still working on pruning.

Hard to test since each scraping takes at least 20min.

Maybe we want a feature that our scraper only scrapes for one department for testing.

Don't know if the new Dockerfile and requirements.txt are good

snowme34 commented 5 years ago

Correct me if I'm wrong: it seems that the code creates a new db connection each time data is requested, which should not be the case, right?

I can take a look at this

snowme34 commented 5 years ago

I tried my best to use design patterns, including Fail Fast, singleton for DB, and follow some Go conventions

And also added some comments which follows the godoc rules

Now the logging and error checking is not graceful, might consider some functions for them (or make another closure)

I don't know why the original error-checking if statements do not have "return", will the function terminate properly?

About the panic and logging: I don't think our server should crash if it's just one query failed. And I put all the logic about error handling and logging in the route package

snowme34 commented 5 years ago

The tests do not work yet since I changed the signature of the handlers.

The code compiles but since I don't have a local MySQL instance I will check later

snowme34 commented 5 years ago

I am confused how you tested the code previously. I believe testing should not use real database.

I fixed the tests except one, this is the output:

go test -v ./...                                                                                 
=== RUN   TestGetClassData                                                                         
--- FAIL: TestGetClassData (0.00s)                                                                 
    GetClassData_test.go:73: json: cannot unmarshal object into Go value of type []routes.Subclass 
    GetClassData_test.go:77: Subclasses for CSE 11 should definitely not be 0!                     
=== RUN   TestGetRequestClassDataFails                                                             
--- PASS: TestGetRequestClassDataFails (0.00s)                                                     
=== RUN   TestGetDepartmentSummary                                                                 
--- PASS: TestGetDepartmentSummary (0.00s)                                                         
=== RUN   TestGetDepartmentSummaryFailsOnPost                                                      
--- PASS: TestGetDepartmentSummaryFailsOnPost (0.00s)                                              
=== RUN   TestGetDepartments                                                                       
--- PASS: TestGetDepartments (0.00s)                                                               
=== RUN   TestGetDepartmentsOnPostFails                                                            
--- PASS: TestGetDepartmentsOnPostFails (0.00s)                                                    
=== RUN   TestGetInstructors                                                                       
--- PASS: TestGetInstructors (0.00s)                                                               
=== RUN   TestGetTypes                                                                             
--- PASS: TestGetTypes (0.00s)                                                                     
FAIL                                                                                               
FAIL    github.com/ucsdscheduleplanner/UCSD-Schedule-Planner/backend/test       0.314s

I did not change the code for getClassData at all but the JSON thing does not seem to be working.

snowme34 commented 5 years ago

Brief:

I employed principles including but not limited to SRP, DRY, OO, OCP and so on when typing on the keyboard

Sample code:

const logTagDepartment = "[Department]"

// GetDepartments is a pre-http.HandlerFunc for department route and will become a closure with *DatabaseStruct
func GetDepartments(writer http.ResponseWriter, request *http.Request, ds *db.DatabaseStruct) {

    if request.Method != "GET" {
        errInvalidMethod(logTagDepartment, writer, request)
        return
    }

    queryAndResponse(
        ds, logTagDepartment, writer, request,
        rowScannerOneString,
        "SELECT DISTINCT DEPT_CODE FROM DEPARTMENT",
        "DEPARTMENT")
}

Two functions calls is exactly what is needed.

I did not change the way of query for GetClassData since I do not quite understand the reason behind using a for loop rather than using one query
I did not put too much efforts on re-writing the testing files (no major programming design principles applied) but one test is still failing
Perhaps instead of "db", we should call that file "ds" or "Env" for the environment of the whole server-side software
I left several TODO and TODEL in the code
I'm not extremely confident about the way to handle errors and logging (for routes)
The next step is to write Dockerfile and try to integrate it with other componenets

CTrando commented 5 years ago

So if there is a test failing that is an issue because I made the tests such that the JSON format is what the frontend expects. Also I agree that it's not great to require a database connection in order to run the tests, but these integration tests are what's needed for full confidence, unit tests won't guarantee anything.

snowme34 commented 5 years ago

Thank you so much for your feedback!

I will try to follow them and try to dive deeper! I found this sql query interesting: https://stackoverflow.com/questions/1136380/sql-where-in-clause-multiple-columns

snowme34 commented 5 years ago

Let's use the username: "splanner", shall we?

I refactored things. It compiles

add several structs, like ErrorStruct and QueryStruct
separate queryAndResponse into query and response
rename files based on the recommendations here
more comments
rewrite the beast, class data route, to create a query on the fly

Next step is

add env to remove awkward hard-coded variable
use docker to add integration (test)
more refactor!
~~fix the awkward multiple query for class data by means of row constructor~~
improve error according to https://tools.ietf.org/html/rfc7807
use env instead of *db
closure the query function for handlers with env.db
add caching

snowme34 commented 5 years ago

This is a long pull request.

I finally figure out why the class data route test fails. The test expects a pure array of json. But the code returns a map like this: {"ABC 11": [...], "MATH 12": [...]}

unit tests are suggested to be put right next to the source code by effective go
test directory is fine with integration test

Let's first reach a consensus about out API:

contract
- format
- keys and values
duplicates
- what happens if user requests same classes multiple times

CTrando commented 5 years ago

Ok sure, I can write something up on the contract.

snowme34 commented 5 years ago

Ok sure, I can write something up on the contract.

That will be great! Include but not limit to, api names (do we really need the api prefix? I have no idea), api format, api input format, api input check, cache-ability.

Should work as the first scratch, but somethings existed before, like caching, is not implemented yet.

About the query

Performance of simple query always triumphs. However, back and forth database calls should be generally avoided. That is sync calls between components, which is to put components in series.

But of course, there is never a perfect solution that works optimally for all situations. I think I will be confident if I read the front-end code which calls that api.

CTrando commented 5 years ago

Performance of simple query always triumphs. However, back and forth database calls should be generally avoided. That is sync calls between components, which is to put components in series.

Sure I agree with you that it will be more performant, but I'm saying there is a tradeoff between readability and maintainability in combining all the queries into one. By doing this way we add more edge cases and potential mishaps. That's why you should be confident in your current algorithm before switching to this, but I agree doing it in one query is more performant.

CTrando commented 5 years ago

Ok let's define the format.

We have

api_department?quarter={QUARTER} : Returns a list of departments for the given quarter, e.g ["CSE", "ECE", etc] - This is called when the page initially loads up to get a list of departments to show.
api_classpreview?quarter={QUARTER}&department={DEPARTMENT} : Returns a list of objects of the form: {courseNum: COURSE_NUM, data : {description: DESCRIPTION, instructors: [INSTRUCTORS], types: [TYPES]}}. It's used to show the user a preview of the class, for example it shows the user that 11 is a course number for department CSE. INSTRUCTORS and TYPES are a list of instructors and types such as [Joe Politz, Rick Ord], [LE, DI].
api_classdata: This one is up for debate. We can either make it take in only one class (a department + courseNum pair) and return the data for just that class, or we can have it such that it takes in a list of pairs and it gives results for that entire one. I think maybe it would be better to do it as multiple small requests because it would be easier to process on the backend and frontend.

snowme34 commented 5 years ago

Performance of simple query always triumphs. However, back and forth database calls should be generally avoided. That is sync calls between components, which is to put components in series.

Sure I agree with you that it will be more performant, but I'm saying there is a tradeoff between readability and maintainability in combining all the queries into one. By doing this way we add more edge cases and potential mishaps. That's why you should be confident in your current algorithm before switching to this, but I agree doing it in one query is more performant.

If we restrict the quarter to be identical per query (it's weird for people to schedule multiple quarters on one calendar), then my SQL building function will be much simpler (just creating correct number of question marks)

I was thinking using a api.example.com domain for apis would be better than the "api_" prefix

api_department?quarter={QUARTER}

[{"AAA", "BBB"}]

api_classreview?quarter={QUARTER}&department={DEPARTMENT}

[{courseNum: 99, data : {description: abyss, instructors: ["A", "B"], types: ["LE", "DI"]}}, {courseNum: 100, data : {description: void, instructors: ["A", "B"], types: ["LE", "DI"]}}]

api_classdata

POST with a list of requests

{
"quarter": "CLASS_DATA",
"classes": [{"department": "AAA", "courseNum": "1"}, {"department": "AAA", "courseNum": "2"}] 
}

I would like more details about the response format that fits the front-end 's needs well.

[...]

I couldn't figure out why it's easier to process with one class per request. For front end, we will send an array of promises and wait_all, otherwise it will be an overkill for performance. For back end, we will have exponential number of requests with more users since they will add more and more classes.

https://restfulapi.net/rest-api-design-tutorial-with-example/

P.S. about row-constructor performance, according to this page, it seems that as long as we create proper index for composite keys, there should be no huge difference. And it should be safe to conclude that doing the same thing with multiple queries will be slower than with one query.

Finally I came up with a solution that works! We just need to support 2 HTTP methods for "class_data", the GET with query strings returns one class. the POST with a json array returns an array. And we could potentially hard-code some upper limit about the post request and may even check the duplications inside. I'm looking forward to your input!

ucsdscheduleplanner / UCSD-Schedule-Planner

#86 Decoupling the scraper from the backend and editing the scraper to make it more versatile with different quarters #92