CTrando opened this issue 5 years ago
Make sure to work on a separate branch when you start doing the golang development. Don't worry about the frontend for now; if the request doesn't have a quarter, default it to the latest one.
For the "latest one", since the backend and scraper do not share the same config currently, I just created another config entry for the backend. Don't know if that works
For the current error:
sdschedule-scraper | Traceback (most recent call last):
sdschedule-scraper | File "./webreg_scrape_upload.py", line 51, in <module>
sdschedule-scraper | main()
sdschedule-scraper | File "./webreg_scrape_upload.py", line 23, in main
sdschedule-scraper | department_scraper = DepartmentScraper()
sdschedule-scraper | File "/app/scraper_impl/department_scraper.py", line 20, in __init__
sdschedule-scraper | self.browser = webdriver.Chrome(chrome_options=options, executable_path=DRIVER_PATH)
sdschedule-scraper | File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/chrome/webdriver.py", line 75, in __init__
sdschedule-scraper | desired_capabilities=desired_capabilities)
sdschedule-scraper | File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 154, in __init__
sdschedule-scraper | self.start_session(desired_capabilities, browser_profile)
sdschedule-scraper | File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 243, in start_session
sdschedule-scraper | response = self.execute(Command.NEW_SESSION, parameters)
sdschedule-scraper | File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 311, in execute
sdschedule-scraper | self.error_handler.check_response(response)
sdschedule-scraper | File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 237, in check_response
sdschedule-scraper | raise exception_class(message, screen, stacktrace)
sdschedule-scraper | selenium.common.exceptions.SessionNotCreatedException: Message: session not created: Chrome version must be >= 69.0.3497.0
sdschedule-scraper | (Driver info: chromedriver=72.0.3626.69 (3c16f8a135abc0d4da2dff33804db79b849a7c38),platform=Linux 4.9.0-8-amd64 x86_64)
sdschedule-scraper |
I did some searching and realized that this is a problem with the versions of Chrome and chromedriver.
The driver in the driver directory doesn't match the Chrome version installed in the container.
And I don't quite understand the reason for using this image.
We might need to prune the requirements.txt a little bit. I'm trying to figure it out.
Still working on pruning.
It's hard to test since each scrape takes at least 20 minutes.
Maybe we want a feature where the scraper only scrapes one department, for testing.
I don't know if the new Dockerfile and requirements.txt are good.
Correct me if I'm wrong: it seems that the code creates a new db connection each time data is requested, which should not be the case, right?
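Roughly what I'd expect instead, as a sketch only (assuming database/sql with the go-sql-driver/mysql driver; the constructor, DSN, and the fields on DatabaseStruct are placeholders, not the actual PR code). *sql.DB is itself a connection pool, so it should be opened once at startup and shared by every handler:

package db

import (
    "database/sql"

    _ "github.com/go-sql-driver/mysql" // MySQL driver registration
)

// DatabaseStruct wraps the one shared connection pool.
type DatabaseStruct struct {
    DB *sql.DB
}

// New opens the pool once; handlers reuse it instead of reconnecting per request.
func New(dsn string) (*DatabaseStruct, error) {
    pool, err := sql.Open("mysql", dsn) // validates the DSN, does not connect yet
    if err != nil {
        return nil, err
    }
    if err := pool.Ping(); err != nil { // fail fast if MySQL is unreachable
        return nil, err
    }
    return &DatabaseStruct{DB: pool}, nil
}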
I can take a look at this
I tried my best to use design patterns, including fail-fast and a singleton for the DB, and to follow some Go conventions.
I also added some comments that follow the godoc conventions.
Right now the logging and error checking are not graceful; we might consider some helper functions for them (or another closure).
I don't know why the original error-checking if statements do not have a "return"; will the function terminate properly?
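For what it's worth, the return does matter: http.Error only writes the error response, it does not stop the handler. A minimal illustration (the query comment is just a stand-in for whatever the handler does next):

package routes

import "net/http"

func handler(writer http.ResponseWriter, request *http.Request) {
    if request.Method != "GET" {
        http.Error(writer, "method not allowed", http.StatusMethodNotAllowed)
        return // without this, execution falls through and the code below still runs
    }
    // ... query the database and write the normal response ...
}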
About the panic and logging: I don't think our server should crash just because one query failed. And I put all the error-handling and logging logic in the route package.
The tests do not work yet since I changed the signature of the handlers.
The code compiles, but since I don't have a local MySQL instance I will check later.
I am confused about how you tested the code previously. I believe tests should not use a real database.
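One way to avoid a real database, just as a sketch (this assumes pulling in github.com/DATA-DOG/go-sqlmock as a dependency; how it gets wired into the handler is left hypothetical):

package test

import (
    "testing"

    sqlmock "github.com/DATA-DOG/go-sqlmock"
)

func TestGetDepartmentsWithoutRealDB(t *testing.T) {
    mockDB, mock, err := sqlmock.New()
    if err != nil {
        t.Fatal(err)
    }
    defer mockDB.Close()

    rows := sqlmock.NewRows([]string{"DEPT_CODE"}).AddRow("CSE").AddRow("ECE")
    mock.ExpectQuery("SELECT DISTINCT DEPT_CODE FROM DEPARTMENT").WillReturnRows(rows)

    // Hand mockDB to the handler under test (e.g. wrapped in db.DatabaseStruct)
    // instead of a real MySQL connection, then assert on the recorded response.
}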
I fixed the tests except one; this is the output:
go test -v ./...
=== RUN TestGetClassData
--- FAIL: TestGetClassData (0.00s)
GetClassData_test.go:73: json: cannot unmarshal object into Go value of type []routes.Subclass
GetClassData_test.go:77: Subclasses for CSE 11 should definitely not be 0!
=== RUN TestGetRequestClassDataFails
--- PASS: TestGetRequestClassDataFails (0.00s)
=== RUN TestGetDepartmentSummary
--- PASS: TestGetDepartmentSummary (0.00s)
=== RUN TestGetDepartmentSummaryFailsOnPost
--- PASS: TestGetDepartmentSummaryFailsOnPost (0.00s)
=== RUN TestGetDepartments
--- PASS: TestGetDepartments (0.00s)
=== RUN TestGetDepartmentsOnPostFails
--- PASS: TestGetDepartmentsOnPostFails (0.00s)
=== RUN TestGetInstructors
--- PASS: TestGetInstructors (0.00s)
=== RUN TestGetTypes
--- PASS: TestGetTypes (0.00s)
FAIL
FAIL github.com/ucsdscheduleplanner/UCSD-Schedule-Planner/backend/test 0.314s
I did not change the code for getClassData at all, but the JSON decoding does not seem to be working.
Brief:
Sample code:
const logTagDepartment = "[Department]"
// GetDepartments is a pre-http.HandlerFunc for department route and will become a closure with *DatabaseStruct
func GetDepartments(writer http.ResponseWriter, request *http.Request, ds *db.DatabaseStruct) {
    if request.Method != "GET" {
        errInvalidMethod(logTagDepartment, writer, request)
        return
    }

    queryAndResponse(
        ds, logTagDepartment, writer, request,
        rowScannerOneString,
        "SELECT DISTINCT DEPT_CODE FROM DEPARTMENT",
        "DEPARTMENT")
}
Two function calls are exactly what is needed.
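For context, the closure step could look roughly like this; the MakeHandler name and the db import path are guesses based on the test output, not the actual PR code:

package routes

import (
    "net/http"

    "github.com/ucsdscheduleplanner/UCSD-Schedule-Planner/backend/db" // path inferred from the test package
)

// MakeHandler binds a *db.DatabaseStruct to a pre-handler, producing a plain http.HandlerFunc.
func MakeHandler(ds *db.DatabaseStruct,
    pre func(http.ResponseWriter, *http.Request, *db.DatabaseStruct)) http.HandlerFunc {
    return func(writer http.ResponseWriter, request *http.Request) {
        pre(writer, request, ds)
    }
}

// Usage: http.HandleFunc("/api_department", MakeHandler(ds, GetDepartments))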
I did not change the way GetClassData queries, since I do not quite understand the reason behind using a for loop rather than one query.
I did not put too much effort into rewriting the test files (no major design principles applied), but one test is still failing.
Perhaps instead of "db", we should call that file "ds" or "Env", for the environment of the whole server-side software.
I left several TODO and TODEL markers in the code.
I'm not extremely confident about the way errors and logging are handled (for routes).
The next step is to write a Dockerfile and try to integrate it with the other components.
So if a test is failing, that is an issue, because I made the tests so the JSON format is what the frontend expects. Also, I agree that it's not great to require a database connection in order to run the tests, but these integration tests are what's needed for full confidence; unit tests won't guarantee anything.
Thank you so much for your feedback!
I will try to follow them and dive deeper! I found this SQL query interesting: https://stackoverflow.com/questions/1136380/sql-where-in-clause-multiple-columns
Let's use the username: "splanner", shall we?
I refactored things. It compiles
Next step is
This is a long pull request.
I finally figured out why the class data route test fails. The test expects a plain JSON array, but the code returns a map like this: {"ABC 11": [...], "MATH 12": [...]}
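A tiny illustration of the mismatch (Subclass here is just a stand-in for routes.Subclass; its fields don't matter for the shape problem):

package main

import (
    "encoding/json"
    "fmt"
)

type Subclass struct{} // stand-in for routes.Subclass

func main() {
    body := []byte(`{"ABC 11": [], "MATH 12": []}`)

    var asSlice []Subclass
    fmt.Println(json.Unmarshal(body, &asSlice)) // error: cannot unmarshal object into Go value of type []main.Subclass

    var asMap map[string][]Subclass
    fmt.Println(json.Unmarshal(body, &asMap)) // nil: works once the test expects a map
}

So either the handler returns a plain array or the test unmarshals into a map; whichever we pick just has to match what the frontend expects.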
Let's first reach a consensus about our API:
Ok sure, I can write something up on the contract.
That would be great! Including, but not limited to: API names (do we really need the api_ prefix? I have no idea), API format, API input format, API input validation, and cacheability.
It should work as a first draft, but some things that existed before, like caching, are not implemented yet.
About the query
The performance of a simple query always wins. However, back-and-forth database calls should generally be avoided: those are synchronous calls between components, which puts the components in series.
But of course, there is never a perfect solution that works optimally for all situations. I think I will be more confident once I read the front-end code that calls that API.
Sure, I agree it will be more performant, but I'm saying there is a tradeoff with readability and maintainability in combining all the queries into one. Doing it that way adds more edge cases and potential mishaps. That's why you should be confident in your current algorithm before switching to this, but I agree that doing it in one query is more performant.
Ok let's define the format.
We have:

api_department?quarter={QUARTER}: Returns a list of departments for the given quarter, e.g. ["CSE", "ECE", etc.]. This is called when the page initially loads to get the list of departments to show.

api_classpreview?quarter={QUARTER}&department={DEPARTMENT}: Returns a list of objects of the form {courseNum: COURSE_NUM, data: {description: DESCRIPTION, instructors: [INSTRUCTORS], types: [TYPES]}}. It's used to show the user a preview of the class; for example, it shows the user that 11 is a course number for department CSE. INSTRUCTORS and TYPES are lists of instructors and types such as [Joe Politz, Rick Ord] and [LE, DI].

api_classdata: This one is up for debate. We can either make it take in only one class (a department + courseNum pair) and return the data for just that class, or have it take in a list of pairs and return results for the entire list. I think maybe it would be better to do it as multiple small requests because it would be easier to process on the backend and frontend.
If we restrict the quarter to be identical within a single query (it's weird for people to schedule multiple quarters on one calendar), then my SQL-building function will be much simpler (just creating the correct number of question marks).
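A sketch of the building function I mean, along the lines of the row-constructor trick from that StackOverflow answer; the table and column names are placeholders, not the real schema:

package routes

import (
    "fmt"
    "strings"
)

// ClassRequest is one (department, courseNum) pair from the request.
type ClassRequest struct {
    Department string `json:"department"`
    CourseNum  string `json:"courseNum"`
}

// buildClassDataQuery builds one query for a single quarter using
// WHERE (DEPARTMENT, COURSE_NUM) IN ((?,?),(?,?),...).
func buildClassDataQuery(quarter string, classes []ClassRequest) (string, []interface{}) {
    placeholders := make([]string, 0, len(classes))
    args := []interface{}{quarter}
    for _, c := range classes {
        placeholders = append(placeholders, "(?, ?)")
        args = append(args, c.Department, c.CourseNum)
    }
    query := fmt.Sprintf(
        "SELECT * FROM CLASS_DATA WHERE QUARTER = ? AND (DEPARTMENT, COURSE_NUM) IN (%s)",
        strings.Join(placeholders, ", "))
    return query, args
}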
I was thinking that using an api.example.com domain for the APIs would be better than the "api_" prefix.
api_department?quarter={QUARTER}
[{"AAA", "BBB"}]
api_classreview?quarter={QUARTER}&department={DEPARTMENT}
[{courseNum: 99, data : {description: abyss, instructors: ["A", "B"], types: ["LE", "DI"]}}, {courseNum: 100, data : {description: void, instructors: ["A", "B"], types: ["LE", "DI"]}}]
api_classdata
POST with a list of requests
{
"quarter": "CLASS_DATA",
"classes": [{"department": "AAA", "courseNum": "1"}, {"department": "AAA", "courseNum": "2"}]
}
I would like more details about a response format that fits the front end's needs well.
[...]
I couldn't figure out why it's easier to process with one class per request. On the front end, we would have to send an array of promises and wait for all of them; otherwise it would be overkill for performance. On the back end, the number of requests will grow quickly with more users, since each user adds more and more classes.
https://restfulapi.net/rest-api-design-tutorial-with-example/
P.S. About row-constructor performance: according to this page, it seems that as long as we create a proper index for the composite keys, there should be no huge difference. And it should be safe to conclude that doing the same thing with multiple queries will be slower than with one query.
Finally, I came up with a solution that works! We just need to support two HTTP methods for "class_data": GET with query strings returns one class, and POST with a JSON array returns an array. We could potentially hard-code an upper limit on the POST request size and maybe even check for duplicates inside. I'm looking forward to your input!
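Roughly what I'm picturing, sketch only (it reuses the ClassRequest type from the earlier sketch; the limit, the helper, and the handler signature are placeholders rather than the actual PR code):

package routes

import (
    "encoding/json"
    "net/http"
)

const maxClassDataClasses = 30 // hypothetical hard-coded upper limit for one POST

func ClassDataHandler(writer http.ResponseWriter, request *http.Request) {
    switch request.Method {
    case "GET":
        // One class per request: ?quarter=...&department=...&courseNum=...
        q := request.URL.Query()
        writeClassData(writer, q.Get("quarter"),
            []ClassRequest{{Department: q.Get("department"), CourseNum: q.Get("courseNum")}})
    case "POST":
        // Many classes per request: {"quarter": "...", "classes": [{...}, {...}]}
        var body struct {
            Quarter string         `json:"quarter"`
            Classes []ClassRequest `json:"classes"`
        }
        if err := json.NewDecoder(request.Body).Decode(&body); err != nil ||
            len(body.Classes) == 0 || len(body.Classes) > maxClassDataClasses {
            http.Error(writer, "bad request", http.StatusBadRequest)
            return
        }
        writeClassData(writer, body.Quarter, body.Classes)
    default:
        http.Error(writer, "method not allowed", http.StatusMethodNotAllowed)
    }
}

// writeClassData would run the batched query (see the earlier sketch) and encode the JSON response.
func writeClassData(writer http.ResponseWriter, quarter string, classes []ClassRequest) {
    // ... query the database and json.NewEncoder(writer).Encode(...) the result ...
}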