oss-know / airflow-jobs

Apache License 2.0
6 stars, 19 forks

Improve the performance of github profile init by parallelism #179

Closed crystaldust closed 1 year ago

crystaldust commented 1 year ago

Currently the github profiles init is done serially, sending HTTP requests to the timeline/commit API one after another. Use an executor pool to send multiple requests in parallel for better performance.
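A minimal sketch of the executor-pool idea, assuming a per-user fetch function. `fetch_profile` and `fetch_profiles` are hypothetical names, and the stand-in body replaces the real HTTP call; this is not the project's actual code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_profile(login):
    # Hypothetical stand-in for the real HTTP request to the GitHub API.
    return {"login": login}


def fetch_profiles(logins, max_workers=8):
    """Fetch profiles concurrently and return {login: profile or None}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every request up front so they can run in parallel.
        futures = {pool.submit(fetch_profile, login): login for login in logins}
        for future in as_completed(futures):
            login = futures[future]
            try:
                results[login] = future.result()
            except Exception:
                # One failed request does not abort the remaining ones.
                results[login] = None
    return results
```

`max_workers` probably needs tuning against the API rate limit; 8 is an arbitrary value for illustration.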

crystaldust commented 1 year ago

Update: currently, fetching/updating profiles is done by custom multithreading. Test whether it's better to use an executor pool instead.

crystaldust commented 1 year ago

Update 2: Currently a custom Thread class is defined that exposes a return value through a getResult() method. The instance attribute result is not initialized in __init__(), but assigned later in run(). So when getResult() is called after join() and run() has not actually finished (most likely because it did not finish within the timeout of join()), the code complains that the thread instance has no attribute result, which breaks the whole task.
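One way to avoid the missing-attribute error, sketched under the assumption that the custom class looks roughly like this (`ResultThread`, `func` and `args` are illustrative names, not the ones in the repository): initialize `result` in `__init__()` so it exists even if `run()` never assigns it.

```python
import threading


class ResultThread(threading.Thread):
    """Thread that keeps the return value of its target function."""

    def __init__(self, func, args=()):
        super().__init__()
        self.func = func
        self.args = args
        # Defined up front, so get_result() never raises AttributeError.
        self.result = None

    def run(self):
        self.result = self.func(*self.args)

    def get_result(self):
        # Returns None if run() has not finished (or has not started) yet.
        return self.result
```

With this change a caller still has to treat `None` as "no result yet", but the task no longer crashes on attribute access.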

Another point: join() is called right after start(), so the threads essentially run serially. That's why there is multithreading, yet it doesn't perform as expected.
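The difference can be illustrated with plain `threading.Thread`. This is a sketch with hypothetical helper names (`run_serial`, `run_parallel`, `elapsed`), not the code from the repository: joining inside the start loop waits for each thread before the next one begins, while starting all threads first lets them overlap.

```python
import threading
import time


def run_serial(tasks):
    # start() immediately followed by join(): each thread must finish
    # before the next one is even created.
    for task in tasks:
        t = threading.Thread(target=task)
        t.start()
        t.join()


def run_parallel(tasks):
    # Start every thread first, then wait for all of them.
    threads = [threading.Thread(target=task) for task in tasks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def elapsed(runner, tasks):
    # Wall-clock seconds taken by runner(tasks).
    begin = time.perf_counter()
    runner(tasks)
    return time.perf_counter() - begin
```

For I/O-bound tasks such as HTTP requests, `elapsed(run_parallel, tasks)` should come out close to the duration of the slowest task, whereas `elapsed(run_serial, tasks)` is roughly the sum of all of them.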

concurrent_threads is used in a genuinely parallel way when initializing issue_comments and issue_timeline, but it can be replaced entirely by the built-in ThreadPoolExecutor.

crystaldust commented 1 year ago

Solved by #181