Open · benoit74 opened this issue 1 year ago
I went through this issue, and after going through the code below, this is what I suppose needs to be implemented:
```python
def populate_nodes_executor(self):
    """Loop on content nodes to create zim entries from kolibri DB"""

    def schedule_node(item):
        future = self.nodes_executor.submit(self.add_node, item=item)
        self.nodes_futures.add(future)

    # schedule root-id
    schedule_node((self.db.root["id"], self.db.root["kind"]))

    # fill queue with (node_id, kind) tuples for all root node's descendants
    for node in self.db.get_node_descendants(self.root_id):
        if self.node_ids is None or node["id"] in self.node_ids:
            schedule_node((node["id"], node["kind"]))
```
I suppose I should track `self.nodes_futures`. Let me know if I am on the right track.
Yes, plus the `videos_futures`; `videos_futures` are particularly important since they are populated when a video needs re-encoding, and quite often that takes far longer than node processing.
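A minimal sketch of what that tracking could look like, assuming both `self.nodes_futures` and `self.videos_futures` are plain sets of `concurrent.futures.Future` objects as `populate_nodes_executor` suggests (the `get_progress` helper is a made-up name for illustration, not existing code):

```python
def get_progress(self):
    """Return (done, total) across node and video futures.

    Note: the total keeps growing while futures are still being
    scheduled, so a percentage derived from it can temporarily drop.
    """
    futures = self.nodes_futures | self.videos_futures
    done = sum(1 for future in futures if future.done())
    return done, len(futures)
```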
However, I suspect the scraper's multiprocessing code is significantly broken; see https://github.com/openzim/kolibri/issues/106.
I suspect we will not use these methods anymore in the future, or at least we will most probably mutualise the multiprocessing logic in a shared module.
I don't know whether it is really worth implementing this scraper progress feature now, given that it might be difficult to debug due to the other issue, and the functions might change in the future.
Add support to generate the `task_progress.json` file, so that it can be reported by Zimfarm workers and displayed in the Zimfarm UI.
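For completeness, a sketch of how the scraper could periodically dump such a file. The flat `{"done": ..., "total": ...}` layout and the `get_progress()` helper from the sketch above are assumptions; the exact schema Zimfarm workers expect would need to be confirmed first:

```python
import json
import threading
from pathlib import Path


def start_progress_reporter(scraper, path=Path("task_progress.json"), interval=10.0):
    """Dump scraper progress to a JSON file every `interval` seconds.

    Assumes `scraper` exposes the hypothetical get_progress() helper
    sketched earlier; the {"done": ..., "total": ...} layout is a guess
    at what Zimfarm workers parse, not a confirmed schema.
    """

    def _dump():
        done, total = scraper.get_progress()
        path.write_text(json.dumps({"done": done, "total": total}))
        timer = threading.Timer(interval, _dump)
        timer.daemon = True  # do not keep the process alive for reporting
        timer.start()

    _dump()
```

If this route is taken, writing to a temporary file and renaming it over `task_progress.json` would avoid a worker occasionally reading a half-written file.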