openzim / kolibri

Convert a Kolibri channel into ZIM file(s)
GNU General Public License v3.0

Report scraper progress #52

Open benoit74 opened 1 year ago

benoit74 commented 1 year ago

Add support for generating the task_progress.json file, so that progress can be reported by Zimfarm workers and displayed in the Zimfarm UI.

githyuvi commented 6 months ago

I went through this issue, and this is what I think needs to be implemented.

After going through this code:

    def populate_nodes_executor(self):
        """Loop on content nodes to create zim entries from kolibri DB"""

        def schedule_node(item):
            future = self.nodes_executor.submit(self.add_node, item=item)
            self.nodes_futures.add(future)

        # schedule root-id
        schedule_node((self.db.root["id"], self.db.root["kind"]))

        # fill queue with (node_id, kind) tuples for all root node's descendants
        for node in self.db.get_node_descendants(self.root_id):
            if self.node_ids is None or node["id"] in self.node_ids:
                schedule_node((node["id"], node["kind"]))

I suppose I should track self.nodes_futures. Let me know if I am on the right track.

benoit74 commented 6 months ago

Yes, plus the videos_futures; videos_futures are particularly important since they are populated when a video needs re-encoding, and quite often this takes way longer than node processing.

However, I suspect this multiprocessing code is significantly broken, see https://github.com/openzim/kolibri/issues/106

I suspect we will not use these methods anymore in the future, or at least we will most probably consolidate the multiprocessing logic in a shared module.

I don't know if it is really convenient to implement this scraper progress feature now, given that it might be difficult to debug due to the other issue, and the functions might change in the future.
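
For reference, here is a minimal sketch of the tracking idea discussed above: count every future that gets scheduled (nodes and videos alike) and rewrite a progress file each time one completes. Everything in it is an assumption for illustration, not code from the scraper: `ProgressTracker`, `track` and `demo` are hypothetical names, the two executors merely stand in for `nodes_executor` and a video executor, and the `{"done": ..., "total": ...}` layout of task_progress.json is assumed rather than confirmed in this thread.

```python
import json
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


class ProgressTracker:
    """Hypothetical helper: count scheduled vs. completed futures."""

    def __init__(self, path):
        self.path = Path(path)
        self.total = 0
        self.done = 0
        self._lock = threading.Lock()

    def track(self, future):
        """Register a future; its completion bumps `done` and rewrites the file."""
        with self._lock:
            self.total += 1
        # Registered outside the lock: if the future is already finished,
        # the callback runs immediately in this thread and takes the lock itself.
        future.add_done_callback(self._on_done)
        return future

    def _on_done(self, _future):
        with self._lock:
            self.done += 1
            self._write()

    def _write(self):
        # Assumed file layout: {"done": <int>, "total": <int>}
        payload = {"done": self.done, "total": self.total}
        self.path.write_text(json.dumps(payload))


def demo():
    """Schedule a few dummy tasks on two executors and return the final progress."""
    progress_file = Path(tempfile.mkdtemp()) / "task_progress.json"
    tracker = ProgressTracker(progress_file)
    with ThreadPoolExecutor(max_workers=2) as nodes_executor, \
            ThreadPoolExecutor(max_workers=1) as videos_executor:
        for i in range(5):  # stand-ins for content nodes
            tracker.track(nodes_executor.submit(pow, i, 2))
        for i in range(2):  # stand-ins for video re-encoding jobs
            tracker.track(videos_executor.submit(pow, i, 3))
    # Leaving the `with` block waits for all submitted tasks to finish.
    return json.loads(progress_file.read_text())
```

In the scraper, the equivalent of `tracker.track(...)` would wrap each `submit` call that currently feeds `self.nodes_futures` and `videos_futures`. Because `total` grows while nodes are still being scheduled, the reported ratio can move backwards early in a run; counting the nodes up front would avoid that, at the cost of an extra pass over the DB.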