openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

Provide an estimation of progress #539

Open kelson42 opened 5 years ago

kelson42 commented 5 years ago

We need a way to know roughly what percentage of the whole scraping has already been done.

automactic commented 5 years ago

I strongly support implementing this. It is going to help zimfarm a lot if we can get the current progress. Do we have any idea how this is going to be implemented?

ISNIT0 commented 5 years ago

Probably using something like this... https://github.com/vadimdemedes/ink

automactic commented 5 years ago

Can this info be retrieved programmatically in a container environment?

ISNIT0 commented 5 years ago

Not using ink specifically, but we could implement some kind of API.

ISNIT0 commented 5 years ago

@automactic Is the existing percentage log good enough for this?

automactic commented 5 years ago

No, I don't think so. I would prefer a way to retrieve progress proactively, rather than passively waiting for a progress message to show up.

kelson42 commented 5 years ago

@automactic OK, that sounds a bit more complicated than I thought. What kind of technology do you have in mind to achieve this?

kelson42 commented 5 years ago

@automactic @ISNIT0 What do you think about using something like https://www.zerorpc.io/ over a socket in /var/run/?

ISNIT0 commented 5 years ago

@automactic I think a progress API is a bit out of scope for MWOffliner. Would it be possible to grep/match logs from MWO? I'm happy to reformat the logs to be more machine-processable.

kelson42 commented 5 years ago

@rgaudin Do you have any past experience with the solution proposed by @ISNIT0?

automactic commented 5 years ago

I think parsing logs is not the best solution and could be flaky. For example, the container might generate a lot of logs, so the batch of logs being fetched may happen to contain no progress info.

How about setting the progress in Redis? The zimfarm worker would periodically GET the progress key stored in Redis. If that makes sense, we could also provide more detailed, stage-based progress.

automactic commented 5 years ago

The zimfarm worker could also listen for key changes in Redis, so users could be notified of events in mwoffliner.

rgaudin commented 5 years ago

I think there are two different things to consider: calculating the progress inside mwoffliner, and reporting that progress to the outside.

Both should be somewhat independent and tackled separately.

The first one is a mechanism to calculate the effort and report on progress towards this effort. This is internal to mwoffliner and should be available in the logs somehow (could be a periodic print in the log).
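A minimal sketch of that first part, to make the idea concrete. Everything in it is an assumption for illustration: mwoffliner itself is not written in Python, and the `ProgressTracker` name, the 30-second interval, and the `progress: done/total (NN%)` line format are hypothetical, not an existing log format.

```python
import time


class ProgressTracker:
    """Counts finished work units and emits a log line at most once per interval."""

    def __init__(self, total, interval=30.0, clock=time.monotonic, emit=print):
        self.total = total          # estimated number of work units (the "effort")
        self.done = 0               # units completed so far
        self.interval = interval    # minimum seconds between two log lines
        self._clock = clock         # injectable for testing
        self._emit = emit           # injectable sink; defaults to stdout
        self._last_emit = None      # no line emitted yet

    def percent(self):
        """Integer percentage, clamped to 100 in case the estimate was too low."""
        if self.total <= 0:
            return 0
        return min(100, self.done * 100 // self.total)

    def advance(self, count=1):
        """Record completed units and log if the interval elapsed or work is finished."""
        self.done += count
        now = self._clock()
        finished = self.done >= self.total
        due = self._last_emit is None or now - self._last_emit >= self.interval
        if due or finished:
            self._last_emit = now
            # Hypothetical line format, chosen to be easy to match in logs.
            self._emit(f"progress: {self.done}/{self.total} ({self.percent()}%)")
```

The scraper would call `advance()` each time a unit of work (for example an article) is finished; throttling by time keeps the log readable on large wikis.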

The second one, which of course depends on the first one (we need the calculation), should be implemented in a way that can be duplicated on other scrapers. That excludes Redis. I think we could introduce a super simple API to which scrapers would report progress.

That interface/API could be HTTP or socket-based (zerorpc?). What matters most here is how simple it is to implement calls to that API in all of the scrapers (i.e. using their various technologies).

--report-progress-to /var/run/zimfarm_mwoffliner_xxx.sock

automactic commented 5 years ago

"That excludes redis"

That is a good point, forget what I said.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

kelson42 commented 3 years ago

We have an agreement that mwoffliner should be given a JSON file path, which it updates with its progress. This file is then read by the Zimfarm worker (or any other process), which reports the progress.
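For the reading side, a rough sketch of what a worker could do with such a file. It assumes the file holds a JSON object with `done` and `total` counters; the `read_progress` helper is hypothetical and is not taken from the Zimfarm code.

```python
import json


def read_progress(path):
    """Return progress as a percentage (0.0 to 100.0), or None if unavailable."""
    try:
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
    except (OSError, ValueError):
        # File not created yet, or unreadable/invalid JSON: report "unknown" for now.
        return None
    if not isinstance(data, dict):
        return None
    done, total = data.get("done"), data.get("total")
    numbers = (int, float)
    if not isinstance(done, numbers) or not isinstance(total, numbers) or total <= 0:
        return None
    # Clamp, since "total" is only an estimate and may be overshot.
    return max(0.0, min(100.0, 100.0 * done / total))
```

Returning `None` rather than raising lets the worker poll on a timer and simply skip a round when the scraper has not written anything yet.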

kelson42 commented 3 years ago

See https://github.com/openzim/zimfarm/issues/331, to be implemented in the next few days.

rgaudin commented 3 years ago

The expected format of this JSON file is:

{"done": 1, "total": 32}

Its name should be passed to an option enabling that feature and, for the zimfarm, we'll place it in the output directory (what we mount as a volume). In zimit, we allow passing either an absolute path or a relative one; in the latter case we create it in the output dir.
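A sketch of the writing side under those requirements, in Python for illustration only (mwoffliner is not a Python project). The helper names `resolve_stats_path` and `write_progress` are hypothetical, and the write-to-temp-then-rename step is an extra assumption, added so a reader polling the file is less likely to see a half-written document.

```python
import json
import os
import tempfile


def resolve_stats_path(filename, output_dir):
    """Absolute paths are used as given; relative ones land in the output dir."""
    if os.path.isabs(filename):
        return filename
    return os.path.join(output_dir, filename)


def write_progress(path, done, total):
    """Write {"done": ..., "total": ...} to path via a temporary file and a rename."""
    directory = os.path.dirname(path) or "."
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump({"done": done, "total": total}, handle)
        os.replace(tmp_path, path)  # swaps the new content in as a single step
    except BaseException:
        # Do not leave the temporary file behind if anything failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Usage would look like `write_progress(resolve_stats_path("progress.json", "/output"), 1, 32)`, where both the file name and the `/output` directory are placeholders.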

kelson42 commented 3 years ago

@rgaudin Concretely, what would you propose in terms of a command-line option?

rgaudin commented 3 years ago

In zimit, we have --statsFilename but it doesn't matter much as long as the other mentioned requirements are met.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

glems2 commented 1 year ago

This feature would be a true benefit, especially considering there is no resume functionality!