openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

Filter media by file size #1767

Open uriesk opened 1 year ago

uriesk commented 1 year ago

Add --maxAudioSize and --maxVideoSize options to ignore large media files but still scrape the smaller ones.
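
As a rough sketch of the proposed behaviour (the option names above are the proposal itself; the helper below is purely illustrative and not existing mwoffliner code):

```ts
// Purely illustrative filter for the proposed --maxVideoSize / --maxAudioSize
// options. Sizes are in bytes; an unset limit means "keep everything".
interface MediaSizeLimits {
  maxVideoSize?: number;
  maxAudioSize?: number;
}

function shouldKeepMedia(
  kind: 'video' | 'audio',
  sizeInBytes: number,
  limits: MediaSizeLimits,
): boolean {
  const limit = kind === 'video' ? limits.maxVideoSize : limits.maxAudioSize;
  return limit === undefined || sizeInBytes <= limit;
}

// Example: drop videos above 20 MiB and audio above 10 MiB.
const limits: MediaSizeLimits = { maxVideoSize: 20 * 1024 ** 2, maxAudioSize: 10 * 1024 ** 2 };
shouldKeepMedia('video', 59 * 1024 ** 2, limits); // false: too large, skip it
```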

Background

I analyzed the recent wikipedia_en_top1m build, which included all media files and reached 200 GB. The result is that we could include a very large set of media files in maxi scrapes at a reasonable storage cost. The full analysis is here: https://gist.github.com/uriesk/a1edcbaf2f2194cffddc71b26beaaa45

Resulting size of 10760 videos: 124 GiB, out of which the largest 200 files are 59 GiB.

If we remove all video files that are >20 MiB, we have 9767 files left (90.7%) and have a remaining size of 27 GiB.

The median video file size is 1.6 MiB, and the smaller 50% of video files have a combined size of 3.2 GiB.

Resulting size of 26691 audio files: 38 GiB, out of which the largest 200 files are 9.4 GiB.

If we remove all audio files that are >10 MiB, we have 25667 files left (96.2%) with a remaining size of 16 GiB. Or, if we remove those >3.2 MiB, we have 90% of the files remaining with 6.6 GiB.
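
For reference, these numbers can be recomputed from a plain list of file sizes like the one in the linked gist; a minimal sketch:

```ts
// Given a list of media file sizes in bytes, report how many files and how
// much data remain after dropping everything above a cutoff.
function filterStats(sizesInBytes: number[], cutoffBytes: number) {
  const kept = sizesInBytes.filter((size) => size <= cutoffBytes);
  const keptBytes = kept.reduce((sum, size) => sum + size, 0);
  return {
    keptFiles: kept.length,
    keptShare: kept.length / sizesInBytes.length,
    keptGiB: keptBytes / 1024 ** 3,
  };
}

// e.g. filterStats(videoSizes, 20 * 1024 ** 2) with the sizes from the gist
// should reproduce the "9767 files / 27 GiB" figure above.
```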

kelson42 commented 1 year ago

@uriesk Your ticket tackles a topic which has barely been discussed in the past, so I have no strong opinion on this. Things will probably need to mature a bit as well.

First, 200 GB here is not an absolute blocker. That said, I agree that optimising audio and video is ultimately important... for exactly the same reason as for pictures.

In general, I'm not in favour of changing or extending the nopic, novid, ... formats. The reason is that this classification has impacts everywhere.

To reduce storage/bandwidth consumption, I prefer we do the same as for pictures:

... doing so also implies using the S3 cache...

uriesk commented 1 year ago

@kelson42 The issue that I see is that if you advertise a ZIM file as including videos, whoever downloads it expects all videos to be inside, not just some seemingly random subset.

Another issue is the implementation: to know the file size, you have to request the file and cancel the request after receiving the headers. I did that for all those files (I didn't download the 200 GB) and it was done in 3 hours, which isn't much additional time for a scrape that already takes multiple days, but maybe upstream does not like it that way.
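
A rough sketch of that header-only size check (using the global fetch of recent Node.js; mwoffliner's actual downloader is not shown here):

```ts
// Read a remote file's size from its headers without downloading the body.
// A HEAD request is the cheapest; an aborted GET covers servers that do not
// answer HEAD properly.
async function remoteFileSize(url: string): Promise<number | undefined> {
  const head = await fetch(url, { method: 'HEAD' });
  const fromHead = head.headers.get('content-length');
  if (fromHead) return Number(fromHead);

  // Fallback: start a GET, read the headers, then abort before the body.
  const controller = new AbortController();
  const response = await fetch(url, { signal: controller.signal });
  const fromGet = response.headers.get('content-length');
  controller.abort();
  return fromGet ? Number(fromGet) : undefined;
}
```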

But overall I think it is worth it. If I can add 50% of all videos to wikipedia_en_all_maxi for a mere 10 GiB extra? I would rather download that than a novid scrape. And Wikipedia hosts full-length movies, so scraping either all or nothing seems weird.

I will look into this when I have time and might implement it.

Edit: I think playing with the quality of videos won't get us far. Most videos are already downloaded as VP9-encoded <300p versions, and re-encoding them would be an enormous task.

kelson42 commented 1 year ago

@kelson42 The issue that I see is that if you advertise a ZIM file as including videos, whoever downloads it expects all videos to be inside, not just some seemingly random subset.

Indeed, this is why the least surprising approach is to remove none of them.

Another issue is the implementation: to know the file size, you have to request the file and cancel the request after receiving the headers. I did that for all those files (I didn't download the 200 GB) and it was done in 3 hours, which isn't much additional time for a scrape that already takes multiple days, but maybe upstream does not like it that way.

I would not worry about that. In particular they should be in the cache anyway.

But overall I think it is worth it. If I can add 50% of all videos to wikipedia_en_all_maxi for a mere 10 GiB extra? I would rather download that than a novid scrape.

Actually, are the videos in full resolution or in thumb resolution?

holta commented 1 year ago

If I can add 50% of all videos to wikipedia_en_all_maxi for a mere 10 GiB extra? I would rather download that than a novid scrape.

Yes!

WMF (Wikimedia Foundation) should also provide us statistics as to which videos (& audio clips) are demonstrably more compelling.

Obviously popularity is very different from quality/importance (sometimes they're even inversely correlated, as with viral tabloid material), but this is still vital data WMF might be able to provide us ❓

uriesk commented 1 year ago

@kelson42 If I understand it correctly, it looks at the available sources and chooses the one that best matches the element size. Most videos are in small resolutions. This is the list of all webm files and their sizes in bytes: https://gist.github.com/uriesk/a1edcbaf2f2194cffddc71b26beaaa45/raw/dd55af85e2a28ad5170e5a61e68e444a2b5dc3e7/top1m_webm.txt

The filename gives it away most of the time: 240p.vp9.webm, 120p.vp9.webm.

uriesk commented 1 year ago

I think playing with the quality of videos won't get us far. Most videos are already downloaded as VP9-encoded <300p versions, and re-encoding them would be an enormous task.

I take this back. About half of the videos are in VP8 and would benefit from being re-encoded in VP9. Are we powerful enough to handle that? Encoding hundreds of GB of video?

kelson42 commented 1 year ago

I think playing with the quality of videos won't get us far. Most videos are already downloaded as VP9-encoded <300p versions, and re-encoding them would be an enormous task.

I take this back. About half of the videos are in VP8 and would benefit from being re-encoded in VP9. Are we powerful enough to handle that? Encoding hundreds of GB of video?

Interesting challenge. But we would do that only once per file... Worth the challenge. If they are in thumb format, this is not that much data.

A compromise might be to truncate after a certain time and display a simple message, something like "We had to truncate this video because it was too long. The full version is available online."
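
A rough sketch of what such a one-off re-encode (with an optional cut-off) could look like, assuming an ffmpeg build with libvpx-vp9 and libopus is available on the scraping host; this is only an illustration, not something mwoffliner does today:

```ts
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const execFileAsync = promisify(execFile);

// Re-encode a VP8 webm to VP9/Opus and optionally cut it after maxSeconds.
async function reencodeToVp9(input: string, output: string, maxSeconds?: number): Promise<void> {
  const args = ['-i', input, '-c:v', 'libvpx-vp9', '-b:v', '0', '-crf', '34', '-c:a', 'libopus'];
  if (maxSeconds !== undefined) {
    args.push('-t', String(maxSeconds)); // truncate the output after maxSeconds
  }
  args.push('-y', output);
  await execFileAsync('ffmpeg', args);
}
```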

uriesk commented 1 year ago

Choosing the right codec is a challenge in itself. I picked a random video. Available sources:

[screenshot of the available source files and resolutions]

mwoffliner chose the VP8 160p.webm one. The lowest-bandwidth option is the VP9 120p.vp9.webm, but the VP9 closest in resolution to the chosen VP8 is 180p.vp9.webm.

File sizes are:

4.3M  USA_Sniper_School.ogv.120p.vp9.webm
4.6M  USA_Sniper_School.ogv.160p.webm
6.1M  USA_Sniper_School.ogv.180p.vp9.webm

And it's the same with most VP8 videos in the list: they are usually the second-smallest option (by resolution), with the one below and the one above being VP9, each with about a 10% size difference.

We could make mwoffliner always choose the lowest-bandwidth option.
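
A minimal sketch of that rule, assuming the source elements carry the data-bandwidth attribute shown in Wikimedia's player markup (this is not mwoffliner's current selection logic; the Element type is whatever DOM implementation the scraper uses):

```ts
// Pick the <source> child of a <video>/<audio> element with the smallest
// declared bandwidth. Sources without a usable data-bandwidth are skipped.
function pickLowestBandwidthSource(media: Element): Element | undefined {
  let best: Element | undefined;
  let bestBandwidth = Infinity;
  for (const source of Array.from(media.querySelectorAll('source'))) {
    const bandwidth = Number(source.getAttribute('data-bandwidth'));
    if (Number.isFinite(bandwidth) && bandwidth > 0 && bandwidth < bestBandwidth) {
      bestBandwidth = bandwidth;
      best = source;
    }
  }
  return best;
}
```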

uriesk commented 1 year ago

For this one, which is a different kind of element, mwoffliner chose the 240p.vp9.webm, while the available sizes are:

140K    Salt_Bae.webm.120p.vp9.webm
260K    Salt_Bae.webm.240p.vp9.webm
224K    Salt_Bae.webm.240p.webm

It's interesting that Wikimedia is encoding the VP9 with higher bitrates than the VP8.

uriesk commented 1 year ago

The bitrate is in the source element as data-bandwidth, and the duration is in the video element as data-durationhint, so we actually already know the file size from the HTML. The same holds for audio files.

I think this will be far easier than I imagined it to be.
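
A minimal sketch of that estimate, assuming data-bandwidth is in bits per second and data-durationhint in seconds (the exact units should be verified against the MediaWiki player output):

```ts
// Estimate a media file's size from attributes already present in the page
// HTML, so no extra HTTP request is needed.
// Assumption: data-bandwidth is bits per second, data-durationhint is seconds.
function estimateSizeBytes(source: Element, media: Element): number | undefined {
  const bitsPerSecond = Number(source.getAttribute('data-bandwidth'));
  const seconds = Number(media.getAttribute('data-durationhint'));
  if (!Number.isFinite(bitsPerSecond) || !Number.isFinite(seconds)) return undefined;
  return (bitsPerSecond * seconds) / 8; // bits -> bytes
}
```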

holta commented 1 year ago

A compromise might be to truncate after a certain time and display a simple message, something like "We had to truncate this video because it was too long. The full version is available online."

Educators will be horrified, but yes. Until WMF provides us a feed of (ALMOST ANY!) statistics about video usage / quality / importance, this first-order strategy (chop, chop, chop) might in fact work decently, at least initially.

(Whereas in future years, video highlights [salient clips auto-extracted from longer video reels] might also be auto-identified and auto-extracted [e.g. if we're able to work with an experienced video mass-production house?], building on WMF stats that point towards which videos are truly more valuable.)

holta commented 1 year ago

Truncating videos longer than ~60 seconds is one approach:

truncate after a certain time

Or, if WMF (Wikimedia Foundation) cannot provide us insightful video stats (can they?), then eliminating videos from articles (pages) with a low rank/score might work well enough, and immediately:

As we already know every article page's rank/score.

BACKGROUND: Every English Wikipedia article (page) has a rank/score calculated based on its:

1) Traffic/usage
2) Wikipedia community's quality rating
3) Wikipedia community's importance rating

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.