benoit74 opened this issue 7 months ago
This occurs at least when the same video file (`local_file_id` in the Kolibri database) is used in multiple `content_file` entries, in multiple `content_node` entries. Each video has two IDs: its `id` (or `vid` in our codebase) and its `local_file_id` (or `fid` / `vfid` in our codebase).

The code takes into account that a given `content_file` `id` might be updated to point to a new `fid`. To handle this case, the code uploads to S3 using the `id`. What happens here is in fact the opposite:
Command used:

```sh
kolibri2zim --name "Khan Academy FR" --output output --channel-id 878ec2e6f88c5c268b1be6f202833cd4 --node-ids 'bccde48efc125433be95abd5f09410e3,28d7539f1bf555c9ac43e1f22636aa33,d6414c0cbb165ac185124d862925ec69' --debug --low-quality --use-webm
```
- the ffmpeg error is not displayed in the log due to poor usage of python-scraperlib; this will be easily fixed
- deduplicating the videos to process is not as straightforward, due to the `id` / `fid` confusion
@rgaudin do you remember if you ever saw a given `id` having changed its `fid`? Having experimented a bit with how Studio works, I suspect this never happens in fact. Studio often (like here) assigns the same `fid` to multiple `id`s when the editor decides to reuse the same video in multiple topics. But if the editor decides to change the video for a given topic, then a new node is created with its own `id` (and a new `fid` if it's a new video, or an already existing one otherwise).
Sorry @benoit74, but despite all this information, it's not clear. The ticket mentions an issue with the ffmpeg process. You seem to have encountered something wrong or hard to understand around IDs while looking at this, but it seems unrelated to the current issue? Can you maybe open a separate ticket?
The exact ffmpeg error is (I just fixed this part, we now have clear logs):

```
file:/tmp/tmp7vzclxvc/934dda95a398ace5acc82d05496c5a7f.mp4: No such file or directory
```
This is due to the fact that we remove the original file once it is encoded, and we try to encode the same file 3 times because 3 nodes are using the same file.
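The failure pattern can be reproduced with a minimal sketch (the `encode` helper and node names are hypothetical stand-ins, not the actual scraper code): three nodes share one source file, the source is deleted after the first encode, so later encodes of the same file fail.

```python
import os
import tempfile

def encode(src: str, dst: str) -> None:
    """Stand-in for the ffmpeg re-encode step: just copy the bytes."""
    with open(src, "rb") as f_in, open(dst, "wb") as f_out:
        f_out.write(f_in.read())

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "934dda95a398ace5acc82d05496c5a7f.mp4")
with open(src, "wb") as f:
    f.write(b"fake video bytes")

errors = []
for node in ("node-1", "node-2", "node-3"):  # 3 nodes share one source file
    try:
        encode(src, os.path.join(tmp, f"{node}.webm"))
        os.remove(src)  # the scraper removes the original once encoded
    except FileNotFoundError as exc:
        errors.append(str(exc))

print(len(errors))  # 2: the second and third nodes no longer find the source
```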
Now that's clearer. Is there any remaining question?
Since there is indeed a remaining question, I assume it is still not crystal clear 🤣
Let me take one example from Khan Academy FR.
New glossary (previous one was confusing):

- `node_id` = `id` column in `content_contentnode` table, or `contentnode_id` in `content_file` table
- `vid` = `id` column in `content_file` table
- `vfid` = `local_file_id` column in `content_file` table
- `checksum` = `checksum` column in `content_file` table
- `file_size` = `file_size` column in `content_file` table
- `extension` = `extension` column in `content_file` table

Inside the database, we encounter the kind of situation below.
| node_id | vid | vfid | checksum | extension | file_size |
|---|---|---|---|---|---|
| bccde48efc125433be95abd5f09410e3 | a4cb8b88b22841ed8630a59a4aa38784 | 934dda95a398ace5acc82d05496c5a7f | 934dda95a398ace5acc82d05496c5a7f | mp4 | 6124541 |
| 28d7539f1bf555c9ac43e1f22636aa33 | 769bcfbdb2094c7e9f03ecc192bb5d52 | 934dda95a398ace5acc82d05496c5a7f | 934dda95a398ace5acc82d05496c5a7f | mp4 | 6124541 |
| d6414c0cbb165ac185124d862925ec69 | 476f800db3a541b7a587e835c4c6778b | 934dda95a398ace5acc82d05496c5a7f | 934dda95a398ace5acc82d05496c5a7f | mp4 | 6124541 |
In other words, the same `vfid` (i.e. the same real video/file) is reused in multiple `vid`s (i.e. multiple `content_file` entries) for multiple `node_id`s (i.e. multiple `content_contentnode` entries). This is quite normal: it means the content editor decided to reuse the same video in multiple nodes / courses.
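This duplication can be spotted directly in the Kolibri sqlite database; a sketch using the table/column names from the glossary above (the in-memory data below just mimics the Khan Academy FR sample, and the real `content_file` table has more columns):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Simplified subset of Kolibri's content_file table
db.execute(
    "CREATE TABLE content_file (id TEXT, contentnode_id TEXT, local_file_id TEXT)"
)
rows = [
    ("a4cb8b88b22841ed8630a59a4aa38784", "bccde48efc125433be95abd5f09410e3",
     "934dda95a398ace5acc82d05496c5a7f"),
    ("769bcfbdb2094c7e9f03ecc192bb5d52", "28d7539f1bf555c9ac43e1f22636aa33",
     "934dda95a398ace5acc82d05496c5a7f"),
    ("476f800db3a541b7a587e835c4c6778b", "d6414c0cbb165ac185124d862925ec69",
     "934dda95a398ace5acc82d05496c5a7f"),
]
db.executemany("INSERT INTO content_file VALUES (?, ?, ?)", rows)

# One vfid (local_file_id) reused by several vid (id) / node_id entries
duplicated = db.execute(
    "SELECT local_file_id, COUNT(*) FROM content_file "
    "GROUP BY local_file_id HAVING COUNT(*) > 1"
).fetchall()
print(duplicated)  # [('934dda95a398ace5acc82d05496c5a7f', 3)]
```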
This causes the above-mentioned ffmpeg error because:

- the scraper uses the `vfid` to download the proper file and re-encode it
- each node reusing the same `vfid` downloads it again; not really a problem, but a loss of bandwidth consumption / time / ...
- a subsequent re-encoding of the same `vfid` does not find the source file on local disk (it was deleted after the first encoding) and fails

Digging a bit deeper, the scraper logic feels a bit weird, because:
- a `vfid` could be reused in multiple `vid` / `node_id`, causing the current issue
- it assumes a `vid` could change from one `vfid` to another in the future, so it keys everything on the `vid` (the video is re-encoded, uploaded to S3, downloaded back and added to the ZIM under its `vid`, to handle the case of a `vid` changing `vfid`)

My proposition is then to change the scraper logic:
- use the `vfid` for S3 operations and for additions to the ZIM; if a `vid` ever changes its `vfid`, the obsolete file will not be deleted from S3, but this probably rarely happens, and as of today we have the same issue when a `vid` becomes obsolete: it is not deleted from S3, and this might happen even more often (edition operations on the nodes structure in Studio change the `node_id` and `vid`, not the `vfid`)
- deduplicate videos when the same `vfid` is reused, to re-encode it only once, upload it only once, download it only once, add it to the ZIM only once

The consequence is that we will re-encode all videos again, but it is probably the right moment because we have only processed Khan Academy FR so far. And we could even imagine moving manually (with a script of course) all files from `vid` to `vfid` in S3.
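The proposed deduplication could be sketched as follows: key the processing on `vfid` so each unique file is downloaded, re-encoded, uploaded and added to the ZIM exactly once (the `videos` pairs reuse the sample above; the loop body is illustrative, not the scraper's actual API):

```python
# (vid, vfid) pairs as found in content_file for the Khan Academy FR sample
videos = [
    ("a4cb8b88b22841ed8630a59a4aa38784", "934dda95a398ace5acc82d05496c5a7f"),
    ("769bcfbdb2094c7e9f03ecc192bb5d52", "934dda95a398ace5acc82d05496c5a7f"),
    ("476f800db3a541b7a587e835c4c6778b", "934dda95a398ace5acc82d05496c5a7f"),
]

processed: set[str] = set()
encoded_count = 0

for vid, vfid in videos:
    if vfid in processed:
        continue  # already downloaded / re-encoded / uploaded under this vfid
    processed.add(vfid)
    encoded_count += 1  # download, re-encode, upload to S3, add to ZIM here

print(encoded_count)  # 1: the shared file is processed a single time
```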
WDYT about my proposition? (this is the remaining question 😉)
It makes sense 👍. I don't recall exactly, but I remember we were working mostly off a couple of channels, so we had to guess the intent based on limited samples.
I think we wanted to store stuff on S3 with the ID that's on Studio, so that a video used in multiple channels (not multiple nodes) would be stored once and would be detected when checking the bucket. Maybe we intended to use `vfid` and used `vid` by accident; I see no reason to purposely use `vid` based on your explanation. Not deleting on S3 has never been a concern, especially given we don't delete on S3 and wouldn't be able to, since videos can be shared across channels.
The downloading/encoding looks like this now that you've changed the way concurrency works. Because download is I/O-bound and encode is CPU-bound, I believe we had the two loops being consumed together. In line with the previous changes, I think we can continue to simplify for maintenance's sake. Linear, predictable performance is a good thing.
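The split described above could look like this minimal sketch (the `download` and `encode` helpers are hypothetical stand-ins): I/O-bound downloads overlap in a small thread pool, while the CPU-bound encodes are consumed sequentially and in order, which keeps throughput predictable.

```python
from concurrent.futures import ThreadPoolExecutor

def download(vfid: str) -> str:
    """Stand-in for the HTTP download of one video file."""
    return f"/tmp/{vfid}.mp4"

def encode(path: str) -> str:
    """Stand-in for the ffmpeg re-encode step."""
    return path.replace(".mp4", ".webm")

vfids = ["aaa", "bbb", "ccc"]

with ThreadPoolExecutor(max_workers=2) as pool:
    # downloads run concurrently; encodes happen one at a time, in order
    results = [encode(path) for path in pool.map(download, vfids)]

print(results)  # ['/tmp/aaa.webm', '/tmp/bbb.webm', '/tmp/ccc.webm']
```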
1746 videos have failed to be re-encoded in https://farm.openzim.org/pipeline/62191f74-ff73-473d-acc3-49af55fb5f8b/debug (but 2869 have succeeded, so the ratio is not totally bad ^^)
We unfortunately do not have the detailed stdout/stderr of ffmpeg in the log.
Sample error: