openzim / kolibri

Convert a Kolibri channel into ZIM file(s)
GNU General Public License v3.0

Multiple videos fail to be encoded in khan_academy_fr #101

Open benoit74 opened 7 months ago

benoit74 commented 7 months ago

1746 videos have failed to be re-encoded in https://farm.openzim.org/pipeline/62191f74-ff73-473d-acc3-49af55fb5f8b/debug (but 2869 have succeeded, so the ratio is not totally bad ^^)

We unfortunately do not have the detailed stdout/stderr of ffmpeg in the log.

Sample error:

[kolibri2zim::2024-03-02 00:04:42,677] ERROR:Error re-encoding 934dda95a398ace5acc82d05496c5a7f.mp4: Command '['/usr/bin/env', 'ffmpeg', '-y', '-i', 'file:/output/tmp7ct09nmy/934dda95a398ace5acc82d05496c5a7f.mp4', '-max_muxing_queue_size', '9999', '-codec:v', 'libvpx', '-quality', 'best', '-b:v', '128k', '-qmin', '18', '-qmax', '40', '-vf', "scale='480:trunc(ow/a/2)*2'", '-codec:a', 'libvorbis', '-ar', '44100', '-b:a', '48k', '-threads', '1', 'file:/tmp/tmpzcps5jbb/video.tmp.webm']' returned non-zero exit status 1.
[kolibri2zim::2024-03-02 00:04:42,678] ERROR:Command '['/usr/bin/env', 'ffmpeg', '-y', '-i', 'file:/output/tmp7ct09nmy/934dda95a398ace5acc82d05496c5a7f.mp4', '-max_muxing_queue_size', '9999', '-codec:v', 'libvpx', '-quality', 'best', '-b:v', '128k', '-qmin', '18', '-qmax', '40', '-vf', "scale='480:trunc(ow/a/2)*2'", '-codec:a', 'libvorbis', '-ar', '44100', '-b:a', '48k', '-threads', '1', 'file:/tmp/tmpzcps5jbb/video.tmp.webm']' returned non-zero exit status 1.
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/concurrent/futures/process.py", line 263, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/retrying.py", line 266, in call
    raise attempt.get()
          ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/local/lib/python3.12/site-packages/six.py", line 719, in reraise
    raise value
  File "/usr/local/lib/python3.12/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
                      ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kolibri2zim/debug.py", line 55, in safer_reencode
    return reencode(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/zimscraperlib/video/encoding.py", line 82, in reencode
    ffmpeg.check_returncode()
  File "/usr/local/lib/python3.12/subprocess.py", line 502, in check_returncode
    raise CalledProcessError(self.returncode, self.args, self.stdout,
subprocess.CalledProcessError: Command '['/usr/bin/env', 'ffmpeg', '-y', '-i', 'file:/output/tmp7ct09nmy/934dda95a398ace5acc82d05496c5a7f.mp4', '-max_muxing_queue_size', '9999', '-codec:v', 'libvpx', '-quality', 'best', '-b:v', '128k', '-qmin', '18', '-qmax', '40', '-vf', "scale='480:trunc(ow/a/2)*2'", '-codec:a', 'libvorbis', '-ar', '44100', '-b:a', '48k', '-threads', '1', 'file:/tmp/tmpzcps5jbb/video.tmp.webm']' returned non-zero exit status 1.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/kolibri2zim/scraper.py", line 482, in video_conversion_completed
    future.result()
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
subprocess.CalledProcessError: Command '['/usr/bin/env', 'ffmpeg', '-y', '-i', 'file:/output/tmp7ct09nmy/934dda95a398ace5acc82d05496c5a7f.mp4', '-max_muxing_queue_size', '9999', '-codec:v', 'libvpx', '-quality', 'best', '-b:v', '128k', '-qmin', '18', '-qmax', '40', '-vf', "scale='480:trunc(ow/a/2)*2'", '-codec:a', 'libvorbis', '-ar', '44100', '-b:a', '48k', '-threads', '1', 'file:/tmp/tmpzcps5jbb/video.tmp.webm']' returned non-zero exit status 1.
benoit74 commented 6 months ago

This occurs at least when the same video file (local_file_id in the Kolibri database) is used by multiple content_file entries, across multiple content_node entries. Each video has two IDs: its id (or vid in our codebase) and its local_file_id (or fid / vfid in our codebase).

The code takes into account that a given content_file id might be updated to point to a new fid. To handle this case, the code uploads to S3 using the id. What happens here is in fact the opposite:

Command used:

kolibri2zim --name "Khan Academy FR" --output output --channel-id 878ec2e6f88c5c268b1be6f202833cd4 --node-ids 'bccde48efc125433be95abd5f09410e3,28d7539f1bf555c9ac43e1f22636aa33,d6414c0cbb165ac185124d862925ec69' --debug --low-quality --use-webm

The ffmpeg error is not displayed in the log due to incorrect usage of python-scraperlib; this will be easy to fix.

Deduplicating the videos to process is less straightforward, due to the id / fid confusion.
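A minimal sketch of surfacing ffmpeg's stderr when a command fails (the `run_ffmpeg` helper and the logger name are hypothetical; the real fix lives in how the scraper uses zimscraperlib):

```python
import logging
import subprocess

logger = logging.getLogger("kolibri2zim")


def run_ffmpeg(args):
    """Run an ffmpeg command line and log its stderr when it fails.

    `args` is the full argv list, e.g. ["/usr/bin/env", "ffmpeg", "-y", ...].
    """
    process = subprocess.run(args, capture_output=True, text=True)
    if process.returncode != 0:
        # Log the actual ffmpeg diagnostics, not just the argv and exit code
        logger.error(
            "ffmpeg failed (exit %d): %s", process.returncode, process.stderr.strip()
        )
        raise subprocess.CalledProcessError(
            process.returncode, args, process.stdout, process.stderr
        )
    return process
```

With this, a failure such as "No such file or directory" shows up in the scraper log instead of only the opaque `returned non-zero exit status 1`.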

@rgaudin do you remember whether you ever saw a given id change its fid? Having experimented a bit with how Studio works, I suspect this never happens in fact. Studio often (as here) assigns the same fid to multiple ids when the editor decides to reuse the same video in multiple topics. But if the editor decides to change the video for a given topic, then a new node is created with its own id (and a new fid if it's a new video, or an already existing one otherwise).

rgaudin commented 6 months ago

Sorry @benoit74, but despite all this information, it's not clear. The ticket mentions an issue with the ffmpeg process.

You seem to have encountered something wrong or hard to understand around IDs while looking at this, but it seems unrelated to the current issue? Can you maybe open a separate ticket?

benoit74 commented 6 months ago

The exact ffmpeg error is (I just fixed this part, so we now have clear logs):

file:/tmp/tmp7vzclxvc/934dda95a398ace5acc82d05496c5a7f.mp4: No such file or directory

This is due to the fact that we remove the original file once it has been encoded, yet we try to encode the same file three times, because three nodes use the same file.
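A minimal sketch of deduplicating encode jobs by underlying file, so each physical file is only encoded once (the function name and the `(node_id, vid, vfid)` tuple shape are illustrative, not the scraper's actual API):

```python
def dedupe_videos(videos):
    """Keep a single encode job per underlying file (vfid).

    `videos` is an iterable of (node_id, vid, vfid) tuples; only the first
    occurrence of each vfid is scheduled, the other nodes reuse its output.
    """
    seen = set()
    jobs = []
    for node_id, vid, vfid in videos:
        if vfid in seen:
            continue  # this file is already scheduled; the node reuses the result
        seen.add(vfid)
        jobs.append((node_id, vid, vfid))
    return jobs
```

For the three Khan Academy FR nodes sharing `934dda95a398ace5acc82d05496c5a7f`, this yields a single encode job instead of three, so the original file is still present when encoding runs.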

rgaudin commented 6 months ago

Now that's clearer. Is there any remaining question?

benoit74 commented 6 months ago

Since there is indeed a remaining question, I assume it is still not crystal clear 🤣

Let me take one example from Khan Academy FR.

New glossary (previous one was confusing):

Inside the database, we encounter the kind of situation below.

node_id                           vid                               vfid                              checksum                          extension  file_size
bccde48efc125433be95abd5f09410e3  a4cb8b88b22841ed8630a59a4aa38784  934dda95a398ace5acc82d05496c5a7f  934dda95a398ace5acc82d05496c5a7f  mp4        6124541
28d7539f1bf555c9ac43e1f22636aa33  769bcfbdb2094c7e9f03ecc192bb5d52  934dda95a398ace5acc82d05496c5a7f  934dda95a398ace5acc82d05496c5a7f  mp4        6124541
d6414c0cbb165ac185124d862925ec69  476f800db3a541b7a587e835c4c6778b  934dda95a398ace5acc82d05496c5a7f  934dda95a398ace5acc82d05496c5a7f  mp4        6124541

In other words, the same vfid (i.e. the same real video/file) is reused by multiple vids (i.e. multiple content_file entries) across multiple node_ids (i.e. multiple content_contentnode entries). This is quite normal: it means the content editor decided to reuse the same video in multiple nodes / courses.
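The duplication can be spotted directly in the channel's SQLite database with a query along these lines. A self-contained sketch, using an in-memory stand-in: the `content_file` table and its `local_file_id` column follow the names mentioned in this thread, but treat the exact schema as an assumption:

```python
import sqlite3

# In-memory stand-in for the Kolibri channel database; table/column names
# are assumed from the content_file / local_file_id naming used in Kolibri.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE content_file (id TEXT PRIMARY KEY, contentnode_id TEXT, local_file_id TEXT)"
)
# One real file (vfid) referenced by three content_file rows (vids) on three nodes
rows = [
    ("a4cb8b88b22841ed8630a59a4aa38784", "bccde48efc125433be95abd5f09410e3"),
    ("769bcfbdb2094c7e9f03ecc192bb5d52", "28d7539f1bf555c9ac43e1f22636aa33"),
    ("476f800db3a541b7a587e835c4c6778b", "d6414c0cbb165ac185124d862925ec69"),
]
for vid, node_id in rows:
    db.execute(
        "INSERT INTO content_file VALUES (?, ?, '934dda95a398ace5acc82d05496c5a7f')",
        (vid, node_id),
    )

# vfids referenced by more than one vid are exactly the problematic duplicates
shared = db.execute(
    """
    SELECT local_file_id, COUNT(*) AS n
    FROM content_file
    GROUP BY local_file_id
    HAVING n > 1
    """
).fetchall()
print(shared)  # the single shared vfid, used by 3 vids
```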

This causes the above-mentioned ffmpeg error because:

Digging a bit deeper, the scraper logic feels a bit weird, because:

My proposition is then to change the scraper logic:

The consequence is that we will re-encode all videos again, but this is probably the right moment, because we have only processed Khan Academy FR for now. We could even imagine manually moving (with a script, of course) all files from vid to vfid keys in S3.
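The manual S3 move could start from a rename plan like the sketch below. The key layout (`videos/<id>`) and the function are hypothetical; the actual copy/delete would be done with whatever S3 client the scripts already use:

```python
def s3_rename_plan(videos, prefix="videos/"):
    """Compute (old_key, new_key) pairs for moving S3 objects from
    vid-based keys to vfid-based keys.

    `videos` is an iterable of (vid, vfid) pairs. Multiple vids pointing at
    the same vfid collapse onto a single target key, so redundant copies
    are skipped and the shared file ends up stored exactly once.
    """
    plan = []
    targets = set()
    for vid, vfid in videos:
        new_key = f"{prefix}{vfid}"
        if new_key in targets:
            continue  # already migrated via another vid sharing this vfid
        targets.add(new_key)
        plan.append((f"{prefix}{vid}", new_key))
    return plan
```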

WDYT about my proposition? (this is the remaining question 😉)

rgaudin commented 6 months ago

It makes sense 👍. I don't recall exactly, but I remember we were working mostly off a couple of channels, so we had to guess the intent based on limited samples.

I think we wanted to store things on S3 with the ID from Studio so that a video used in multiple channels (not just multiple nodes) would be stored once and detected when checking the bucket. Maybe we intended to use vfid and used vid by accident; based on your explanation, I see no reason to purposely use vid. Not deleting on S3 has never been a concern: we don't delete on S3, and we wouldn't be able to anyway, since videos can be shared across channels.

The downloading/encoding flow looks like this now that you've changed the way concurrency works. Because downloading is I/O-bound and encoding is CPU-bound, I believe we originally had the two loops being consumed together. In line with the previous changes, I think we can continue to simplify for maintenance's sake. Linear, predictable performance is a good thing.
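A sketch of the two-stage split described above, with downloads on one pool feeding encodes scheduled as each download completes. Names, pool sizes, and the use of threads for the encode stage are illustrative (in the real scraper the CPU-bound encode stage would fit a `ProcessPoolExecutor`):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def pipeline(vfids, download, encode, workers=4):
    """Download each unique file, then encode it once its download finishes.

    `download` and `encode` are callables taking a vfid / file path. Threads
    are used for both stages here so the sketch stays self-contained.
    """
    results = {}
    with ThreadPoolExecutor(workers) as dl_pool, ThreadPoolExecutor(workers) as enc_pool:
        # Stage 1: I/O-bound downloads, all in flight at once
        dl_futures = {dl_pool.submit(download, vfid): vfid for vfid in vfids}
        enc_futures = {}
        # Stage 2: schedule each encode as soon as its download completes
        for fut in as_completed(dl_futures):
            vfid = dl_futures[fut]
            enc_futures[enc_pool.submit(encode, fut.result())] = vfid
        for fut, vfid in enc_futures.items():
            results[vfid] = fut.result()
    return results
```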