benoit74 opened this issue 7 months ago
This occurs at least when the same video file (`local_file_id` in the Kolibri database) is used in multiple `content_file` entries, in multiple `content_node` entries. Each video has two IDs: its `id` (or `vid` in our codebase) and its `local_file_id` (or `fid` / `vfid` in our codebase).

The code takes into account that a given `content_file` `id` might be updated to point to a new `fid`. To handle this case, the code uploads to S3 using the `id`. What happens here is in fact the opposite:
Command used:

```sh
kolibri2zim --name "Khan Academy FR" --output output --channel-id 878ec2e6f88c5c268b1be6f202833cd4 --node-ids 'bccde48efc125433be95abd5f09410e3,28d7539f1bf555c9ac43e1f22636aa33,d6414c0cbb165ac185124d862925ec69' --debug --low-quality --use-webm
```
- the ffmpeg error is not displayed in the log due to poor usage of python-scraperlib; this will be easily fixed
- deduplicating the videos to process is not as straightforward, due to the `id` / `fid` confusion
@rgaudin do you remember if you ever saw a given `id` having changed its `fid`? Having experimented a bit with how Studio works, I suspect this never happens in fact. Studio often (like here) assigns the same `fid` to multiple `id`s when the editor decides to reuse the same video in multiple topics. But if the editor decides to change the video for a given topic, then a new node is created with its own `id` (and a new `fid` if it's a new video, or an already existing one otherwise).
Sorry @benoit74, but despite all this information, it's not clear. The ticket mentions an issue with the ffmpeg process. You seem to have encountered something wrong or hard to understand around IDs while looking at this, but it seems unrelated to the current issue? Can you maybe open a separate ticket?
The exact ffmpeg error is (I just fixed this part, we now have clear logs):

```
file:/tmp/tmp7vzclxvc/934dda95a398ace5acc82d05496c5a7f.mp4: No such file or directory
```
This is due to the fact that we remove the original file once it is encoded, and we try to encode the same file 3 times because 3 nodes are using the same file.
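The failure pattern can be reproduced with a minimal sketch (the `encode` helper and node names are hypothetical stand-ins, not the actual scraper code): three nodes share one source file, the source is deleted after the first encode, so later encodes of the same file fail.

```python
import os
import tempfile

def encode(src: str, dst: str) -> None:
    """Stand-in for the ffmpeg re-encode step: just copy the bytes."""
    with open(src, "rb") as f_in, open(dst, "wb") as f_out:
        f_out.write(f_in.read())

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "934dda95a398ace5acc82d05496c5a7f.mp4")
with open(src, "wb") as f:
    f.write(b"fake video bytes")

errors = []
for node in ("node-1", "node-2", "node-3"):  # 3 nodes share one source file
    try:
        encode(src, os.path.join(tmp, f"{node}.webm"))
        os.remove(src)  # the scraper removes the original once encoded
    except FileNotFoundError as exc:
        errors.append(str(exc))

print(len(errors))  # 2: the second and third nodes no longer find the source
```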
Now that's clearer. Is there any remaining question?
Since there is indeed a remaining question, I assume it is still not crystal clear 🤣
Let me take one example from Khan Academy FR.
New glossary (previous one was confusing):

- `node_id` = `id` column in `content_contentnode` table, or `contentnode_id` in `content_file` table
- `vid` = `id` column in `content_file` table
- `vfid` = `local_file_id` column in `content_file` table
- `checksum` = `checksum` column in `content_file` table
- `file_size` = `file_size` column in `content_file` table
- `extension` = `extension` column in `content_file` table

Inside the database, we encounter the kind of situation below.
| node_id | vid | vfid | checksum | extension | file_size |
|---|---|---|---|---|---|
| bccde48efc125433be95abd5f09410e3 | a4cb8b88b22841ed8630a59a4aa38784 | 934dda95a398ace5acc82d05496c5a7f | 934dda95a398ace5acc82d05496c5a7f | mp4 | 6124541 |
| 28d7539f1bf555c9ac43e1f22636aa33 | 769bcfbdb2094c7e9f03ecc192bb5d52 | 934dda95a398ace5acc82d05496c5a7f | 934dda95a398ace5acc82d05496c5a7f | mp4 | 6124541 |
| d6414c0cbb165ac185124d862925ec69 | 476f800db3a541b7a587e835c4c6778b | 934dda95a398ace5acc82d05496c5a7f | 934dda95a398ace5acc82d05496c5a7f | mp4 | 6124541 |
In other words, the same `vfid` (i.e. the same real video/file) is reused in multiple `vid`s (i.e. multiple `content_file` entries) for multiple `node_id`s (i.e. multiple `content_contentnode` entries). This is quite normal: it means the content editor decided to reuse the same video in multiple nodes / courses.
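This duplication can be spotted directly in the Kolibri sqlite database; a sketch using the table/column names from the glossary above (the in-memory data below just mimics the Khan Academy FR sample, and the real `content_file` table has more columns):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Simplified subset of Kolibri's content_file table
db.execute(
    "CREATE TABLE content_file (id TEXT, contentnode_id TEXT, local_file_id TEXT)"
)
rows = [
    ("a4cb8b88b22841ed8630a59a4aa38784", "bccde48efc125433be95abd5f09410e3",
     "934dda95a398ace5acc82d05496c5a7f"),
    ("769bcfbdb2094c7e9f03ecc192bb5d52", "28d7539f1bf555c9ac43e1f22636aa33",
     "934dda95a398ace5acc82d05496c5a7f"),
    ("476f800db3a541b7a587e835c4c6778b", "d6414c0cbb165ac185124d862925ec69",
     "934dda95a398ace5acc82d05496c5a7f"),
]
db.executemany("INSERT INTO content_file VALUES (?, ?, ?)", rows)

# One vfid (local_file_id) reused by several vid (id) / node_id entries
duplicated = db.execute(
    "SELECT local_file_id, COUNT(*) FROM content_file "
    "GROUP BY local_file_id HAVING COUNT(*) > 1"
).fetchall()
print(duplicated)  # [('934dda95a398ace5acc82d05496c5a7f', 3)]
```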
This causes the above-mentioned ffmpeg error because:

- the scraper uses the `vfid` to download the proper file and re-encode it
- each node reusing the same `vfid` downloads it again; not really a problem, but a loss of bandwidth consumption / time / ...
- a subsequent re-encoding of the same `vfid` does not find the source file on local disk (it was deleted after the first encoding) and fails

Digging a bit deeper, the scraper logic feels a bit weird, because:
- a `vfid` could be reused in multiple `vid` / `node_id`, causing the current issue
- it assumes a `vid` could change from one `vfid` to another in the future, so it keys everything on the `vid` (the video is re-encoded, uploaded to S3, downloaded back and added to the ZIM under its `vid`, to handle the case of a `vid` changing `vfid`)

My proposition is then to change the scraper logic:
- use the `vfid` for S3 operations and for additions to the ZIM; if a `vid` ever changes its `vfid`, the obsolete file will not be deleted from S3, but this probably rarely happens, and as of today we have the same issue when a `vid` becomes obsolete: it is not deleted from S3, and this might happen even more often (edition operations on the nodes structure in Studio change the `node_id` and `vid`, not the `vfid`)
- deduplicate videos when the same `vfid` is reused, to re-encode it only once, upload it only once, download it only once, add it to the ZIM only once

The consequence is that we will re-encode all videos again, but it is probably the right moment because we have only processed Khan Academy FR so far. And we could even imagine moving manually (with a script of course) all files from `vid` to `vfid` in S3.
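The proposed deduplication could be sketched as follows: key the processing on `vfid` so each unique file is downloaded, re-encoded, uploaded and added to the ZIM exactly once (the `videos` pairs reuse the sample above; the loop body is illustrative, not the scraper's actual API):

```python
# (vid, vfid) pairs as found in content_file for the Khan Academy FR sample
videos = [
    ("a4cb8b88b22841ed8630a59a4aa38784", "934dda95a398ace5acc82d05496c5a7f"),
    ("769bcfbdb2094c7e9f03ecc192bb5d52", "934dda95a398ace5acc82d05496c5a7f"),
    ("476f800db3a541b7a587e835c4c6778b", "934dda95a398ace5acc82d05496c5a7f"),
]

processed: set[str] = set()
encoded_count = 0

for vid, vfid in videos:
    if vfid in processed:
        continue  # already downloaded / re-encoded / uploaded under this vfid
    processed.add(vfid)
    encoded_count += 1  # download, re-encode, upload to S3, add to ZIM here

print(encoded_count)  # 1: the shared file is processed a single time
```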
WDYT about my proposition? (this is the remaining question 😉)
It makes sense 👍. I don't recall exactly, but I remember we were working mostly off a couple of channels, so we had to guess the intent based on limited samples.
I think we wanted to store stuff on S3 with the ID that's on Studio, so that a video used in multiple channels (not multiple nodes) would be stored once and would be detected when checking the bucket. Maybe we intended to use `vfid` and used `vid` by accident; I see no reason to purposely use `vid` based on your explanation. Not deleting on S3 has never been a concern, especially given we don't delete on S3 and wouldn't be able to, since videos can be shared across channels.
The downloading/encoding looks like this now that you've changed the way concurrency works. Because download is I/O-bound and encode is CPU-bound, I believe we had the two loops being consumed together. In line with the previous changes, I think we can continue to simplify for maintenance's sake. Linear, predictable performance is a good thing.
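The split described above could look like this minimal sketch (the `download` and `encode` helpers are hypothetical stand-ins): I/O-bound downloads overlap in a small thread pool, while the CPU-bound encodes are consumed sequentially and in order, which keeps throughput predictable.

```python
from concurrent.futures import ThreadPoolExecutor

def download(vfid: str) -> str:
    """Stand-in for the HTTP download of one video file."""
    return f"/tmp/{vfid}.mp4"

def encode(path: str) -> str:
    """Stand-in for the ffmpeg re-encode step."""
    return path.replace(".mp4", ".webm")

vfids = ["aaa", "bbb", "ccc"]

with ThreadPoolExecutor(max_workers=2) as pool:
    # downloads run concurrently; encodes happen one at a time, in order
    results = [encode(path) for path in pool.map(download, vfids)]

print(results)  # ['/tmp/aaa.webm', '/tmp/bbb.webm', '/tmp/ccc.webm']
```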
1746 videos have failed to be re-encoded in https://farm.openzim.org/pipeline/62191f74-ff73-473d-acc3-49af55fb5f8b/debug (but 2869 have succeeded, so the ratio is not totally bad ^^)
We unfortunately do not have the detailed stdout/stderr of ffmpeg in the log.
Sample error: