ropensci / av

Working with Video in R
https://docs.ropensci.org/av

Support for `concat` and unique durations per input file? #21

Closed leonawicz closed 4 years ago

leonawicz commented 4 years ago

If this can already be done with av, I am unsure how.

Need/use case: Rendering video from image and/or audio file sequences where each file can have a unique duration.

The user should not have to calculate and create redundant frames just to obtain an interpolated sequence with a fixed frame rate before rendering. Doing so can require many more frames than the initial set, as well as a very high frame rate, to make everything line up. All of that can be avoided by using ffmpeg concat.

In my current use case, I use R code to create text files like the following:

Images

file 'out_00.png'
duration 2
file 'out_01.png'
duration 0.222222222222222
file 'out_02.png'
duration 0.222222222222222
file 'out_03.png'
duration 0.222222222222222
file 'out_04.png'
duration 0.666666666666667
file 'out_05.png'
duration 0.666666666666667
file 'out_06.png'
duration 0.666666666666667
file 'out_07.png'
duration 0.333333333333333
file 'out_08.png'
duration 0.333333333333333
file 'out_09.png'
duration 0.333333333333333
file 'out_10.png'
duration 0.333333333333333
file 'out_11.png'
duration 1.33333333333333
file 'out_12.png'
duration 1.33333333333333
file 'out_12.png'

Audio

file 'out.wav'
duration 8.666667

In this example I have 13 images with varying durations in one text file (the last image is listed twice due to a quirk of concat itself; that's not a typo). I have another text file which, in this case, includes a single audio file whose duration equals the combined duration of the images, but it could have used multiple files as well. These text files are the inputs to concat, and I invoke ffmpeg like the following.

ffmpeg -y -f concat -i input1.txt -f concat -i input2.txt -vsync vfr -pix_fmt yuv420p -vf "fps=30, scale=1280:-2" out.mp4

What would be most useful for this use case is the ability to pass a numeric vector of durations in seconds paired with a vector of media files. I am not an ffmpeg expert, so I cannot speak to how general the possibilities are with concat, or to the best way to expose the functionality to R users, e.g. whether the source media is on disk or captured from R plots. But being able to assign a duration per file in a sequence and send it to concat would definitely be a huge benefit. It is very easy to prepare these simple text files and the accompanying ffmpeg calls with R code. Perhaps something like this could be integrated into av?
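For reference, here is a minimal sketch of the kind of R helper I use to generate these concat input files. The function name and arguments are just illustrative, not part of av:

```r
# Write an ffmpeg concat demuxer input file: a `file`/`duration`
# line pair per entry (helper name and file names are hypothetical).
write_concat_file <- function(files, durations, path){
  stopifnot(length(files) == length(durations))
  lines <- paste0("file '", files, "'\nduration ", durations)
  # concat quirk: repeat the final file so its duration is honored
  writeLines(c(lines, paste0("file '", files[length(files)], "'")), path)
}

write_concat_file(sprintf("out_%02d.png", 0:3), c(2, rep(2/9, 3)), "input1.txt")
```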

jeroen commented 4 years ago

Video files always have a fixed framerate. The ffmpeg concat utility simply duplicates frames internally until it reaches the desired duration (sec * fps).

Maybe you can use a simple wrapper to accomplish this, like so:

av_concat <- function(images, duration, framerate = 24, ...){
  # Repeat each image for roughly duration * framerate frames
  # (round, because rep() truncates non-integer counts)
  input <- rep(images, times = round(duration * framerate))
  av_encode_video(input = input, framerate = framerate, ...)
}

And then you could use it like this:

# Generate some images:
library(ggplot2)
png(filename = "test%03d.png", width = 800, height = 600)
example(qplot, ask = FALSE)
dev.off()

# Create the slideshow:
slides <- c("test001.png", "test002.png", "test003.png", "test004.png")
duration <- c(0.5, 2, 0.5, 1)
av_concat(slides, duration)

So that gives you the desired output. However, I do see that the output file size is a bit larger than with ffmpeg concat; I'll see if I can figure out how to fix that. But if you end up uploading to YouTube or the like, it doesn't matter, because the video gets converted to another format anyway.

leonawicz commented 4 years ago

I understand, thanks for this. I realize now there is a limitation in this approach, but the limitation also exists with concat anyway.

I am aligning images to music by beat, which requires that I snap images to unequal time points precisely, or at least precisely enough that any offset between video and audio that occurs or accumulates over a piece of music is not perceptible.

Here is a good example of the problem. Since this is musical, say I have music with a tempo of 110 quarter-note beats per minute; that's 220 eighth-note bpm. Say my music is nothing but a sequence of 24 eighth notes. Between the tempo, the note values, and an fps of 24, the number of frames per note works out to just over 6.5.

> duration <- rep(60 / (110 * 2), 24)
> framerate <- 24
> duration * framerate
 [1] 6.545455 6.545455 6.545455 6.545455 6.545455 ...
> (true_duration <- sum(duration * framerate))
[1] 157.0909
> (rounded_duration <- sum(round(duration * framerate)))
[1] 168
> true_duration - rounded_duration
[1] -10.90909

The lengthened frames create a lag in the image sequence, relative to the accompanying audio, of about 11 frames, or roughly half a second. That is a very perceptible desync over a piece of music originally only about 6.5 seconds long, a substantial percentage. Over the course of a whole song it becomes very large. Really, even for a very short example it is too much: any perceptible accumulating lag or lead is too much. Now that I've done more testing, I can see this is an issue with or without concat.

In this example, to avoid rounding frame counts per image, given the tempo I would have to multiply the framerate by 11, giving a new framerate of 264 and 1728 total frames from only 24 original images. This would be even more complex if the notes did not all have the same value. Clearly such an extreme approach is no good, and concat does not handle the issue either.

It seems I will have to use an example like yours or use concat, but either way I will need a more general and robust function that snaps each image to the closest frame in time globally, rather than rounding each duration independently. That seems to be the fundamental problem, and it should be pretty easy for me to solve.

In my tests, an mp4 made with concat came out proportionally much smaller than one made without, but that may not mean much: aside from applying the same -vf filter in both approaches, I am not sure I can fully reproduce the ffmpeg command line arguments I was using above.

leonawicz commented 4 years ago

Something like this is more globally accurate:

.snap_durations_to_framerate <- function(x, fps){
  # Round cumulative times onto the frame grid, then difference back
  # to per-image durations, so rounding errors cannot accumulate.
  diff(c(0, round(cumsum(x * fps)))) / fps
}

to snap the durations onto the even grid permitted by the fixed framerate. This keeps the deltas from accumulating over time when rounding to whole-number frame counts would tend more in one direction than the other. It could be useful if you do implement something slideshow-based with a duration vector.
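Applied to the tempo example above, snapping yields whole-frame durations whose total matches the true total to within half a frame, rather than drifting by 11 frames. A quick sketch:

```r
.snap_durations_to_framerate <- function(x, fps){
  diff(c(0, round(cumsum(x * fps)))) / fps
}

duration <- rep(60 / (110 * 2), 24)  # 24 eighth notes at 110 bpm
fps <- 24
snapped <- .snap_durations_to_framerate(duration, fps)
sum(snapped * fps)            # 157 whole frames, vs. 168 from per-note rounding
sum(snapped) - sum(duration)  # total drift stays under half a frame
```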

Unfortunately, even with what look like perfectly ideal duration vectors, and even with concat, the resulting video does not match the new input durations, and the images continue to lag behind the audio. Increasing the frame rate to shrink the deltas between the original and grid-snapped durations also makes no difference in the lag. There must be something deeper going on that I will have to figure out. I will close this issue.