nicknsy / jellyscrub

Smooth mouse-over video scrubbing previews for Jellyfin.
MIT License
668 stars · 27 forks

Generate BIF files in parallel #77

Closed · kura closed 1 year ago

kura commented 1 year ago

I have a pretty big library, and even with hardware acceleration it's taking some time to plough through all those files and generate the BIF files. It would be nice to generate multiple BIF files simultaneously, or at least generate multiple sizes per media file simultaneously.

How would you feel about making it possible to do parallel generation and make it a configurable option?

nicknsy commented 1 year ago

My only concern with this is that ffmpeg already tries to maximize hardware usage across cores and threads, and by default will put the CPU at 80-100%. Using HW accel also puts my GPU decode at 100%. So if another generation task is added, I wonder if it wouldn't end up doing two times the work at half the speed, leading to no improvement.

It's worth doing some actual tests but I'm not sure if it's something that will be beneficial.

kura commented 1 year ago

That is fair. It's also only really an issue for me currently because my library is pretty large and I'm starting from 0%.

When doing software decoding it'll mostly utilise my CPU, which'll sit at around 80%, so not much headroom.

With HW acceleration, though, it's only utilising 3% of my GPU, which is why I figured I'd have headroom to do parallel work. I also use a Quadro card, so I think that allows for more parallel encode/decode tasks compared to consumer cards without a driver patch. So this may not be that helpful to other users.

I was tempted to generate the files in parallel myself, but I didn't want to store them in my media directory, and I'm not sure how Jellyfin maps the video files to the internal metadata directory. I guessed it was probably something like an MD5 or SHA-1 sum of the file path or just the file name, but I couldn't figure out exactly how it does it.

nicknsy commented 1 year ago

Oh yeah, if it's only using 3% GPU, maybe it could work out for you. I don't have the hardware to test, but if you want to look into it, Jellyfin metadata is structured like this:

Given a media ID of 92716b0f099f417b92a57549103eb609, the metadata will be stored in {jellyfin_config}/metadata/library/92/92716b0f099f417b92a57549103eb609/. So it's just the first two characters of the ID Jellyfin generates for that media, then another folder using the full ID.
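Under that layout, locating the metadata directory for a known media ID is just string slicing. A minimal sketch, assuming the /config root used by the official Docker images (adjust jellyfin_config for other installs; the ID is the example from above):

```shell
#!/bin/bash
# Sketch: derive the metadata directory for a media ID the way described
# above -- first two characters of the ID as a bucket, then the full ID.
# jellyfin_config="/config" is an assumption (Docker default); change it
# to match your install.
jellyfin_config="/config"
id="92716b0f099f417b92a57549103eb609"
bucket="${id:0:2}"   # first two characters, here "92"
meta_dir="$jellyfin_config/metadata/library/$bucket/$id"
echo "$meta_dir"
```

With the ID above this prints /config/metadata/library/92/92716b0f099f417b92a57549103eb609.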

kura commented 1 year ago

How does it generate and map the ID, though? The bit I struggled with was finding a simple way of putting the manually generated files in the right directory.

I figured I'd have to either do an API lookup on the media file or use the media file itself to generate the ID, e.g. via an md5sum or sha1sum, but I wasn't sure exactly how that ID is generated or mapped. I had a brief look in the SQLite databases too, but didn't find any instantly recognisable references mapping a media file to a specific ID.

nicknsy commented 1 year ago

I think it's a GUID generated by a .NET internal library, so I don't think they're predictable from the source file.

If you have Visual Studio you could change private static readonly SemaphoreSlim BifWriterSemaphore = new SemaphoreSlim(1, 1); in VideoProcessor to new SemaphoreSlim(3, 3) for 3 concurrent streams. The only problem is that VideoProcessor is not set up in a very multi-thread-friendly way: target media is not queued; rather, a new instance of VideoProcessor is made that waits on this semaphore. This means that if two duplicate media items were to run at the same time, they could end up generating, and trying to write to, the same BIF file at the same time.

Also, the media encoder used to spawn ffmpeg processes, OldMediaEncoder, is limited to one process at a time, so multiple resolutions would not generate simultaneously without changing that too.

If you could do a PoC where you copy the ffmpeg extraction command and run three parallel processes on the same file (changing the output folder for each), on your card they should all finish around the same time. If that were the case, I'd be happy to look into getting the code working multi-threaded.

kura commented 1 year ago

Thanks.

I'll have a tinker tomorrow or over the weekend.

kub3let commented 1 year ago

From my observation only one video core is used with QSV, while newer Intel iGPUs have 2.

Running a second ffmpeg process would probably make use of both.

Still, I wouldn't bother with it; just let it run for a couple of days. Mine took 2.5 days with hardware acceleration for 700 movies.

SandyRodgers-2017 commented 1 year ago

Another solution could be to use multiple GPUs. I do this with a script I made to generate BIFs. It can cut down generation time; just make sure the script isn't running on the same directories, or it can cause conflicts.
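For anyone curious what that multi-GPU split might look like, here is a hypothetical sketch. It only prints which CUDA device index each input file would be assigned; a real version would pass that index to ffmpeg via -init_hw_device cuda=cu:N and background each run. num_gpus and the file names are made up for illustration:

```shell
#!/bin/bash
# Hypothetical sketch: round-robin input files across N GPUs. A real script
# would launch, per file:
#   ffmpeg -init_hw_device cuda=cu:$gpu ... &
# Here the function only echoes the assignment so the scheduling is visible.
num_gpus=2

assign_gpus() {
  local gpu=0 f
  for f in "$@"; do
    echo "cu:$gpu $f"
    gpu=$(( (gpu + 1) % num_gpus ))
  done
}

assign_gpus ep1.mkv ep2.mkv ep3.mkv
```

With two GPUs, ep1 and ep3 land on device 0 and ep2 on device 1; running each group from a separate directory avoids the conflict mentioned above.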

kura commented 1 year ago

So @nicknsy, I did a few tests. For these tests I used remuxed Blu-ray TV episodes that are about 58 minutes in length. By remux I mean they are basically straight rips from the Blu-ray into an MKV container, so they are the same size and use the same encoding (h264 Main10) as they have on the disc. Two of the episodes are 25GB in size, the other is 18GB.

Test 1 - Baseline - generating a set of 320px images from a single video file:

GPU usage:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P2000        Off  | 00000000:01:00.0 Off |                  N/A |
| 51%   41C    P0    26W /  75W |    615MiB /  5120MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     15705      C   ...ib/jellyfin-ffmpeg/ffmpeg      611MiB |
+-----------------------------------------------------------------------------+

Time taken:

real    6m41.168s
user    0m57.486s
sys     0m11.421s

Test 2 - Same episode as test 1, generating 320px and 640px images in parallel using GPU decoding:

GPU usage:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P2000        Off  | 00000000:01:00.0 Off |                  N/A |
| 52%   43C    P0    26W /  75W |   1226MiB /  5120MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     24546      C   ...ib/jellyfin-ffmpeg/ffmpeg      611MiB |
|    0   N/A  N/A     24547      C   ...ib/jellyfin-ffmpeg/ffmpeg      611MiB |
+-----------------------------------------------------------------------------+

Time taken:

real    13m23.880s
user    1m2.782s
sys     0m14.604s

real    13m23.893s
user    1m2.160s
sys     0m15.775s

Test 3 - Generating 320px images from 3 remuxed Blu-ray episodes in parallel:

GPU usage:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P2000        Off  | 00000000:01:00.0 Off |                  N/A |
| 57%   54C    P0    25W /  75W |   1837MiB /  5120MiB |     10%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1735      C   ...ib/jellyfin-ffmpeg/ffmpeg      611MiB |
|    0   N/A  N/A      1738      C   ...ib/jellyfin-ffmpeg/ffmpeg      611MiB |
|    0   N/A  N/A      1741      C   ...ib/jellyfin-ffmpeg/ffmpeg      611MiB |
+-----------------------------------------------------------------------------+

Time taken:

real    15m32.133s
user    0m49.414s
sys     0m15.247s

real    17m39.238s
user    1m0.021s
sys     0m19.074s

real    18m5.899s
user    1m5.027s
sys     0m18.635s

Conclusion

I think what this shows is that there is some benefit to running multiple hardware-accelerated decodes in parallel, but nowhere near as much as I was expecting. I'm also unsure why, since you can see it isn't an issue of power, GPU utilisation, or GPU memory.

nicknsy commented 1 year ago

Very interesting results! Thank you for taking the time to do this.

It shows almost no benefit, but the GPU utilization also only goes from 9% with one process to 10% with three, which would seem to indicate that it's only using a single streaming multiprocessor.

Could you post the ffmpeg command you used for the three process test? If you run the command again removing any -threads arguments does it change anything?

kura commented 1 year ago

That utilisation value fluctuates; that was just me grabbing it during the runs. I was watching it with the watch command throughout, and it'd bounce between 8 and 12%.

As for the ffmpeg command, it is the exact same command the Jellyfin plugin uses, just grabbed from the log with the file path changed. All I did was wrap it using find /path/ -type f -name '*.mkv' | while read f and then use $f as the input file and as part of the output path.
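A rough reconstruction of that wrapper, with illustrative paths and a trimmed stand-in for the logged ffmpeg arguments (not the plugin's actual command). The ffmpeg invocation is echoed rather than executed so the plumbing can be seen without a GPU:

```shell
#!/bin/bash
# Sketch of the find | while read wrapper described above. Paths and the
# shortened ffmpeg arguments are illustrative stand-ins, not taken from
# the Jellyscrub log; the command is printed instead of run.
gen_cmds() {
  local media_root="${1:-.}" f out
  find "$media_root" -type f -name '*.mkv' | while read -r f; do
    out="${f%.mkv}"   # per-file output directory derived from the input path
    echo "ffmpeg -i file:$f -vf fps=1/10,scale=320:-1 $out/img_%08d.jpg"
  done
}

# usage: gen_cmds /media
```

Dropping the echo (and backgrounding the ffmpeg call) turns this back into the wrapper used for the tests above.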

Sure, I can try it again later with the threads argument removed.

kura commented 1 year ago

@nicknsy I tried it without the -threads arguments, no difference. Here it's even more apparent how badly this approach scales.

I also don't know why. Maybe it struggles with parallel decodes on a single NVDEC chip. Yet my GPU only has 1 NVENC chip too, and it can transcode multiple streams concurrently using HW acceleration without any issue.

I did these runs with a smaller file (a 7GB anime remux).

1 decode

real    1m32.774s
user    0m19.473s
sys     0m4.379s

3 parallel decodes

real    4m38.486s
user    0m22.372s
sys     0m6.441s

real    4m38.565s
user    0m22.499s
sys     0m6.294s

real    4m38.571s
user    0m22.518s
sys     0m6.328s

Script

Note: this cheats a little; it decodes the same file in every run to make things a bit easier.

#!/bin/bash
if [[ "$#" -ne "1" ]]
then
  echo "Usage: trickplay.sh <number_of_parallel_runs>"
  exit 1
fi

# Launch the requested number of ffmpeg runs in parallel, each writing to
# its own output directory, then wait for all of them to finish.
for ((i = 1; i <= $1; i++))
do
  mkdir -p "/media/__temporary_content/$i"
  time /usr/lib/jellyfin-ffmpeg/ffmpeg \
   -loglevel error \
   -init_hw_device cuda=cu:0 \
   -filter_hw_device cu \
   -hwaccel cuda \
   -hwaccel_output_format cuda \
   -autorotate 0 \
   -i file:"/media/__temporary_content/filename.mkv" \
   -autoscale 0 \
   -an \
   -sn \
   -vf "fps=1/10,setparams=color_primaries=bt709:color_trc=bt709:colorspace=bt709,scale_cuda=w=640:h=360:format=yuv420p,hwdownload,format=yuv420p" \
   -c:v mjpeg \
   -f image2 \
   "/media/__temporary_content/$i/img_%08d.jpg" &
done
wait

nicknsy commented 1 year ago

Is the GPU utilization still the same with the threads not set?

kura commented 1 year ago

Yep. With this source file it doesn't breach 3% utilisation, no matter whether it's a single process or 3 in parallel.

nicknsy commented 1 year ago

Damn. I'm really not knowledgeable enough about HWA to know why that would be the case, but I appreciate you testing all this. Maybe ffmpeg treats an unset thread count differently with the GPU.

kura commented 1 year ago

No worries, I now have a couple of thoughts on some more things I want to test out just for the hell of it.

My library is now at nearly 20%, so only another 8 days or so to go; at this point this is more of a curiosity for me than anything else.

kura commented 1 year ago

I tried a few things over lunch: I rewrote the ffmpeg command to be much simpler, used nv12 rather than yuv420p, and none of it helped. I also tried using hwupload_cuda alongside hwdownload, and it does not like that at all. Nor does it like using mjpeg_cuvid.

for i in encoders decoders filters; do echo "$i:"; /usr/lib/jellyfin-ffmpeg/ffmpeg -hide_banner -${i} | egrep -i "npp|cuvid|nvenc|cuda|nvdec"; done
encoders:
 V....D h264_nvenc           NVIDIA NVENC H.264 encoder (codec h264)
 V....D hevc_nvenc           NVIDIA NVENC hevc encoder (codec hevc)
decoders:
 V..... av1_cuvid            Nvidia CUVID AV1 decoder (codec av1)
 V..... h264_cuvid           Nvidia CUVID H264 decoder (codec h264)
 V..... hevc_cuvid           Nvidia CUVID HEVC decoder (codec hevc)
 V..... mjpeg_cuvid          Nvidia CUVID MJPEG decoder (codec mjpeg)
 V..... mpeg1_cuvid          Nvidia CUVID MPEG1VIDEO decoder (codec mpeg1video)
 V..... mpeg2_cuvid          Nvidia CUVID MPEG2VIDEO decoder (codec mpeg2video)
 V..... mpeg4_cuvid          Nvidia CUVID MPEG4 decoder (codec mpeg4)
 V..... vc1_cuvid            Nvidia CUVID VC1 decoder (codec vc1)
 V..... vp8_cuvid            Nvidia CUVID VP8 decoder (codec vp8)
 V..... vp9_cuvid            Nvidia CUVID VP9 decoder (codec vp9)
filters:
 ... chromakey_cuda    V->V       GPU accelerated chromakey filter
 ... hwupload_cuda     V->V       Upload a system memory frame to a CUDA device.
 ... overlay_cuda      VV->V      Overlay one video on top of another using CUDA
 ... scale_cuda        V->V       GPU accelerated video resizer
 ... thumbnail_cuda    V->V       Select the most representative frame in a given sequence of consecutive frames.
 ... tonemap_cuda      V->V       GPU accelerated HDR to SDR tonemapping
 T.. yadif_cuda        V->V       Deinterlace CUDA frames

I wanted to try using npp for scaling, but I think the version of jellyfin-ffmpeg used by the linuxserver/docker-jellyfin image might not be built with npp support, because despite having those libraries on the system, scale_npp was still not listed as an available filter. ¯\_(ツ)_/¯