stumpapp / stump

A free and open source comics, manga and digital book server with OPDS support (WIP)
https://stumpapp.dev
MIT License

✨ Page dimensions analysis task #349

Closed JMicheli closed 3 weeks ago

JMicheli commented 4 weeks ago

This pull request aims to close #180.

I've added a new task type, AnalyzePageDimensions, to the media analysis job created in #307. This task attempts to determine the dimensions of each page in a media item and then stores its findings in the database so that they can be accessed through the server's API.
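
Conceptually, the new task is just one more unit of work dispatched by the analysis job per media item. The enum below is only an illustrative placeholder to show where it slots in, not the actual definition from the codebase:

// Illustrative placeholder - the analysis job's real task type differs,
// but the new variant carries the media item whose pages get measured.
pub enum AnalyzeMediaTask {
  // ...existing analysis tasks from #307...
  /// Determine the dimensions of every page of the given media item
  /// and persist them as a PageDimensions row.
  AnalyzePageDimensions { media_id: String },
}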

The database structure for storing the dimensions data is shown below:

model PageDimensions {
  id String @id @default(cuid())
  dimensions String
  metadata MediaMetadata @relation(fields: [metadata_id], references: [id], onDelete: Cascade)
  metadata_id String @unique

  @@map("page_dimensions")
}

An individual height/width pair is represented in Rust by the PageDimension struct.
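
For reference, PageDimension is essentially just a pair of integers. A minimal sketch (the field names, field order, and integer width are assumptions, not taken from the PR) looks like this:

// Sketch of PageDimension: a single page's height/width pair.
// Clone and PartialEq are assumed here because the (de)serialization
// sketches further below rely on them.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PageDimension {
  pub height: u32,
  pub width: u32,
}

impl PageDimension {
  pub fn new(height: u32, width: u32) -> Self {
    Self { height, width }
  }
}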

One concern raised while planning this addition was that a very long book produces a very long list of dimensions. That is why the dimensions live in a separate table, so they stay optional when loading metadata - but for the same reason, the string encoding a Vec<PageDimension> could still grow quite large.

To avoid database bloat for users with large libraries/long books, I've implemented a simple compression scheme that serializes Vec<PageDimension> as follows:

// Comma and semicolon-separated serialization of height/width pairs.
let list = vec![
  PageDimension::new(800, 600),
  PageDimension::new(1920, 1080),
  PageDimension::new(800, 600),
];
let serialized_list = dimension_vec_to_string(list);
assert_eq!(serialized_list, "800,600;1920,1080;800,600");

// Repeated values are compressed with shorthand.
let list = vec![
  PageDimension::new(800, 600),
  PageDimension::new(800, 600),
  PageDimension::new(800, 600),
];
let serialized_list = dimension_vec_to_string(list);
assert_eq!(serialized_list, "3>800,600");
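
For illustration, a run-length encoder along these lines reproduces the output shown above (a sketch only, not necessarily the PR's exact implementation):

// Collapse runs of identical adjacent pages, then join runs with ';'.
fn dimension_vec_to_string(list: Vec<PageDimension>) -> String {
  let mut iter = list.into_iter();
  let Some(mut current) = iter.next() else {
    return String::new();
  };
  let mut count = 1usize;
  let mut runs: Vec<String> = Vec::new();

  for dim in iter {
    if dim == current {
      count += 1;
    } else {
      runs.push(encode_run(&current, count));
      current = dim;
      count = 1;
    }
  }
  runs.push(encode_run(&current, count));
  runs.join(";")
}

// A run of one page is written as "h,w"; repeated pages as "n>h,w".
fn encode_run(dim: &PageDimension, count: usize) -> String {
  if count > 1 {
    format!("{}>{},{}", count, dim.height, dim.width)
  } else {
    format!("{},{}", dim.height, dim.width)
  }
}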

I think this should be effective, since most books have uniform page sizes (excepting the cover and the occasional two-page spread), so in practice most books won't need to store much data in the database. The deserialization function I wrote is copy-free and fast, so it should not introduce any unwanted overhead.
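
For the decoding side, a parser for this format can walk the string with split and expand each run as it goes. The sketch below (the function name, error type, and the PageDimension sketch it builds on are placeholders, not the PR's actual code) shows the general idea:

// Parse "h,w" and "n>h,w" runs separated by ';' back into a Vec<PageDimension>.
fn dimension_vec_from_string(s: &str) -> Result<Vec<PageDimension>, String> {
  let mut out = Vec::new();

  for run in s.split(';').filter(|r| !r.is_empty()) {
    // An optional "count>" prefix marks a repeated run; default to one.
    let (count, pair) = match run.split_once('>') {
      Some((n, rest)) => (n.parse::<usize>().map_err(|e| e.to_string())?, rest),
      None => (1, run),
    };
    let (h, w) = pair
      .split_once(',')
      .ok_or_else(|| format!("malformed pair: {pair}"))?;
    let dim = PageDimension::new(
      h.parse::<u32>().map_err(|e| e.to_string())?,
      w.parse::<u32>().map_err(|e| e.to_string())?,
    );
    out.extend(std::iter::repeat(dim).take(count));
  }

  Ok(out)
}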

Finally, the API matches the first option discussed on the issue:

curl -X GET http://stump.cloud/api/v1/media/:id/page/:page/dimensions gives the dimensions for a single page.

curl -X GET http://stump.cloud/api/v1/media/:id/dimensions gives a list of dimensions for each page (0-indexed).

aaronleopold commented 4 weeks ago

@JMicheli The compression scheme you outlined is simple and smart! I'm excited to give this a review. I'll try to make time after work sometime this week, but it might land on the weekend. As always, thanks for your time and contributions - I really appreciate it!

JMicheli commented 3 weeks ago

I'm working on getting things ready to merge - as is tradition, this part is often the most difficult for me.

aaronleopold commented 3 weeks ago

No rush on my part! Reach out if you need anything, but I would guess all you need to do is update your feature branch with this repo's experimental branch: either directly, if you have a remote set up, or by updating your fork's experimental and then merging that in.

JMicheli commented 3 weeks ago

Alright, I think that should do it - you were right, it just needed a little cargo clean to get things playing nicely again.

aaronleopold commented 3 weeks ago

I fixed a couple of clippy lints and a missing Err, but this should be good to go once the experimental build finishes. I'll merge it shortly, thanks again!