mindeng / nom-exif

Exif/metadata parsing library written in pure Rust, both image (jpeg/heif/heic/jpg/tiff/raf etc.) and video/audio (mov/mp4/3gp/webm/mkv/mka, etc.) files are supported.
https://crates.io/crates/nom-exif
MIT License
47 stars 7 forks source link

fix 32-bit builds (plus some fun cleanup) #10

Closed onkoe closed 1 month ago

onkoe commented 1 month ago

Hey there! I'm using this crate in my Android app, but the build kept failing for 32-bit targets. This PR fixes it and includes some potential style/perf improvements for your troubles.

Changes

There is a slight API change: I exported nom_exif::values::Rational in the public API since it was already being returned as part of EntryValue. EntryValue exposes those as part of EntryValue::as_urational and EntryValue::as_irational (and the array methods).

I also added keywords, categories, and rust-version (MSRV) to the Cargo.toml file. These aren't necessary, but can greatly improve the accessibility of the crate.

Testing

I checked this PR using the following methods:

Everything checked out, including semantic versioning. Please let me know if anything looks questionable - I'm happy to fix it.

Once again, thanks for the excellent crate! :)

mindeng commented 1 month ago

Thank you for the PR! Looks great!

However, the v2 version is about to be released. This version has a lot of new additions and changes. Merging this PR now will cause a lot of conflicts, which may be difficult to resolve.

Could you please take a look at the v2 branch? If possible, I hope this PR can be directly integrated into the v2 version, thank you!

mindeng commented 1 month ago

Hi @onkoe

Version 2.0.0 has been released! You can pull directly from the main branch. Thanks!

onkoe commented 1 month ago

Sounds great! I'll work on this in the coming days!

onkoe commented 1 month ago

@mindeng Hi again! I've been working on this in my spare time, but have noticed some potential problems with usize = u32 targets.

First, nom is somewhat flawed on these platforms. It isn't possible to index a slice past 4 GiB, and nom uses a ToUsize trait that isn't portable at all! (see: https://github.com/rust-bakery/nom/issues/1029)

In reality, you just use nom::streaming for this kind of requirement. Which just leaves one question: is the header.box_size always expected to be under 4 GiB? header.body_size() certainly isn't in my view, so it might be nice to store it in some custom type with methods to help ensure safety.

I'm still downloading ISO/IEC 14496-12:2022 to get a better understanding, so please forgive any confusion on my part. ;D

mindeng commented 1 month ago

@mindeng Hi again! I've been working on this in my spare time, but have noticed some potential problems with usize = u32 targets.

First, nom is somewhat flawed on these platforms. It isn't possible to index a slice past 4 GiB, and nom uses a ToUsize trait that isn't portable at all! (see: rust-bakery/nom#1029)

In reality, you just use nom::streaming for this kind of requirement. Which just leaves one question: is the header.box_size always expected to be under 4 GiB? header.body_size() certainly isn't in my view, so it might be nice to store it in some custom type with methods to help ensure safety.

I'm still downloading ISO/IEC 14496-12:2022 to get a better understanding, so please forgive any confusion on my part. ;D

I think the 4GB cap shouldn't be an issue. When nom-exif traverses boxes to find Exif/metadata, it does not actually read each box, but only parses the header part of the box (usually only 8 bytes). Only when a box that stores Exif/metadata related information is found, the data of the box will actually be read and parsed.

In a video file, the real size is the track data, not the metadata. Therefore, under normal circumstances, nom-exif is unlikely to read data that is too large at one time.

If an abnormal situation occurs (such as file corruption leading to data abnormality, etc.), nom-exif also sets limits. If it exceeds 2GB, it will be considered abnormal and will refuse to read the box (refer to this constant definition: MAX_BODY_LEN).

In addition, in a 32-bit system, since the maximum upper limit of the memory address space is 4GB, plus the kernel overhead, the upper limit of memory left for applications is generally 2GB at most, not to mention that there will be more strict memory restrictions to applications on Android. Therefore, I think this shouldn't be an issue.

onkoe commented 1 month ago

I see! Thanks to your comment and a more in-depth understanding of the standard, I learned the basis for the cap. However, in this case, do you think it would be preferable to use u32 for all these types?

This would prevent all casts, and u32 is always ToUsize on supported targets, removing the errors. Particularly since the cap is there, I don't think this should affect anything.

Would you be open to a PR using this instead?

mindeng commented 1 month ago

I see! Thanks to your comment and a more in-depth understanding of the standard, I learned the basis for the cap. However, in this case, do you think it would be preferable to use u32 for all these types?

This would prevent all casts, and u32 is always ToUsize on supported targets, removing the errors. Particularly since the cap is there, I don't think this should affect anything.

Would you be open to a PR using this instead?

The box_size still needs to be retained as u64 for the following reasons:

onkoe commented 1 month ago

Got it! However, nom imposes some requirements on 32-bit targets with streaming. For example, the following error is significant:

error[E0277]: the trait bound `u64: ToUsize` is not satisfied
   --> src/mov.rs:341:21
    |
341 |     let (_, body) = streaming::take(header.body_size())(remain)
    |                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `ToUsize` is not implemented for `u64`
    |
    = help: the following other types implement trait `ToUsize`:
              u16
              u32
              u8
              usize

Before I use my previous changes, I want to ensure that the layout makes sense. Do you think it's true that...

If any of these don't follow, it may be better to consider other options. Two clear options are to either explicitly disallow 32-bit targets, or create an API that can internally read files and get u64-indexable buffers on 32-bit systems.

onkoe commented 1 month ago

Also, if I may ask, is there any reason to keep the deprecated interfaces on 2.0?

If they're not used for anything, please consider removing them in the next major release. I think they add more "entry points" for the library, which might confuse future contributors.

mindeng commented 1 month ago

Got it! However, nom imposes some requirements on 32-bit targets with streaming. For example, the following error is significant:

error[E0277]: the trait bound `u64: ToUsize` is not satisfied
   --> src/mov.rs:341:21
    |
341 |     let (_, body) = streaming::take(header.body_size())(remain)
    |                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `ToUsize` is not implemented for `u64`
    |
    = help: the following other types implement trait `ToUsize`:
              u16
              u32
              u8
              usize

Before I use my previous changes, I want to ensure that the layout makes sense. Do you think it's true that...

  • "A header box's body_size() will usually NOT exceed u32::MAX."
  • "A header box's body_size() doesn't relate to the actual file's size."
  • "BoxHolder does not contain video data."
  • "More generally, no parser requires a slice containing > u32::MAX bytes."

If any of these don't follow, it may be better to consider other options. Two clear options are to either explicitly disallow 32-bit targets, or create an API that can internally read files and get u64-indexable buffers on 32-bit systems.

Reading data that exceeds the size of uszie at once should be considered an error, so I submitted this modification, which should solve the compilation problem.

mindeng commented 1 month ago

Also, if I may ask, is there any reason to keep the deprecated interfaces on 2.0?

If they're not used for anything, please consider removing them in the next major release. I think they add more "entry points" for the library, which might confuse future contributors.

You are right, we should remove these deprecated methods in the next major version. The reason why it has not been deleted at present is mainly to make the upgrade smoother for users of old versions.