mirage / ocaml-tar

Pure OCaml library to read and write tar files
ISC License
54 stars 34 forks source link

Tar.Header.Header.unmarshal should treat `Normal` entries with trailing slash name as `Directory`. #129

Closed gravicappa closed 9 months ago

gravicappa commented 1 year ago

As written here.

Also, for backward compatibility, tar treats a regular file whose name ends with a slash as a directory.

And I have encountered archives like that.

reynir commented 1 year ago

Thanks, I was not aware.

How do you think we should handle this? There's the Tar module, but also the Tar_unix and Tar_unix_lwt.

For the latter two it makes sense to consider this when extracting. In #127 I'm reworking Tar_unix etc.

For Tar I'm unsure if we should do anything differently. When we unmarshal we could synthesize the link_indicator to be Tar.Header.Link.Directory. I'm not sure we should do this. We could do it in the header reader as we already synthesize other things (long names IIRC). What do you think?

I am curious if you know how to create such an archive, or if you have a (small) sample.

gravicappa commented 1 year ago

You can create such archives with bsdtar c --format v7 ….

I would implement that link_indicator fix in Tar.Header.unmarshal either when hdr_version (not level though) is pre-ustar or unconditionally after synthesizing long names (to avoid being misdirected by trailing slash in cut-off name). In my code I just call

let sanitize_header (header: Tar.Header.t) =
  match header.link_indicator with
  | Normal ->
      if path_is_directory header.file_name then
        { header with link_indicator = Directory }
      else
        header
  | _ -> header

in Tar_unix.Archive.with_next_file called proc and it generally works although I'm dealing with docker image layers archives only at the moment.

reynir commented 1 year ago

I was curious how gnu tar would treat a tar archive with a pax header setting the file name to clearly/a/directory/ followed by a header for an empty file placeholder with normal link indicator. So I created that with ocaml-tar, and interestingly tar treats the contents as a directory clearly/a/directory/:

$ tar -tvf pax-shenanigans.tar 
dr-------- 0/0               0 1970-01-01 01:00 clearly/a/directory/

Here's the sample archive (gzipped so github accepts it) pax-shenanigans.tar.gz

I will make a PR shortly implementing your second proposal.

reynir commented 1 year ago

tar.2.5.1 has been submitted to opam-repository with a fix. Thank you for the report and the discussion on how to fix it.

gravicappa commented 1 year ago

It seems that I was wrong: the fix up for link_indicator should be done in read_header instead: file_name is being updated there.

And I encountered archive (which unfortunately I can't provide) there LongLink contains full name and file_name is cut off right after /.

gravicappa commented 1 year ago

This is how I fixed the issue locally (probably missed a couple of other issues):

diff --git a/lib/tar.ml b/lib/tar.ml
index 0ff1346..16de231 100644
--- a/lib/tar.ml
+++ b/lib/tar.ml
@@ -522,17 +522,7 @@ module Header = struct
         let mod_time = match extended.Extended.mod_time with
           | None -> get_hdr_mod_time c
           | Some x -> x in
-        let link_indicator =
-          let link_indicator = get_hdr_link_indicator c in
-          (* For backward compatibility we treat normal files ending in slash
-             as directories. Because [Link.of_char] treats unrecognized link
-             indicator values as normal files we check directly *)
-          if String.length file_name > 0 && file_name.[String.length file_name - 1] = '/' &&
-             (link_indicator = '0' || link_indicator = '\000') then
-            Link.Directory
-          else
-            Link.of_char ~level link_indicator
-        in
+        let link_indicator = Link.of_char ~level (get_hdr_link_indicator c) in
         let uname = match extended.Extended.uname with
           | None -> if ustar then get_hdr_uname c else ""
           | Some x -> x in
@@ -718,6 +708,17 @@ module HeaderReader(Async: ASYNC)(Reader: READER with type 'a t = 'a Async.t) =
           | None -> return (Error `Eof)
         end in

+    let true_link_indicator link_indicator file_name =
+      (* For backward compatibility we treat normal files ending in slash
+         as directories. Because [Link.of_char] treats unrecognized link
+         indicator values as normal files we check directly *)
+      if String.length file_name > 0
+         && file_name.[String.length file_name - 1] = '/'
+         && link_indicator = Header.Link.Normal then
+        Header.Link.Directory
+      else
+        link_indicator in
+
     let rec read_header (file_name, link_name, hdr) : (Header.t, [`Eof]) result Async.t =
       let raw_link_indicator = Header.get_hdr_link_indicator buffer in
       if (raw_link_indicator = 'K' || raw_link_indicator = 'L') && level = Header.GNU then
@@ -738,7 +739,8 @@ module HeaderReader(Async: ASYNC)(Reader: READER with type 'a t = 'a Async.t) =
       else begin
         let link_name = if link_name = "" then hdr.Header.link_name else link_name in
         let file_name = if file_name = "" then hdr.Header.file_name else file_name in
-        return (Ok {hdr with Header.link_name; file_name })
+        let link_indicator = true_link_indicator hdr.Header.link_indicator file_name in
+        return (Ok {hdr with Header.link_name; file_name; link_indicator })
       end in
     get_hdr ()
     >>= function
hannesm commented 1 year ago

reopening, thanks for your testing and reporting!

reynir commented 1 year ago

Dear @gravicappa, thank you for your report and for sharing your own fix, and sorry for not reacting earlier.

Unfortunately Link.of_char returns Link.Normal if the link indicator is unknown - so maybe a bit too much is treated as directories. Ideally, only '0' and '\000' would receive this treatment. I don't know how often these unknown-to-us link indicators occur in the wild, and I imagine the chances are slim that they end in a slash so your solution is probably acceptable. In #127 I will reconsider returning Link.Normal on "unknown" link indicators.

reynir commented 1 year ago

A fix's released in 2.6.0. I will keep this issue open until I have a fix for the next branch as well. Thanks again for your reports and suggestions!

hannesm commented 9 months ago

Done in #127