Trouble extracting long filenames

Colin-Fredericks commented 1 month ago

Greetings! I'm trying to extract files from a tar that has some very long filenames. Counting directory names, one of them is over 200 characters. It will extract properly from the command line using good old tar -xzf, but using Tarball.extract() seems to truncate the filenames.

Here's my test file: test.tar.gz

Inside that tarball is a file with this name: repository/assets/0ea3b7ce6f5bcee9ec14b8ad63692c09e25b3a16fddc29157014efc3c1be927e___72d2f2f5ee29e3e703ebcc5f6d1895081a8d3ff17623fd7dda3a3729cc6bb02e___compsci_01_v1_Advice_for_Unhappy_Programmers_v3_mstr.txt . The original project has several like that. Sadly I don't have a choice about the filenames in this project, so just shortening them isn't an option.

Here's my code:

  // Ungzip the file.
  let input_buffer = new Uint8Array(await input_file.arrayBuffer());
  let file_data = pako.ungzip(input_buffer, {});
  // Untar the file.
  const tar_entries = Tarball.extract(file_data);
  for(let entry of tar_entries) {
    console.debug(entry.fileName);
  }

Here's the output:

._test_tar 
PaxHeader/test_tar 
test_tar/ 
test_tar/._repository 
test_tar/PaxHeader/repository 
test_tar/repository/ 
test_tar/._test.json 
test_tar/PaxHeader/test.json 
test_tar/test.json 
test_tar/repository/._test2.json 
test_tar/repository/PaxHeader/test2.json 
test_tar/repository/test2.json 
test_tar/repository/._assets 
test_tar/repository/PaxHeader/assets 
test_tar/repository/assets/ 
test_tar/repository/assets/._test3.txt 
test_tar/repository/assets/PaxHeader/test3.txt 
test_tar/repository/assets/test3.txt 
PaxHeader/._0ea3b7ce6f5bcee9ec14b8ad63692c09e25b3a16fddc29157014efc3c1be927e___72d2f2f5ee29e3e703e 
._0ea3b7ce6f5bcee9ec14b8ad63692c09e25b3a16fddc29157014efc3c1be927e___72d2f2f5ee29e3e703ebcc5f6d1895 
PaxHeader/0ea3b7ce6f5bcee9ec14b8ad63692c09e25b3a16fddc29157014efc3c1be927e___72d2f2f5ee29e3e703ebc 
0ea3b7ce6f5bcee9ec14b8ad63692c09e25b3a16fddc29157014efc3c1be927e___72d2f2f5ee29e3e703ebcc5f6d189508

You can see that the filename was trimmed on both ends - the directories were dropped, and then any part of the filename past 100 characters was removed. I can get the directories with entry.fileNamePrefix(), but I don't see a way to get the rest of it.
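
For context, the 100-character cutoff matches the classic USTAR header layout, which stores the file name in a 100-byte field at offset 0 and an optional directory prefix in a 155-byte field at offset 345. Below is a minimal sketch of reading those two fields from a 512-byte header block. It assumes NUL-terminated UTF-8 fields, and the helper names `readField` and `ustarPath` are illustrative, not part of this library:

```typescript
// Offsets and sizes of two fields in a 512-byte USTAR header block.
const NAME_OFFSET = 0;
const NAME_SIZE = 100;
const PREFIX_OFFSET = 345;
const PREFIX_SIZE = 155;

// Decode a NUL-terminated string field from a header block.
function readField(header: Uint8Array, offset: number, size: number): string {
  const field = header.subarray(offset, offset + size);
  const end = field.indexOf(0);
  const bytes = end === -1 ? field : field.subarray(0, end);
  return new TextDecoder("utf-8").decode(bytes);
}

// Hypothetical helper: join prefix and name the way a plain USTAR reader would.
// Anything that did not fit into these two fields is simply not in this block.
function ustarPath(header: Uint8Array): string {
  const name = readField(header, NAME_OFFSET, NAME_SIZE);
  const prefix = readField(header, PREFIX_OFFSET, PREFIX_SIZE);
  return prefix ? prefix + "/" + name : name;
}
```

A reader that only looks at these fields can never return more than 100 bytes of file name, which is why the full name has to come from the pax extended header instead.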

Is there a way to read in the files with the full filenames? (Also, am I going to run into trouble with writing the output too?)

Thanks for your help.

Tested on Firefox 131.0.2 (aarch64) and Chrome 129.0.6668.101

jospete commented 1 month ago

Hi @Colin-Fredericks, yes this is a known issue - the library currently does not support PAX headers. I've been kicking this can down the road for a while now... will look into this in a day or two and see if I can implement support for it.

Do you mind if I pull the test file you provided into the repo to build unit tests around it?

Alternatively, if you don't need this to run exclusively in the browser (i.e. you have access to the node CLI at the time of extraction) you may want to consider using node-tar

Thanks

Colin-Fredericks commented 1 month ago

Thank you for the quick response! You are definitely welcome to use my test file.

I do want to have my tool run entirely client-side in the browser. It might be taking in sensitive files from someone, and this way I don't have to worry at all about transferring or temporarily storing their files - it all stays within their own browser. It also keeps me from having to run any sort of server or API, which cuts down maintenance a little.

Colin-Fredericks commented 1 month ago

Hello again! Just wanted to check in and see whether there's anything I can do that might help.

jospete commented 1 month ago

Hi again, sorry for the delay - the last 7 days have been chaotic in both my work and personal life.

I'm currently researching how best to fit pax headers in with this module's current implementation (without breaking anything). It appears that pax headers are a store of length-prefixed key=value records that will need to be parsed and applied to the existing header fields (found this Stack Overflow post and the corresponding POSIX docs).
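
As a rough illustration of the parsing step: the POSIX pax format stores each extended header record as `<length> <keyword>=<value>\n`, where `<length>` is the decimal byte count of the whole record, including the length digits themselves. The sketch below parses such records from a byte buffer. The function name `parsePaxRecords` is hypothetical and this is not the library's implementation:

```typescript
// Parse pax extended header data into key/value pairs.
// Each record has the form "<length> <keyword>=<value>\n".
function parsePaxRecords(data: Uint8Array): Record<string, string> {
  const decoder = new TextDecoder("utf-8");
  const records: Record<string, string> = {};
  let offset = 0;
  while (offset < data.length) {
    // The length field ends at the first space (0x20).
    let space = offset;
    while (space < data.length && data[space] !== 0x20) space++;
    if (space >= data.length) break; // no further records (e.g. NUL padding)

    const length = parseInt(decoder.decode(data.subarray(offset, space)), 10);
    if (isNaN(length) || length <= 0 || offset + length > data.length) break; // malformed

    // Drop the trailing newline, then split on the first "=".
    const body = decoder.decode(data.subarray(space + 1, offset + length - 1));
    const eq = body.indexOf("=");
    if (eq > 0) records[body.slice(0, eq)] = body.slice(eq + 1);
    offset += length;
  }
  return records;
}

// Usage with a hand-built sample: two records, 12 and 19 bytes long.
const sample = new TextEncoder().encode("12 path=abc\n19 mtime=123456789\n");
console.log(parsePaxRecords(sample)); // path is "abc", mtime is "123456789"
```

A `path` record found this way would override the truncated name stored in the USTAR header that follows.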

I'll try to chew through this one afternoon this week, but finding time for it might be difficult.

You're more than welcome to take a stab at implementing a new PaxTarHeader class and open a PR.

Colin-Fredericks commented 4 weeks ago

No worries, and I hope things become less chaotic for you! Take your time.

I took a look through the PaxTarHeader class and I could maybe come up with a crappy workaround just by making some "getLongFilename" and "setLongFilename" methods, but that doesn't sound like something you'd actually want in your library long-term.

jospete commented 4 weeks ago

Yeah, the problem is that the pax header is in the next sector after the header that declares it. With the 4.x implementation, this is extremely difficult to handle because the parsing logic is scattered across several classes. I'm building a 5.x version in #3 which rebuilds this part of the module.

Will post updates here when I have something working.

jospete commented 3 weeks ago

@Colin-Fredericks Hi again, I've published a new package version 5.0.0 that includes PAX header support. Unfortunately, adding this feature meant that I had to reconfigure how some of the external APIs function.

So instead of this (4.x)

const entries = Tarball.extract(buffer);

The call should be like this (5.x)

const {entries} = await Archive.extract(buffer);

Where

- Archive is the class that replaces the old Tarball class (took the opportunity to make the name suck slightly less)
- the extract() function now returns a promise (de-fragmented sync and async options throughout the module)
- the result of extract() will be a Promise<Archive> result

Please test out this new version - the fileName entry field should return the "correct" name from the pax header now

Colin-Fredericks commented 3 weeks ago

Thank you! After switching to the new APIs, it's now reading the filenames successfully. I very much appreciate your work on this.

Is there anything that I should do differently with the addTextFile() and similar functions? I'm getting some "Archive entry has empty or unreadable filename ... skipping." errors from tar -xzvf when trying to extract an archive with long filenames. It looks like it's adding files with empty strings as filenames.

jospete commented 3 weeks ago

Are you trying to add files to the extracted PAX archive?

I don't have PAX header serialization in place yet (i.e. not implemented in ArchiveWriter class) - it is extract-only at the moment (i.e. ArchiveReader class).

Colin-Fredericks commented 3 weeks ago

Ah, ok. Yeah, I did this kind of thing:

let new_tarball = new Archive();
new_tarball.addTextFile(f.fileName, file_string);

and then gzipped it with pako and built a download link out of it.

jospete commented 3 weeks ago

I'll work on adding PAX write support over the next few days, stay tuned

jospete commented 3 weeks ago

Version 5.1.0 has been released and adds PAX serialization to ArchiveWriter.

Specifically, when file names are given to ArchiveWriter that exceed the default maximum USTAR file name field size, the header will automatically be converted into a PAX header so that the file name does not get truncated.
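
To make the conversion concrete, here is a hedged sketch of how a single pax `path` record could be serialized. The tricky part is that the leading length counts its own digits. The names `needsPaxPath` and `encodePaxRecord` are hypothetical, and the 100-byte threshold is the USTAR name field size; the library's actual condition may differ:

```typescript
const USTAR_NAME_SIZE = 100;

// True when the UTF-8 encoded name does not fit the USTAR name field.
function needsPaxPath(fileName: string): boolean {
  return new TextEncoder().encode(fileName).length > USTAR_NAME_SIZE;
}

// Build one record of the form "<length> <key>=<value>\n", where <length>
// is the byte count of the entire record including the length digits.
function encodePaxRecord(key: string, value: string): Uint8Array {
  const encoder = new TextEncoder();
  const body = encoder.encode(" " + key + "=" + value + "\n");
  let total = body.length + String(body.length).length;
  // Adding the digits can push the total into a longer digit count
  // (e.g. 9 + 1 = 10), so recompute once with the wider prefix.
  if (String(total).length !== String(body.length).length) {
    total = body.length + String(total).length;
  }
  const prefix = encoder.encode(String(total));
  const out = new Uint8Array(total);
  out.set(prefix, 0);
  out.set(body, prefix.length);
  return out;
}

// Usage: a short record for illustration.
console.log(new TextDecoder().decode(encodePaxRecord("path", "abc"))); // "12 path=abc\n"
```

A length prefix that is off by even one byte is the kind of error that makes readers reject the header as malformed.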

As for the rest of the PAX-specific field types, I didn't find it necessary to override their USTAR counterparts during serialization; so these fields have not had an override condition implemented for them.

If you find a legitimate case for one of the other PAX fields to be included in the serialization step (e.g. modification time, group name/id, file path prefix, or user name/id), please provide an example and I'll work on getting those in.

Colin-Fredericks commented 3 weeks ago

Thank you! Again, I really appreciate all the work you're putting in here. Do you have a ko-fi account or anything?

Still having one problem, but I think it's almost there. Either that or it's working fine and I'm doing something wrong.

I made a really simple script to see if I could basically duplicate the test archive. It starts with a file from a standard form input:

const input_file = input_file_element.files[0];
const input_buffer = new Uint8Array(await input_file.arrayBuffer());
const file_data = pako.ungzip(input_buffer, {});
const tar_content = await Archive.extract(file_data);

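// Rebuild the archive entry by entry.
let new_tarball = new Archive();
for(const f of tar_content.entries){
  // Treat zero-length entries as directories (note: this also matches empty files).
  if(f.fileSize === 0){
    new_tarball.addDirectory(f.fileName);
  }else{
    new_tarball.addBinaryFile(f.fileName, f.toUint8Array());
  }
}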

for(const g of new_tarball.entries){
  console.debug(g.fileName);
}

// Re-gzip the file
let tarball_uint8 = new_tarball.toUint8Array();
let gzip_blob = new Blob([pako.gzip(tarball_uint8)], {
  type: "application/gzip",
});

...and then ships it off to a download link. The console for this reads:

._test_tar
test_tar/
test_tar/._repository
test_tar/repository/
test_tar/._test.json
test_tar/test.json
test_tar/repository/._test2.json
test_tar/repository/test2.json
test_tar/repository/._assets
test_tar/repository/assets/
test_tar/repository/assets/._test3.txt
test_tar/repository/assets/test3.txt
test_tar/repository/assets/._0ea3b7ce6f5bcee9ec14b8ad63692c09e25b3a16fddc29157014efc3c1be927e___72d2f2f5ee29e3e703ebcc5f6d1895081a8d3ff17623fd7dda3a3729cc6bb02e___compsci_01_v1_Advice_for_Unhappy_Programmers_v3_mstr.txt
test_tar/repository/assets/0ea3b7ce6f5bcee9ec14b8ad63692c09e25b3a16fddc29157014efc3c1be927e___72d2f2f5ee29e3e703ebcc5f6d1895081a8d3ff17623fd7dda3a3729cc6bb02e___compsci_01_v1_Advice_for_Unhappy_Programmers_v3_mstr.txt

So the filenames are getting into the archive properly now! The download link also comes through fine.

Unfortunately it didn't unpack properly with tar on the command line or with just double-clicking. I ran tar -tvf to get more info:

~/Downloads $ tar -tvf test_course.tgz
drwxrwxrwx  0 0      0           0 Nov  7 15:17 test_tar/
drwxrwxrwx  0 0      0           0 Nov  7 15:17 test_tar/repository/
-rwxrwxrwx  0 0      0        2048 Nov  7 15:17 test_tar/test.json
-rwxrwxrwx  0 0      0        2048 Nov  7 15:17 test_tar/repository/test2.json
drwxrwxrwx  0 0      0           0 Nov  7 15:17 test_tar/repository/assets/
-rwxrwxrwx  0 0      0        2048 Nov  7 15:17 test_tar/repository/assets/test3.txt
tar: Ignoring malformed pax extended attributes
-rw-r--r--  0 cfredericks staff      20 Oct 16 21:37 test_tar/repository/assets/0ea3b7ce6f5bcee9ec14b8ad63692c09e25b3a16fddc29157014efc3c1be927e___72d2f2f5ee29e3e703ebcc5f6d189508
tar: Error exit delayed from previous errors.

In the middle of that is: "tar: Ignoring malformed pax extended attributes". It looks like the last filename includes the folders, but not the end of the filename itself - the filename itself is still 100 characters.

If there's anything I can do to get more useful info, let me know.

jospete commented 3 weeks ago

In this line here

new_tarball.addBinaryFile(f.fileName, f.toUint8Array());

f.toUint8Array() serializes the whole entry - including headers. This might have an unfortunate side effect of inserting headers where a content block is expected on the tar CLI, in turn causing a parse error.

Can you try changing that line to this and see if it works any better?

new_tarball.addBinaryFile(f.fileName, f.content);
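
The 2048-byte sizes that tar -tvf reported for the small files are consistent with whole serialized entries being stored as content. Tar data is laid out in 512-byte blocks, so a tiny file preceded by a pax header takes four blocks. The arithmetic below is a sketch under that assumption: the 200-byte pax record size is made up for illustration, and whether the library serializes entries in exactly this order has not been confirmed here:

```typescript
const BLOCK_SIZE = 512;

// Round a byte count up to the next multiple of the tar block size.
function paddedSize(byteLength: number): number {
  return Math.ceil(byteLength / BLOCK_SIZE) * BLOCK_SIZE;
}

// Size of a fully serialized entry: an optional pax header block plus its
// padded records, then the USTAR header block, then the padded file content.
function serializedEntrySize(contentLength: number, paxRecordsLength: number): number {
  const paxPart = paxRecordsLength > 0 ? BLOCK_SIZE + paddedSize(paxRecordsLength) : 0;
  return paxPart + BLOCK_SIZE + paddedSize(contentLength);
}

console.log(serializedEntrySize(18, 0));   // 1024
console.log(serializedEntrySize(18, 200)); // 2048
```

Passing only the raw content should bring the reported size back down to the file's real length.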

Alternatively, for a more wholesale solution, if you just want to un-tar and re-tar the file, you can also do this:

const archive = await Archive.extract(file_data);
const tarball_uint8 = archive.toUint8Array();

If you don't mind me asking, why exactly do you need to un-tar the file just to re-tar it?

Colin-Fredericks commented 3 weeks ago

I switched to

new_tarball.addBinaryFile(f.fileName, f.content as Uint8Array);

with the "as" part to make TypeScript happy. Still not working, but at least the error message has changed:

~/Downloads $ tar -tvf test_course.tgz
drwxrwxrwx  0 0      0           0 Nov  7 21:58 test_tar/
drwxrwxrwx  0 0      0           0 Nov  7 21:58 test_tar/repository/
-rwxrwxrwx  0 0      0          18 Nov  7 21:58 test_tar/test.json
-rwxrwxrwx  0 0      0          18 Nov  7 21:58 test_tar/repository/test2.json
drwxrwxrwx  0 0      0           0 Nov  7 21:58 test_tar/repository/assets/
-rwxrwxrwx  0 0      0          18 Nov  7 21:58 test_tar/repository/assets/test3.txt
tar: Ignoring malformed pax extended attributes
tar: Archive entry has empty or unreadable filename ... skipping.
tar: Ignoring malformed pax extended attribute
tar: Archive entry has empty or unreadable filename ... skipping.
tar: Error exit delayed from previous errors.

I did get the exact same output from the console.debug(g.fileName) call as before, if that matters.

> If you don't mind me asking, why exactly do you need to un-tar the file just to re-tar it?

I don't, it was just the easiest way to test to see if the filenames were working.

I'm actually working with online courses, making changes in the course structure that aren't possible within the platform itself and then reuploading them. The work I'm doing involves partially duplicating a tarball but with a few (or a lot of) specific files changed, like swapping every video from "not downloadable" to "downloadable".

jospete commented 2 weeks ago

Quick status update - switched over to my MacBook so I could actually run the tar command and get a reproduction. I'm currently trying to hunt down where the error(s) are coming from in #6.

My current strategy is to compare the hex of test.tar with the output of ./scripts/test-unpack-repack.ts. One issue I've found is a misalignment of the PaxHeader declaration, so that's been fixed... but there is still some issue I can't quite see.
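
For anyone who would rather script the comparison than scan a hex editor by eye, here is a small sketch that reports the first byte at which two buffers diverge, along with the 512-byte tar block it falls in. The names `firstDifference` and `hexRow` are illustrative, and loading the two files into `Uint8Array`s is left to the caller:

```typescript
const BLOCK_SIZE = 512;

interface BlockDiff {
  blockIndex: number; // which 512-byte tar block contains the difference
  byteOffset: number; // absolute offset of the first differing byte
}

// Find the first position where the two buffers differ, or null if identical.
function firstDifference(a: Uint8Array, b: Uint8Array): BlockDiff | null {
  const limit = Math.min(a.length, b.length);
  for (let i = 0; i < limit; i++) {
    if (a[i] !== b[i]) {
      return { blockIndex: Math.floor(i / BLOCK_SIZE), byteOffset: i };
    }
  }
  // Same prefix but different lengths: the shorter buffer ends at `limit`.
  if (a.length !== b.length) {
    return { blockIndex: Math.floor(limit / BLOCK_SIZE), byteOffset: limit };
  }
  return null;
}

// Render `count` bytes starting at `offset` as space-separated hex pairs.
function hexRow(data: Uint8Array, offset: number, count: number = 16): string {
  const parts: string[] = [];
  const end = Math.min(data.length, offset + count);
  for (let i = offset; i < end; i++) {
    parts.push(("0" + data[i].toString(16)).slice(-2));
  }
  return parts.join(" ");
}
```

Printing `hexRow` for both buffers at the reported offset shows the mismatching bytes side by side, and the block index indicates whether the difference sits in a header block or a content block.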

If you're up for helping out, try cloning this repo and running the following (make sure you have Node 20.x available on the command line):

cd ./obsidize-tar-browserify
git checkout fix/pax-write-malformed-headers
npm install
npm run build
npm run test:unpack:repack

That should spit out a tar file at tmp/test/pax-unpack-repack/unpack-repack-sample.tar

I've been comparing that file against the one in dev-assets/pax-tgz-sample/packed/test.tar using the Hex Editor plugin for vscode.

The objective is to find exactly what/where there is breakage between the original file you provided and the one that's generated.

Colin-Fredericks commented 2 weeks ago

No problem.

Console output:

```zsh
GitHub/obsidize-tar-browserify % git checkout fix/pax-write-malformed-headers
branch 'fix/pax-write-malformed-headers' set up to track 'origin/fix/pax-write-malformed-headers'.
Switched to a new branch 'fix/pax-write-malformed-headers'
GitHub/obsidize-tar-browserify % npm install
npm warn deprecated abab@2.0.6: Use your platform's native atob() and btoa() methods instead
npm warn deprecated glob@7.2.3: Glob versions prior to v9 are no longer supported
npm warn deprecated domexception@4.0.0: Use your platform's native DOMException instead
npm warn deprecated readdir-scoped-modules@1.1.0: This functionality has been moved to @npmcli/fs
npm warn deprecated read-package-json@6.0.4: This package is no longer supported. Please use @npmcli/package-json instead.
npm warn deprecated npmlog@5.0.1: This package is no longer supported.
npm warn deprecated gauge@3.0.2: This package is no longer supported.
npm warn deprecated debuglog@1.0.1: Package no longer supported. Contact Support at https://www.npmjs.com/support for more info.
npm warn deprecated are-we-there-yet@2.0.0: This package is no longer supported.
npm warn deprecated @npmcli/move-file@1.1.2: This functionality has been moved to @npmcli/fs
npm warn deprecated rimraf@3.0.2: Rimraf versions prior to v4 are no longer supported
npm warn deprecated npmlog@6.0.2: This package is no longer supported.
npm warn deprecated glob@8.1.0: Glob versions prior to v9 are no longer supported
npm warn deprecated gauge@4.0.4: This package is no longer supported.
npm warn deprecated are-we-there-yet@3.0.1: This package is no longer supported.
npm warn deprecated @npmcli/move-file@2.0.1: This functionality has been moved to @npmcli/fs
npm warn deprecated rimraf@3.0.2: Rimraf versions prior to v4 are no longer supported
npm warn deprecated rimraf@3.0.2: Rimraf versions prior to v4 are no longer supported
npm warn deprecated glob@8.1.0: Glob versions prior to v9 are no longer supported
npm warn deprecated rimraf@3.0.2: Rimraf versions prior to v4 are no longer supported
npm warn deprecated npmlog@6.0.2: This package is no longer supported.
npm warn deprecated gauge@4.0.4: This package is no longer supported.
npm warn deprecated are-we-there-yet@3.0.1: This package is no longer supported.
npm warn deprecated @npmcli/move-file@2.0.1: This functionality has been moved to @npmcli/fs
npm warn deprecated rimraf@3.0.2: Rimraf versions prior to v4 are no longer supported
npm warn deprecated rimraf@3.0.2: Rimraf versions prior to v4 are no longer supported
npm warn deprecated glob@8.1.0: Glob versions prior to v9 are no longer supported
npm warn deprecated rimraf@3.0.2: Rimraf versions prior to v4 are no longer supported
npm warn deprecated rimraf@3.0.2: Rimraf versions prior to v4 are no longer supported

added 1473 packages, and audited 1474 packages in 6s

174 packages are looking for funding
  run `npm fund` for details

4 vulnerabilities (1 moderate, 2 high, 1 critical)

To address issues that do not require attention, run:
  npm audit fix

To address all issues, run:
  npm audit fix --force

Run `npm audit` for details.
GitHub/obsidize-tar-browserify % npm run build

> @obsidize/tar-browserify@5.1.0 build
> run-s build:clean build:tsc build:webpack copy:assets

> @obsidize/tar-browserify@5.1.0 build:clean
> rimraf ./dist

> @obsidize/tar-browserify@5.1.0 build:tsc
> tsc

> @obsidize/tar-browserify@5.1.0 build:webpack
> webpack --config webpack.config.js

assets by path header/*.ts 13 KiB
  asset header/tar-header.d.ts 4.6 KiB [compared for emit]
  + 6 assets
assets by path common/*.ts 4.86 KiB
  asset common/async-uint8-array-iterator.d.ts 2.03 KiB [compared for emit]
  + 4 assets
assets by path archive/*.ts 5.19 KiB
  asset archive/archive-writer.d.ts 2.23 KiB [compared for emit]
  + 2 assets
assets by path pax/*.ts 8.79 KiB
  asset pax/pax-tar-header-key.d.ts 5.53 KiB [compared for emit]
  asset pax/pax-tar-header.d.ts 3.25 KiB [compared for emit]
asset es5.js 26.9 KiB [emitted] [minimized] (name: main)
asset entry/tar-entry.d.ts 5.09 KiB [compared for emit]
asset index.d.ts 1.3 KiB [compared for emit]
orphan modules 76.2 KiB [orphan] 17 modules
runtime modules 670 bytes 3 modules
./src/index.ts + 17 modules 77.2 KiB [not cacheable] [built] [code generated]
webpack 5.75.0 compiled successfully in 1412 ms

> @obsidize/tar-browserify@5.1.0 copy:assets
> run-p copy:package copy:readme

> @obsidize/tar-browserify@5.1.0 copy:readme
> cpy ./README.md ./dist/

> @obsidize/tar-browserify@5.1.0 copy:package
> cpy ./package.json ./dist/

GitHub/obsidize-tar-browserify % npm run test:unpack:repack

> @obsidize/tar-browserify@5.1.0 test:unpack:repack
> tsx ./scripts/test-unpack-repack.ts

reconstructed > ._test_tar
reconstructed > test_tar/
reconstructed > test_tar/._repository
reconstructed > test_tar/repository/
reconstructed > test_tar/._test.json
reconstructed > test_tar/test.json
reconstructed > test_tar/repository/._test2.json
reconstructed > test_tar/repository/test2.json
reconstructed > test_tar/repository/._assets
reconstructed > test_tar/repository/assets/
reconstructed > test_tar/repository/assets/._test3.txt
reconstructed > test_tar/repository/assets/test3.txt
reconstructed > test_tar/repository/assets/._0ea3b7ce6f5bcee9ec14b8ad63692c09e25b3a16fddc29157014efc3c1be927e___72d2f2f5ee29e3e703ebcc5f6d1895081a8d3ff17623fd7dda3a3729cc6bb02e___compsci_01_v1_Advice_for_Unhappy_Programmers_v3_mstr.txt
reconstructed > test_tar/repository/assets/0ea3b7ce6f5bcee9ec14b8ad63692c09e25b3a16fddc29157014efc3c1be927e___72d2f2f5ee29e3e703ebcc5f6d1895081a8d3ff17623fd7dda3a3729cc6bb02e___compsci_01_v1_Advice_for_Unhappy_Programmers_v3_mstr.txt
> tar -tvf ./tmp/test/pax-unpack-repack/unpack-repack-sample.tar.gz
drwxrwxrwx  0 0      0           0 Nov 12 12:53 test_tar/
drwxrwxrwx  0 0      0           0 Nov 12 12:53 test_tar/repository/
-rwxrwxrwx  0 0      0          18 Nov 12 12:53 test_tar/test.json
-rwxrwxrwx  0 0      0          18 Nov 12 12:53 test_tar/repository/test2.json
drwxrwxrwx  0 0      0           0 Nov 12 12:53 test_tar/repository/assets/
-rwxrwxrwx  0 0      0          18 Nov 12 12:53 test_tar/repository/assets/test3.txt
tar: Ignoring malformed pax extended attribute
-rwxrwxrwx  0 0      0          20 Nov 12 12:53 test_tar/repository/assets/0ea3b7ce6f5bcee9ec14b8ad63692c09e25b3a16fddc29157014efc3c1be927e___72d2f2
tar: Error exit delayed from previous errors.
Error: Command failed: tar -tvf ./tmp/test/pax-unpack-repack/unpack-repack-sample.tar.gz
    at __node_internal_genericNodeError (node:internal/errors:866:15)
    at checkExecSyncError (node:child_process:890:11)
    at execSync (node:child_process:962:15)
    at main (/Users/colinfredericks/Documents/GitHub/obsidize-tar-browserify/scripts/test-unpack-repack.ts:46:2) {
  status: 1,
  signal: null,
  output: [ null, null, null ],
  pid: 2410,
  stdout: null,
  stderr: null
}
GitHub/obsidize-tar-browserify %
```

This is on Node v22.9.0, npm v10.9.0

GitHub won't support attaching .tar files directly, so I zipped it up and attached that version: unpack-repack-sample.tar.zip.

jospete / obsidize-tar-browserify

Trouble extracting long filenames #1