Colin-Fredericks opened this issue 1 month ago
Hi @Colin-Fredericks, yes this is a known issue - the library currently does not support PAX headers. I've been kicking this can down the road for a while now... will look into this in a day or two and see if I can implement support for it.
Do you mind if I pull the test file you provided into the repo to build unit tests around it?
Alternatively, if you don't need this to run exclusively in the browser (i.e. you have access to the node CLI at the time of extraction) you may want to consider using node-tar
Thanks
Thank you for the quick response! You are definitely welcome to use my test file.
I do want to have my tool run entirely client-side in the browser. It might be taking in sensitive files from someone, and this way I don't have to worry at all about transferring or temporarily storing their files - it all stays within their own browser. It also keeps me from having to run any sort of server or API, which cuts down maintenance a little.
Hello again! Just wanted to check in and see whether there's anything I can do that might help.
Hi again, sorry for the delay - the last 7 days have been chaotic in both my work and personal life.
I'm currently researching how best to fit Pax Headers in with this module's current implementation (without breaking anything). It appears that pax headers are a tag-length-value store that will need to be parsed and applied to the existing header fields (found this stackoverflow post and the corresponding POSIX docs).
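For reference, each record in a pax extended header follows the POSIX layout `<length> <keyword>=<value>\n`, where `<length>` is the decimal byte count of the entire record, including the length digits themselves. A minimal parsing sketch of that tag-length-value layout (illustrative only, not this module's actual implementation; it also uses string lengths, which match byte lengths only for ASCII values):

```javascript
// Parse a pax extended-header payload of the form:
//   "<length> <keyword>=<value>\n" repeated for each record,
// where <length> is the decimal byte count of the whole record
// (length digits, space, key, "=", value, and trailing newline).
function parsePaxRecords(text) {
  const fields = {};
  let offset = 0;
  while (offset < text.length) {
    const spaceIndex = text.indexOf(' ', offset);
    const recordLength = parseInt(text.slice(offset, spaceIndex), 10);
    // The record body sits between the space and the trailing newline.
    const body = text.slice(spaceIndex + 1, offset + recordLength - 1);
    const eq = body.indexOf('=');
    fields[body.slice(0, eq)] = body.slice(eq + 1);
    offset += recordLength;
  }
  return fields;
}

// Two records: an over-long path and a modification time.
console.log(parsePaxRecords('12 path=foo\n16 mtime=123.45\n'));
// { path: 'foo', mtime: '123.45' }
```

Fields parsed this way (most importantly `path`) would then override the corresponding USTAR header fields of the entry that follows.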
I'll try to chew through this one afternoon this week, but finding time for it might be difficult.
You're more than welcome to take a stab at implementing a new `PaxTarHeader` class and open a PR.
No worries, and I hope things become less chaotic for you! Take your time.
I took a look through the `PaxTarHeader` class and I could maybe come up with a crappy workaround just by making some `getLongFilename` and `setLongFilename` methods, but that doesn't sound like something you'd actually want in your library long-term.
Yeah, the problem is that the pax header is in the next sector after the header that declares it. With the 4.x implementation, this is extremely difficult to do with how the parsing logic is scattered across several classes. I'm building a 5.x version in #3 which rebuilds this part of the module.
Will post updates here when I have something working.
@Colin-Fredericks Hi again, I've published a new package version 5.0.0 that includes PAX header support. Unfortunately, adding this feature meant that I had to reconfigure how some of the external APIs function.
So instead of this (4.x):

```javascript
const entries = Tarball.extract(buffer);
```

The call should now look like this (5.x):

```javascript
const {entries} = await Archive.extract(buffer);
```

Where:

- `Archive` is the class that replaces the old `Tarball` class (took the opportunity to make the name suck slightly less)
- `extract()` now returns a promise (de-fragmented sync and async options throughout the module)
- The `extract()` result will be a `Promise<Archive>`

Please test out this new version - the `fileName` entry field should return the "correct" name from the pax header now.
Thank you! After switching to the new APIs, it's now reading the filenames successfully. I very much appreciate your work on this.
Is there anything that I should do differently with the `addTextFile()` and similar functions? I'm getting some `Archive entry has empty or unreadable filename ... skipping.` errors from `tar -xzvf` on trying to extract an archive with long filenames. It looks like it's adding files with empty strings as filenames.
Are you trying to add files to the extracted PAX archive?
I don't have PAX header serialization in place yet (i.e. not implemented in the `ArchiveWriter` class) - it is extract-only at the moment (i.e. the `ArchiveReader` class).
Ah, ok. Yeah, I did this kind of thing:

```javascript
let new_tarball = new Archive();
new_tarball.addTextFile(f.fileName, file_string);
```

and then gzipped it with `pako` and built a download link out of it.
I'll work on adding PAX write support over the next few days, stay tuned
Version 5.1.0 has been released, adding PAX serialization to the `ArchiveWriter` class.
Specifically, when file names given to `ArchiveWriter` exceed the default maximum USTAR file name field size, the header will automatically be converted into a PAX header so that the file name does not get truncated.
As for the rest of the PAX-specific field types, I didn't find it necessary to override their USTAR counterparts during serialization; so these fields have not had an override condition implemented for them.
If you find a legitimate case for one of the other PAX fields to be included in the serialization step (e.g. modification time, group name/id, file path prefix, or user name/id), please provide an example and I'll work on getting those in.
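For anyone curious, the fiddly part of serializing a pax record is that its length prefix counts its own decimal digits. A standalone sketch of that computation (illustrative only, not the actual `ArchiveWriter` code; string length is used here, which matches byte length only for ASCII values):

```javascript
// Build a single pax extended-header record: "<length> <key>=<value>\n".
// The tricky bit: <length> is the byte count of the WHOLE record,
// including the decimal digits of <length> itself, so we iterate
// until the digit count stabilizes.
function formatPaxRecord(key, value) {
  const body = ` ${key}=${value}\n`;
  let len = body.length + 1; // assume a 1-digit prefix to start
  while (String(len).length + body.length !== len) {
    len = String(len).length + body.length;
  }
  return `${len}${body}`;
}

console.log(JSON.stringify(formatPaxRecord('path', 'foo')));
// "12 path=foo\n"
```

A quick sanity check on any record produced this way: the numeric prefix should always equal the total string length of the record.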
Thank you! Again, I really appreciate all the work you're putting in here. Do you have a ko-fi account or anything?
Still having one problem, but I think it's almost there. Either that or it's working fine and I'm doing something wrong.
I made a really simple script to see if I could basically duplicate the test archive. It starts with a file from a standard form input:

```javascript
const input_file = input_file_element.files[0];
const input_buffer = new Uint8Array(await input_file.arrayBuffer());
const file_data = pako.ungzip(input_buffer, {});
const tar_content = await Archive.extract(file_data);

let new_tarball = new Archive();
for (const f of tar_content.entries) {
  if (f.fileSize === 0) {
    new_tarball.addDirectory(f.fileName);
  } else {
    new_tarball.addBinaryFile(f.fileName, f.toUint8Array());
  }
}

for (const g of new_tarball.entries) {
  console.debug(g.fileName);
}

// Re-gzip the file
let tarball_uint8 = new_tarball.toUint8Array();
let gzip_blob = new Blob([pako.gzip(tarball_uint8)], {
  type: "application/gzip",
});
```
...and then ships it off to a download link. The console for this reads:
```
._test_tar
test_tar/
test_tar/._repository
test_tar/repository/
test_tar/._test.json
test_tar/test.json
test_tar/repository/._test2.json
test_tar/repository/test2.json
test_tar/repository/._assets
test_tar/repository/assets/
test_tar/repository/assets/._test3.txt
test_tar/repository/assets/test3.txt
test_tar/repository/assets/._0ea3b7ce6f5bcee9ec14b8ad63692c09e25b3a16fddc29157014efc3c1be927e___72d2f2f5ee29e3e703ebcc5f6d1895081a8d3ff17623fd7dda3a3729cc6bb02e___compsci_01_v1_Advice_for_Unhappy_Programmers_v3_mstr.txt
test_tar/repository/assets/0ea3b7ce6f5bcee9ec14b8ad63692c09e25b3a16fddc29157014efc3c1be927e___72d2f2f5ee29e3e703ebcc5f6d1895081a8d3ff17623fd7dda3a3729cc6bb02e___compsci_01_v1_Advice_for_Unhappy_Programmers_v3_mstr.txt
```
So the filenames are getting into the archive properly now! The download link also comes through fine.
Unfortunately it didn't unpack properly with `tar` on the command line or with just double-clicking. I ran `tar -tvf` to get more info:
```
~/Downloads $ tar -tvf test_course.tgz
drwxrwxrwx  0 0      0           0 Nov  7 15:17 test_tar/
drwxrwxrwx  0 0      0           0 Nov  7 15:17 test_tar/repository/
-rwxrwxrwx  0 0      0        2048 Nov  7 15:17 test_tar/test.json
-rwxrwxrwx  0 0      0        2048 Nov  7 15:17 test_tar/repository/test2.json
drwxrwxrwx  0 0      0           0 Nov  7 15:17 test_tar/repository/assets/
-rwxrwxrwx  0 0      0        2048 Nov  7 15:17 test_tar/repository/assets/test3.txt
tar: Ignoring malformed pax extended attributes
-rw-r--r--  0 cfredericks staff    20 Oct 16 21:37 test_tar/repository/assets/0ea3b7ce6f5bcee9ec14b8ad63692c09e25b3a16fddc29157014efc3c1be927e___72d2f2f5ee29e3e703ebcc5f6d189508
tar: Error exit delayed from previous errors.
```
In the middle of that is: "tar: Ignoring malformed pax extended attributes". It also looks like the last filename includes the folders but gets cut off at the end - it's still truncated to 100 characters.
If there's anything I can do to get more useful info, let me know.
In this line here:

```javascript
new_tarball.addBinaryFile(f.fileName, f.toUint8Array());
```

`f.toUint8Array()` serializes the whole entry - including headers.
This might have an unfortunate side effect of inserting headers where a content block is expected on the tar CLI, in turn causing a parse error.
Can you try changing that line to this and see if it works any better?

```javascript
new_tarball.addBinaryFile(f.fileName, f.content);
```
Alternatively, for a more "whole-sale" solution, if you just want to un-tar and re-tar the file, you can also do this:

```javascript
const archive = await Archive.extract(file_data);
const tarball_uint8 = archive.toUint8Array();
```
If you don't mind me asking, why exactly do you need to un-tar the file just to re-tar it?
I switched to

```typescript
new_tarball.addBinaryFile(f.fileName, f.content as Uint8Array);
```

with the "as" part to make TypeScript happy. Still not working, but at least the error message has changed:
```
~/Downloads $ tar -tvf test_course.tgz
drwxrwxrwx  0 0      0           0 Nov  7 21:58 test_tar/
drwxrwxrwx  0 0      0           0 Nov  7 21:58 test_tar/repository/
-rwxrwxrwx  0 0      0          18 Nov  7 21:58 test_tar/test.json
-rwxrwxrwx  0 0      0          18 Nov  7 21:58 test_tar/repository/test2.json
drwxrwxrwx  0 0      0           0 Nov  7 21:58 test_tar/repository/assets/
-rwxrwxrwx  0 0      0          18 Nov  7 21:58 test_tar/repository/assets/test3.txt
tar: Ignoring malformed pax extended attributes
tar: Archive entry has empty or unreadable filename ... skipping.
tar: Ignoring malformed pax extended attribute
tar: Archive entry has empty or unreadable filename ... skipping.
tar: Error exit delayed from previous errors.
```
I did get the exact same output from the `console.debug(g.fileName)` call as before, if that matters.
> If you don't mind me asking, why exactly do you need to un-tar the file just to re-tar it?
I don't, it was just the easiest way to test to see if the filenames were working.
I'm actually working with online courses, making changes in the course structure that aren't possible within the platform itself and then reuploading them. The work I'm doing involves partially duplicating a tarball but with a few (or a lot of) specific files changed, like swapping every video from "not downloadable" to "downloadable".
Quick status update - switched over to my macbook so I could actually run the tar command and get a reproduction. I'm currently trying to hunt down where the error(s) are coming from in #6.
My current strategy is to compare the hex of `test.tar` with the output of `./scripts/test-unpack-repack.ts`. One issue I've found is a misalignment of the `PaxHeader` declaration, so that's been fixed... but there is still some issue I can't quite see.
If you're up for helping out, try cloning this repo and running the following (make sure you have node 20.x available on the command line):

```shell
cd ./obsidize-tar-browserify
git checkout fix/pax-write-malformed-headers
npm install
npm run build
npm run test:unpack:repack
```

That should spit out a tar file at `tmp/test/pax-unpack-repack/unpack-repack-sample.tar`.
I've been comparing that file against the one in `dev-assets/pax-tgz-sample/packed/test.tar` using the Hex Editor plugin for vscode.
The objective is to find exactly what/where there is breakage between the original file you provided and the one that's generated.
No problem.
This is on Node v22.9.0, npm v10.9.0
GitHub won't support attaching .tar files directly, so I zipped it up and attached that version: unpack-repack-sample.tar.zip.
Greetings! I'm trying to extract files from a tar that has some very long filenames. Counting directory names, one of them is over 200 characters. It will extract properly from the command line using good old `tar -xzf`, but using `Tarball.extract()` seems to truncate the filenames.

Here's my test file: test.tar.gz

Inside that tarball is a file with this name: `repository/assets/0ea3b7ce6f5bcee9ec14b8ad63692c09e25b3a16fddc29157014efc3c1be927e___72d2f2f5ee29e3e703ebcc5f6d1895081a8d3ff17623fd7dda3a3729cc6bb02e___compsci_01_v1_Advice_for_Unhappy_Programmers_v3_mstr.txt`. The original project has several like that. Sadly I don't have a choice about the filenames in this project, so just shortening them isn't an option.

Here's my code:
Here's the output:
You can see that the filename was trimmed on both ends - the directories were dropped, and then any part of the filename past 100 characters was removed. I can get the directories with `entry.fileNamePrefix()`, but I don't see a way to get the rest of it.

Is there a way to read in the files with the full filenames? (Also, am I going to run into trouble with writing the output too?)
Thanks for your help.
Tested on Firefox 131.0.2 (aarch64) and Chrome 129.0.6668.101