mafintosh / tar-stream

tar-stream is a streaming tar parser and generator.
MIT License

"Invalid tar header: unknown format." from valid tar file, IANA tz database #133

Closed kshetline closed 1 year ago

kshetline commented 3 years ago

I'm creating a tool to automatically extract and decode timezone data. It works well except for a couple of odd cases: my macOS command-line and desktop tools can untar these files without a hitch, but tar-stream can't decode them for some reason.

Here are the problem cases:

https://data.iana.org/time-zones/releases/tzdata1997b.tar.gz
https://data.iana.org/time-zones/releases/tzdata1997c.tar.gz

The gunzip step of the process works fine; it's just the untarring that fails.

export async function getByUrlOrVersion(urlOrVersion?: string, displayProgress = false): Promise<TzData> {
  let url: string;
  let requestedVersion: string;

  if (!urlOrVersion)
    url = DEFAULT_URL;
  else if (urlOrVersion.includes(':'))
    url = urlOrVersion;
  else {
    requestedVersion = urlOrVersion;
    url = URL_TEMPLATE_FOR_VERSION.replace('{version}', urlOrVersion);
  }

  const extract = tar.extract();
  const data = await requestBinary(url, { headers: { 'User-Agent': 'curl/7.64.1' }, autoDecompress: true });
  fs.writeFileSync('foo.tar', data); // <-- Dumped gunzipped data to a file to check it. Untars by double-clicking on the file just fine.
  const stream = Readable.from(data);
  const result: TzData = { version: requestedVersion || 'unknown', sources: {} };
  let error: any;

  extract.on('entry', (header, stream, next) => {
    const sourceName = header.name;

    if (!error && (TZ_SOURCE_FILES.has(sourceName) || sourceName === 'version')) {
      let data = '';

      if (displayProgress && sourceName !== 'version')
        console.info(`Extracting ${sourceName}`);

      stream.on('data', chunk => data += chunk.toString());
      stream.on('error', err => error = err);
      stream.on('end', () => {
        if (sourceName === 'version') {
          result.version = data.trim();

          if (displayProgress && result.version)
            console.info(`tz database version ${result.version}`);
        }
        else
          result.sources[sourceName] = data;

        next();
      });
    }
    else
      stream.on('end', next);

    if (displayProgress && !result.version)
      console.info('unknown tz database version');

    stream.resume();
  });

  return new Promise<TzData>((resolve, reject) => {
    stream.pipe(extract);
    extract.on('finish', () => error ? reject(makeError(error)) : resolve(result));
    extract.on('error', err => reject(makeError(err)));
  });
}

I'm not sure what's so special about these particular archives that makes tar-stream choke.

kshetline commented 3 years ago

If I use allowUnknownFormat: true, I get Error: Unexpected end of data instead, but all of the data I care about has been extracted before the error occurs, so at least I've got a workaround for now:

    extract.on('error', err => {
      if (/unexpected end of data/i.test(err.message) && Object.keys(result.sources).length >= 11)
        resolve(result);
      else
        reject(makeError(err));
    });

kshetline commented 3 years ago

I ran into the same issue with: https://data.iana.org/time-zones/releases/tzdata1999f.tar.gz

...except this time the error I get is "Error: Invalid tar header. Maybe the tar is corrupted or it needs to be gunzipped?" I can still extract the data I need before the error occurs.

illright commented 1 year ago

These archives seem to have malformed headers. If you run the command gzip -c -d tzdata1997b.tar.gz > tzdata1997b.tar, you get to the tar file underneath, which you can then investigate with a hex editor.

Here's what these archives look like:

00000000: 6166 7269 6361 0000 0000 0000 0000 0000  africa..........
<binary nonsense>
00000100: 0000 0000 0000 0000 0000 0000 0000 0000  ................

And here's what a good tar archive should look like:

00000000: 6e6f 6465 5f6d 6f64 756c 6573 2f00 0000  node_modules/...
<binary nonsense>
00000100: 0075 7374 6172 0030 3069 6c6c 7269 6768  .ustar.00illrigh

The difference is in the last line: the good archive contains .ustar.00 there. This is the "magic" field (the string ustar at byte offset 257 of the header, followed by a version), which is what programs use to recognize tar archives. It's missing from these IANA archives (for whatever reason), which is why allowUnknownFormat: true is needed to bypass the header check.
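
A minimal sketch of the magic check described above, assuming the standard POSIX ustar layout (magic at byte offset 257 of each 512-byte header block). The names hasUstarMagic, MAGIC_OFFSET and USTAR_MAGIC are hypothetical, and this is not tar-stream's actual implementation:

```typescript
// Assumed layout: the magic field starts at byte 257 of a 512-byte header block.
const MAGIC_OFFSET = 257;
const USTAR_MAGIC = [0x75, 0x73, 0x74, 0x61, 0x72]; // ASCII "ustar"

/** Returns true if the header block starting at `offset` carries the "ustar" magic. */
function hasUstarMagic(block: Uint8Array, offset = 0): boolean {
  if (block.length < offset + 512) return false; // not a full header block
  return USTAR_MAGIC.every((byte, i) => block[offset + MAGIC_OFFSET + i] === byte);
}

// An all-zero magic field, like the IANA dump above, versus a header with the magic set.
const oldStyle = new Uint8Array(512);
const posix = new Uint8Array(512);
posix.set(USTAR_MAGIC, MAGIC_OFFSET);

console.log(hasUstarMagic(oldStyle), hasUstarMagic(posix)); // false true
```

Only the first five bytes are compared, so the check does not depend on what follows the word ustar in the magic and version fields.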

The other issue, the Error: Unexpected end of data, occurs because the archive contains the header for the file yearistype.sh twice: the first time it is followed by the actual script text, but the second time the archive just ends abruptly. All in all, these archives are badly malformed, and tar and similar tools can probably consume them without problems only because they aggressively silence errors and try to make the best of the file. Your hacky solution is probably the best one for these kinds of files.
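
To see this second problem without a hex editor, the headers can be walked by hand. The following is a rough sketch, assuming the classic header layout (name in bytes 0-99, size as octal text in bytes 124-135, data padded to 512-byte blocks). listEntries, readField and writeAscii are hypothetical names, not part of tar-stream, and the sketch ignores extensions such as GNU long names or base-256 sizes. The demo archive is synthetic, not the real IANA data:

```typescript
interface TarEntryInfo {
  name: string;
  size: number;
  truncated: boolean; // true when the archive ends before the entry's data does
}

const BLOCK_SIZE = 512;

// Reads a NUL-terminated ASCII field from a header.
function readField(buf: Uint8Array, start: number, length: number): string {
  let text = "";
  for (let i = start; i < start + length && buf[i] !== 0; i++) {
    text += String.fromCharCode(buf[i]);
  }
  return text;
}

// Walks the headers of an in-memory tar archive and flags entries whose data is cut short.
function listEntries(tar: Uint8Array): TarEntryInfo[] {
  const entries: TarEntryInfo[] = [];
  let offset = 0;
  while (offset + BLOCK_SIZE <= tar.length) {
    const name = readField(tar, offset, 100);
    if (name === "") break; // simplification: an empty name is treated as end of archive
    // The size field holds octal digits, padded with spaces or NULs.
    const size = parseInt(readField(tar, offset + 124, 12).trim() || "0", 8);
    const dataStart = offset + BLOCK_SIZE;
    entries.push({ name, size, truncated: dataStart + size > tar.length });
    offset = dataStart + Math.ceil(size / BLOCK_SIZE) * BLOCK_SIZE;
  }
  return entries;
}

// Writes ASCII text into a buffer; used only to build the synthetic archive below.
function writeAscii(buf: Uint8Array, offset: number, text: string): void {
  for (let i = 0; i < text.length; i++) buf[offset + i] = text.charCodeAt(i);
}

// Synthetic archive: one complete entry, then a header whose data is missing.
const demo = new Uint8Array(BLOCK_SIZE * 3);
writeAscii(demo, 0, "africa");
writeAscii(demo, 124, "00000000005"); // 5 bytes of data in the next block
writeAscii(demo, BLOCK_SIZE * 2, "yearistype.sh");
writeAscii(demo, BLOCK_SIZE * 2 + 124, "00000001750"); // claims 1000 bytes, none follow

for (const entry of listEntries(demo)) {
  console.log(entry.name, entry.size, entry.truncated ? "TRUNCATED" : "ok");
}
// prints:
// africa 5 ok
// yearistype.sh 1000 TRUNCATED
```

A duplicated header would show up here as the same name listed twice, and a truncated flag on the last entry corresponds to the point where a strict parser reports an unexpected end of data.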

mafintosh commented 1 year ago

send a test if still relevant

kshetline commented 1 year ago

Sorry I didn't respond sooner. It doesn't surprise me that the problem files are the real problem. It would be useful, if you care to add such an option, to allow more lenient decoding to get around issues like these.