python / cpython

The Python programming language
https://www.python.org

tarfile cannot unzip a .tar.gz that contains directories ending with "_". Raises OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect. #116848

Closed JHW5981 closed 7 months ago

JHW5981 commented 7 months ago

Bug report

Bug description:

When I try to unzip a .tar.gz from an arXiv source file, I encounter this problem. I checked manually and debugged it, and found that after tar = tarfile.open(...), a directory name that ends with "_" is changed to end with "*" (for example, `./path/to/war_` becomes `./path/to/war*`), which leads to this problem. Detailed information is shown below.

import requests
import tarfile

doi = "2101.06381" # paper doi
url = "https://arxiv.org/e-print/" + doi # pdf source file url
s = requests.session()
response = s.get(url)
with open(f"./{doi}.tar.gz", 'wb') as fp: # create .tar.gz
    fp.write(response.content)

tar = tarfile.open(f"./{doi}.tar.gz") # unzip .tar.gz
tar.extractall("./") # extract to current path
tar.close()

This downloads the .tar.gz file, and the extraction step raises the OSError. The full error is shown below:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Input In [18], in <cell line: 12>()
      9     fp.write(response.content)
     11 tar = tarfile.open(f"./{doi}.tar.gz") # unzip .tar.gz
---> 12 tar.extractall("./") # extract to current path
     13 tar.close()

File d:\softwares\work\anaconda3\lib\tarfile.py:2045, in TarFile.extractall(self, path, members, numeric_owner)
   2043         tarinfo.mode = 0o700
   2044     # Do not set_attrs directories, as we will do that further down
-> 2045     self.extract(tarinfo, path, set_attrs=not tarinfo.isdir(),
   2046                  numeric_owner=numeric_owner)
   2048 # Reverse sort directories.
   2049 directories.sort(key=lambda a: a.name)

File d:\softwares\work\anaconda3\lib\tarfile.py:2086, in TarFile.extract(self, member, path, set_attrs, numeric_owner)
   2083     tarinfo._link_target = os.path.join(path, tarinfo.linkname)
   2085 try:
-> 2086     self._extract_member(tarinfo, os.path.join(path, tarinfo.name),
   2087                          set_attrs=set_attrs,
   2088                          numeric_owner=numeric_owner)
   2089 except OSError as e:
   2090     if self.errorlevel > 0:

File d:\softwares\work\anaconda3\lib\tarfile.py:2161, in TarFile._extract_member(self, tarinfo, targetpath, set_attrs, numeric_owner)
   2159     self.makefile(tarinfo, targetpath)
   2160 elif tarinfo.isdir():
-> 2161     self.makedir(tarinfo, targetpath)
   2162 elif tarinfo.isfifo():
   2163     self.makefifo(tarinfo, targetpath)

File d:\softwares\work\anaconda3\lib\tarfile.py:2190, in TarFile.makedir(self, tarinfo, targetpath)
   2185 """Make a directory called targetpath.
   2186 """
   2187 try:
   2188     # Use a safe mode for the directory, the real mode is set
   2189     # later in _extract_member().
-> 2190     os.mkdir(targetpath, 0o700)
   2191 except FileExistsError:
   2192     pass

OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: '.\\figs\\quality\\wct*'

And if you check this:

for i in tar:
    print(i.name)

the output will be:

DivSwapper.bbl
DivSwapper.out
DivSwapper.synctex
DivSwapper.tex
figs
figs/activation.pdf
figs/quality
figs/quality/activation.jpg
figs/quality/wct*
figs/quality/wct*/h.jpg
figs/quality/wct*/o.jpg
figs/quality/wct*/4.jpg
figs/quality/wct*/2.jpg
figs/quality/wct*/3.jpg
figs/quality/wct*/1.jpg
figs/quality/ITN
figs/quality/ITN/h.jpg
figs/quality/ITN/4.jpg
figs/quality/ITN/2.jpg
figs/quality/ITN/3.jpg
figs/quality/ITN/1.jpg
figs/quality/MTS
figs/quality/MTS/h.jpg
figs/quality/MTS/4.jpg
figs/quality/MTS/2.jpg
figs/quality/MTS/3.jpg
figs/quality/MTS/1.jpg
figs/quality/sa.png
figs/quality/avatarDFP
figs/quality/avatarDFP/h.jpg
figs/quality/avatarDFP/o.jpg
figs/quality/avatarDFP/4.jpg
figs/quality/avatarDFP/2.jpg
figs/quality/avatarDFP/3.jpg
figs/quality/avatarDFP/1.jpg
figs/quality/wctDFP
figs/quality/wctDFP/h.jpg
figs/quality/wctDFP/o.jpg
figs/quality/wctDFP/4.jpg
figs/quality/wctDFP/2.jpg
figs/quality/wctDFP/3.jpg
figs/quality/wctDFP/1.jpg
figs/quality/c.jpg
figs/quality/avatar*
figs/quality/avatar*/h.jpg
figs/quality/avatar*/o.jpg
figs/quality/avatar*/4.jpg
figs/quality/avatar*/2.jpg
figs/quality/avatar*/3.jpg
figs/quality/avatar*/1.jpg
figs/quality/s.jpg
figs/more
figs/more/CNNMRF*
figs/more/CNNMRF*/o.jpg
figs/more/CNNMRF*/4.jpg
figs/more/CNNMRF*/5.jpg
figs/more/CNNMRF*/2.jpg
figs/more/CNNMRF*/s.png
figs/more/CNNMRF*/1.jpg
figs/more/style-swap*
figs/more/style-swap*/o.jpg
figs/more/style-swap*/9.jpg
figs/more/style-swap*/2.jpg
figs/more/style-swap*/s.png
figs/more/style-swap*/3.jpg
figs/more/style-swap*/1.jpg
figs/overview2.pdf
figs/portraits
figs/portraits/1
figs/portraits/1/oo.jpg
figs/portraits/1/DFP.jpg
figs/portraits/1/4.jpg
figs/portraits/1/c.jpg
figs/portraits/1/s.jpg
figs/portraits/1/2.jpg
figs/portraits/2
figs/portraits/2/oo.jpg
figs/portraits/2/16.jpg
figs/portraits/2/DFP.jpg
figs/portraits/2/c.jpg
figs/portraits/2/6.jpg
figs/portraits/2/s.jpg
figs/teaser
figs/teaser/wct+
figs/teaser/wct+/4.jpg
figs/teaser/wct+/cs.jpg
figs/teaser/wct+/2.jpg
figs/teaser/wct+/3.jpg
figs/teaser/wct+/1.jpg
figs/teaser/avatar
figs/teaser/avatar/4.jpg
figs/teaser/avatar/cs.jpg
figs/teaser/avatar/2.jpg
figs/teaser/avatar/3.jpg
figs/teaser/avatar/1.jpg
figs/teaser/CNNMRF
figs/teaser/CNNMRF/4.jpg
figs/teaser/CNNMRF/cs.jpg
figs/teaser/CNNMRF/2.jpg
figs/teaser/CNNMRF/3.jpg
figs/teaser/CNNMRF/1.jpg
figs/teaser/styleswap
figs/teaser/styleswap/4.jpg
figs/teaser/styleswap/5.jpg
figs/teaser/styleswap/cs.jpg
figs/teaser/styleswap/3.jpg
figs/teaser/styleswap/1.jpg
figs/tradeoff
figs/tradeoff/activation.jpg
figs/tradeoff/normal2.jpg
figs/tradeoff/normal1.jpg
figs/tradeoff/sa.png
figs/tradeoff/12.jpg
figs/tradeoff/11.jpg
figs/tradeoff/21.jpg
figs/tradeoff/22.jpg
figs/tradeoff/32.jpg
figs/tradeoff/31.jpg
figs/tradeoff/42.jpg
figs/tradeoff/c.jpg
figs/tradeoff/41.jpg
figs/tradeoff/s.jpg
figs/ops.pdf
figs/challenge
figs/challenge/small1.png
figs/challenge/o.jpg
figs/challenge/big.jpg
figs/challenge/10.jpg
figs/challenge/noise.jpg
figs/challenge/4.jpg
figs/challenge/c.jpg
figs/challenge/cs.jpg
figs/challenge/s.jpg
figs/challenge/2.jpg
figs/challenge/3.jpg
ijcai22.bib
ijcai22.sty
named.bst

As you can see, many paths contain the special character "*", while the original one is "_". I think the tarfile package changes "_" to "*" somehow, but I don't know how. I checked the source code but could not figure it out. I hope someone can help me solve this problem. Thanks in advance!

CPython versions tested on:

3.9

Operating systems tested on:

Windows

ronaldoussoren commented 7 months ago

I've fetched https://arxiv.org/e-print/2101.06381 manually and checked its contents with the command-line tar tool on macOS. It shows the following file listing:

DivSwapper.bbl
DivSwapper.out
DivSwapper.synctex
DivSwapper.tex
figs/
figs/activation.pdf
figs/quality/
figs/quality/activation.jpg
figs/quality/wct*/
figs/quality/wct*/h.jpg
figs/quality/wct*/o.jpg
figs/quality/wct*/4.jpg
figs/quality/wct*/2.jpg
figs/quality/wct*/3.jpg
figs/quality/wct*/1.jpg
...

The archive contains file names that aren't valid on Windows, which explains the error.
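
One way to confirm this before extracting is to scan the member names for characters that Windows rejects. The following is a minimal sketch, not part of tarfile itself: the helper names `find_invalid_members` and `build_sample_archive` are made up for illustration, and the sample archive is built in memory so the snippet runs without downloading anything. It only checks the printable reserved characters, not every Windows naming rule (reserved device names, trailing dots, and so on).

```python
import io
import tarfile

# Printable characters that Windows does not allow inside a file or
# directory name. "/" is left out because tar member names use it as
# the path separator.
WINDOWS_INVALID_CHARS = set('<>:"\\|?*')


def find_invalid_members(tar):
    """Return the member names that contain a character Windows rejects."""
    return [
        member.name
        for member in tar.getmembers()
        if any(ch in WINDOWS_INVALID_CHARS for ch in member.name)
    ]


def build_sample_archive():
    """Build a tiny in-memory .tar.gz with a directory whose name ends in '*'."""
    buf = io.BytesIO()
    data = b"fake image"
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        directory = tarfile.TarInfo("figs/quality/wct*")
        directory.type = tarfile.DIRTYPE
        directory.mode = 0o755
        tar.addfile(directory)

        inside = tarfile.TarInfo("figs/quality/wct*/h.jpg")
        inside.size = len(data)
        tar.addfile(inside, io.BytesIO(data))

        valid = tarfile.TarInfo("figs/quality/c.jpg")
        valid.size = len(data)
        tar.addfile(valid, io.BytesIO(data))
    buf.seek(0)
    return buf


with tarfile.open(fileobj=build_sample_archive(), mode="r:gz") as tar:
    bad_names = find_invalid_members(tar)

print(bad_names)
```

For a real archive, pass the file name to `tarfile.open()` instead of the in-memory buffer. If the list is non-empty, the names are stored that way in the archive and will fail on Windows unless they are renamed during extraction.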

You should be able to work around this limitation by using an extraction filter (see https://docs.python.org/3/library/tarfile.html#extraction-filters). Make sure to call the appropriate builtin filter (see docs) to avoid a number of possible security issues.

JHW5981 commented 7 months ago

Hi, @ronaldoussoren, thank you for your answer. I tried different types of filters, including "fully_trusted," "tar," and "data," but encountered the same OSError as before. However, manually replacing those "*" with "" fixed the problem.

import requests
import tarfile

doi = "2101.06381" # paper doi
url = "https://arxiv.org/e-print/" + doi # pdf source file url
s = requests.session()
response = s.get(url)
with open(f"./{doi}.tar.gz", 'wb') as fp: # create .tar.gz
    fp.write(response.content)

tar = tarfile.open(f"./{doi}.tar.gz") # unzip .tar.gz
for t in tar:
    if "*" in t.name:
        t.name = t.name.replace("*", "")
tar.extractall("./") # extract to current path
tar.close()

I wonder if there exists a more pythonic way.

ronaldoussoren commented 7 months ago

A cleaner way to do this is to use the filter argument for extractall:

import tarfile

def name_filter(member, path):
    # First use the default 'data' filter:
    member = tarfile.data_filter(member, path)

    # Return a copy of the TarInfo with a cleaned up name:
    return member.replace(name=member.name.replace('*', 'STAR'))

doi = "2101.06381" # paper doi
tar = tarfile.open(doi)
tar.extractall("./", filter=name_filter) # extract to current path
tar.close()

JHW5981 commented 7 months ago

Thanks a lot! @ronaldoussoren