JHW5981 closed this issue 7 months ago
I've fetched https://arxiv.org/e-print/2101.06381 manually and checked its contents with the command-line tar tool on macOS. It shows the following file listing:
DivSwapper.bbl
DivSwapper.out
DivSwapper.synctex
DivSwapper.tex
figs/
figs/activation.pdf
figs/quality/
figs/quality/activation.jpg
figs/quality/wct*/
figs/quality/wct*/h.jpg
figs/quality/wct*/o.jpg
figs/quality/wct*/4.jpg
figs/quality/wct*/2.jpg
figs/quality/wct*/3.jpg
figs/quality/wct*/1.jpg
...
The archive contains file names that aren't valid on Windows, which explains the error.
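A quick way to confirm this for any archive is to scan the member names before extracting. The following is only a minimal sketch: the helper names `find_invalid_names` and `build_demo_archive` are made up for illustration, the in-memory archive is a hypothetical stand-in for the arXiv download, and the character set covers just the printable characters Windows rejects in a path component (it does not handle reserved names such as CON or trailing dots and spaces).

```python
import io
import tarfile

# Printable characters that Windows does not allow inside a file or
# directory name. Path separators are deliberately left out because tar
# member names use "/" between components.
WINDOWS_INVALID_CHARS = set('<>:"|?*')


def find_invalid_names(tar):
    """Return the member names that contain a character Windows rejects."""
    return [m.name for m in tar.getmembers()
            if WINDOWS_INVALID_CHARS & set(m.name)]


def build_demo_archive():
    """Build a tiny in-memory .tar.gz (hypothetical stand-in for the real file)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in ("DivSwapper.tex", "figs/quality/wct*/h.jpg"):
            data = b"demo"
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


with tarfile.open(fileobj=build_demo_archive(), mode="r:gz") as tar:
    bad_names = find_invalid_names(tar)

print(bad_names)
```

Nothing is written to disk here, so the check itself should run on any platform, including Windows.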
You should be able to work around this limitation by using an extraction filter (see https://docs.python.org/3/library/tarfile.html#extraction-filters). Make sure to call the appropriate builtin filter (see docs) to avoid a number of possible security issues.
Hi, @ronaldoussoren, thank you for your answer. I tried different types of filters, including "fully_trusted," "tar," and "data," but encountered the same OSError as before. However, manually replacing those "*" with "" fixed the problem.
import requests
import tarfile
doi = "2101.06381"  # paper doi
url = "https://arxiv.org/e-print/" + doi  # pdf source file url
s = requests.session()
response = s.get(url)
with open(f"./{doi}.tar.gz", 'wb') as fp:  # create .tar.gz
    fp.write(response.content)
tar = tarfile.open(f"./{doi}.tar.gz")  # unzip .tar.gz
for t in tar:
    if "*" in t.name:
        t.name = t.name.replace("*", "")
tar.extractall("./")  # extract to current path
tar.close()
I wonder if there exists a more pythonic way.
A cleaner way to do this is to use the filter argument for extractall:
import tarfile
def name_filter(member, path):
    # First use the default 'data' filter:
    member = tarfile.data_filter(member, path)
    # Return a copy of the TarInfo with a cleaned up name:
    return member.replace(name=member.name.replace('*', 'STAR'))
doi = "2101.06381"  # paper doi
tar = tarfile.open(f"./{doi}.tar.gz")  # the archive saved by the download step
tar.extractall("./", filter=name_filter)  # extract to current path
tar.close()
Thanks a lot! @ronaldoussoren
Bug report
Bug description:
When I try to unzip a .tar.gz from an arXiv source file, I encounter this problem. I checked this manually and debugged it, and found that after `tar = tarfile.open(...)`, a directory name that ends with "_" is changed to end with "*", such as `./path/to/war_` to `./path/to/war*`, which leads to this problem. Detailed information is shown below.
Now you may have the .tar.gz file, and the OSError may occur. The full error is shown below:
And if you check this:
the output will be:
As you can see, many paths contain the special character "*" where the original one has "_". I think the tarfile package changes "_" to "*" somehow, but I don't know how. I checked the source code but could not figure it out. I hope someone can help me solve this problem. Thanks in advance!
CPython versions tested on:
3.9
Operating systems tested on:
Windows