inspirehep / plotextractor

Extract images and captions from TeX files in a tar archive.
GNU General Public License v2.0

problem with literal `%20` in file path/name #18

Open tsgit opened 7 years ago

tsgit commented 7 years ago

There is a paper on arXiv with a literal %20 in a directory name. This causes problems when the %20 is converted to a space:

/opt/cds-invenio/var/log/bibsched/102/bibsched_task_1028482.log

 specifies in $a a location ('/opt/cds-invenio/var/tmp/oaiharvest_96159_1_20170529040016_material/2017/05/arXiv:1705.09562/arXiv:1705.09562_plots/Neutral%20Impurity-N-type%20(jinst)/fig-1.png') with problems: /opt/cds-invenio/var/tmp/oaiharvest_96159_1_20170529040016_material/2017/05/arXiv:1705.09562/arXiv:1705.09562_plots/Neutral%20Impurity-N-type%20(jinst)/fig-1.png is not a correct url: [Errno 2] No such file or directory: '/opt/cds-invenio/var/tmp/oaiharvest_96159_1_20170529040016_material/2017/05/arXiv:1705.09562/arXiv:1705.09562_plots/Neutral Impurity-N-type (jinst)/fig-1.png'
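The failure mode can be reproduced outside the harvesting pipeline. The following is a minimal sketch, assuming Python 3 (`urllib.parse.unquote`, whereas the legacy code runs on Python 2) and a throwaway temporary directory standing in for the real harvest path. The function name `demo_unquote_breaks_path` is made up for illustration.

```python
import os
import tempfile
from urllib.parse import unquote


def demo_unquote_breaks_path():
    """Create a directory whose name contains a literal %20 and show that
    unquoting the path yields a location that does not exist on disk."""
    with tempfile.TemporaryDirectory() as root:
        subdir = os.path.join(root, "Neutral%20Impurity-N-type%20(jinst)")
        os.mkdir(subdir)
        real_path = os.path.join(subdir, "fig-1.png")
        with open(real_path, "wb") as handle:
            handle.write(b"")  # empty placeholder, not a real image

        # This mirrors what happens when the path is treated as a URL.
        unquoted_path = unquote(real_path)
        return (
            os.path.basename(os.path.dirname(unquoted_path)),
            os.path.exists(real_path),
            os.path.exists(unquoted_path),
        )


dirname, real_exists, unquoted_exists = demo_unquote_breaks_path()
print(dirname, real_exists, unquoted_exists)
```

The unquoted directory name contains spaces where the archive had `%20`, so the lookup fails even though the file was extracted correctly.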

In this case the subdirectory Neutral%20Impurity-N-type%20(jinst) contains a copy of all the files in the top level. This is just bad packaging by the author.

$ tar tvzf 1705.09562.tar.gz
-rw-rw-r-- root/root     12332 2017-05-24 07:43 fig-10.png
-rw-rw-r-- root/root     12816 2017-05-24 07:43 fig-11.png
-rw-rw-r-- root/root     13913 2017-05-24 07:43 fig-12.png
-rw-rw-r-- root/root     14148 2017-05-24 07:43 fig-1.png
-rw-rw-r-- root/root     34974 2017-05-24 07:43 fig-2.png
-rw-rw-r-- root/root     11338 2017-05-24 07:43 fig-3.png
-rw-rw-r-- root/root     11453 2017-05-24 07:43 fig-4.png
-rw-rw-r-- root/root     11940 2017-05-24 07:43 fig-5.png
-rw-rw-r-- root/root    953301 2017-05-24 07:43 fig-6.png
-rw-rw-r-- root/root     50086 2017-05-24 07:43 fig-7.png
-rw-rw-r-- root/root     12802 2017-05-24 07:43 fig-8.png
-rw-rw-r-- root/root     12727 2017-05-24 07:43 fig-9.png
-rw-rw-r-- root/root     12750 2017-05-24 07:43 jinstpub.sty
-rw-rw-r-- root/root     32817 2017-05-24 07:43 main.tex
drwxrwxr-x root/root         0 2017-05-24 07:43 Neutral%20Impurity-N-type%20(jinst)/
-rw-rw-r-- root/root     14148 2017-05-24 07:43 Neutral%20Impurity-N-type%20(jinst)/fig-1.png
-rw-rw-r-- root/root     12332 2017-05-24 07:43 Neutral%20Impurity-N-type%20(jinst)/fig-10.png
-rw-rw-r-- root/root     12816 2017-05-24 07:43 Neutral%20Impurity-N-type%20(jinst)/fig-11.png
-rw-rw-r-- root/root     13913 2017-05-24 07:43 Neutral%20Impurity-N-type%20(jinst)/fig-12.png
-rw-rw-r-- root/root     34974 2017-05-24 07:43 Neutral%20Impurity-N-type%20(jinst)/fig-2.png
-rw-rw-r-- root/root     11338 2017-05-24 07:43 Neutral%20Impurity-N-type%20(jinst)/fig-3.png
-rw-rw-r-- root/root     11453 2017-05-24 07:43 Neutral%20Impurity-N-type%20(jinst)/fig-4.png
-rw-rw-r-- root/root     11940 2017-05-24 07:43 Neutral%20Impurity-N-type%20(jinst)/fig-5.png
-rw-rw-r-- root/root    953301 2017-05-24 07:43 Neutral%20Impurity-N-type%20(jinst)/fig-6.png
-rw-rw-r-- root/root     50086 2017-05-24 07:43 Neutral%20Impurity-N-type%20(jinst)/fig-7.png
-rw-rw-r-- root/root     12802 2017-05-24 07:43 Neutral%20Impurity-N-type%20(jinst)/fig-8.png
-rw-rw-r-- root/root     12727 2017-05-24 07:43 Neutral%20Impurity-N-type%20(jinst)/fig-9.png
-rw-rw-r-- root/root     12750 2017-05-24 07:43 Neutral%20Impurity-N-type%20(jinst)/jinstpub.sty
-rw-rw-r-- root/root     32817 2017-05-24 07:43 Neutral%20Impurity-N-type%20(jinst)/main.tex
-rw-rw-r-- root/root      2684 2017-05-24 07:43 Neutral%20Impurity-N-type%20(jinst)/slashbox.sty
-rw-rw-r-- root/root      2684 2017-05-24 07:43 slashbox.sty

However, in general, directory names and filenames might contain sequences that should not be interpreted as URL escapes.
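One possible workaround on the producing side is to percent-encode the path before handing it to code that will unquote it, so that the round trip restores the original name. This is only a sketch under the assumption that the consumer unquotes exactly once; `to_url_safe_path` is a hypothetical helper, not part of plotextractor, and the example uses Python 3's `urllib.parse`.

```python
from urllib.parse import quote, unquote


def to_url_safe_path(path):
    """Percent-encode a filesystem path so that a later unquote() restores it.

    '/' is kept as the path separator; a literal '%' becomes '%25'.
    """
    return quote(path, safe="/")


names = [
    "Neutral%20Impurity-N-type%20(jinst)/fig-1.png",
    "BGG%20for%20Lie(1)/B2chambers.pdf",
    "100%.png",
]
for name in names:
    encoded = to_url_safe_path(name)
    # The encoded form survives one round of unquoting unchanged.
    assert unquote(encoded) == name
    print(name, "->", encoded)
```

If the consumer unquoted twice, or not at all, this encoding would make things worse, so it only fits where the unquoting behaviour is known.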

tsgit commented 6 years ago

another example:

$ tar tvf 1712.05415.tar 
-rw-rw-r-- root/root     31304 2017-12-14 08:45 B2basic.pdf
-rw-rw-r-- root/root     51146 2017-12-14 08:45 B2chambers.pdf
-rw-rw-r-- root/root     38320 2017-12-14 08:45 B2unitarity.pdf
drwxrwxr-x root/root         0 2017-12-14 08:45 BGG%20for%20Lie(1)/
-rw-rw-r-- root/root      7929 2017-12-14 08:45 BGG%20for%20Lie(1)/main.bbl
-rw-rw-r-- root/root     51146 2017-12-14 08:45 BGG%20for%20Lie(1)/B2chambers.pdf
-rw-rw-r-- root/root     38320 2017-12-14 08:45 BGG%20for%20Lie(1)/B2unitarity.pdf
-rw-rw-r-- root/root     31304 2017-12-14 08:45 BGG%20for%20Lie(1)/B2basic.pdf
-rw-rw-r-- root/root     11618 2017-12-14 08:45 BGG%20for%20Lie(1)/jheppub.sty
-rw-rw-r-- root/root     19446 2017-12-14 08:45 BGG%20for%20Lie(1)/JHEP.bst
-rw-rw-r-- root/root    140166 2017-12-14 08:45 BGG%20for%20Lie(1)/main.tex
-rw-rw-r-- root/root     19446 2017-12-14 08:45 JHEP.bst
-rw-rw-r-- root/root     11618 2017-12-14 08:45 jheppub.sty
-rw-rw-r-- root/root      7929 2017-12-14 08:45 main.bbl
-rw-rw-r-- root/root    140166 2017-12-14 08:45 main.tex

leads to

2017-12-18 05:38:18 -->    Stage 2 failed: ERROR: while elaborating FFT tags: fft '([('a', '/opt/cds-invenio/var/tmp/oaiharvest_96159_1_20171218040005_material/2017/12/arXiv:1712.05415/arXiv:1712.05415_plots/BGG%20for%20Lie(1)/B2chambers.png'), ('t', 'Plot'), ('d', '00000 The $B_2$ (shifted) Weyl chambers, their associated Weyl group element in terms of simple reflections $s_i$, their Bruhat order, the simple roots $\\alpha_i$ and the integral weight lattice. The red lines correspond to singular weights, and delimitate the shifted Weyl chambers. The intersections of gray lines correspond to integral weights.'), ('n', 'BGG%20for%20Lie(1)_B2chambers')], ' ', ' ', '', 23)' specifies in $a a location ('/opt/cds-invenio/var/tmp/oaiharvest_96159_1_20171218040005_material/2017/12/arXiv:1712.05415/arXiv:1712.05415_plots/BGG%20for%20Lie(1)/B2chambers.png') with problems: /opt/cds-invenio/var/tmp/oaiharvest_96159_1_20171218040005_material/2017/12/arXiv:1712.05415/arXiv:1712.05415_plots/BGG%20for%20Lie(1)/B2chambers.png is not a correct url: [Errno 2] No such file or directory: '/opt/cds-invenio/var/tmp/oaiharvest_96159_1_20171218040005_material/2017/12/arXiv:1712.05415/arXiv:1712.05415_plots/BGG for Lie(1)/B2chambers.png'
2017-12-18 05:38:18 --> <record>
  <controlfield tag="001">1643671</controlfield>
  <controlfield tag="005">20171218053818.0</controlfield>
  <datafield tag="035" ind1=" " ind2=" ">
    <subfield code="9">arXiv</subfield>
    <subfield code="a">oai:arXiv.org:1712.05415</subfield>
  </datafield>

tsgit commented 6 years ago

I think the problem is in the legacy Invenio code:

https://github.com/inspirehep/invenio/blob/prod/modules/bibdocfile/lib/bibdocfile.py#L3765-L3767

    try:
        if is_url_a_local_file(url):
            path = urllib2.urlparse.urlsplit(urllib.unquote(url))[2]

Why does a local file need urllib.unquote?
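As an illustration of the alternative, here is a sketch that unquotes only when the location is an actual `file://` URL and uses plain absolute paths verbatim. It is not the Invenio implementation: `local_path_from_location` is a hypothetical name, the rule "starts with `/` means plain path" is an assumption, and the code targets Python 3 rather than the Python 2 of the legacy module.

```python
from urllib.parse import urlsplit, unquote


def local_path_from_location(location):
    """Return the filesystem path for a local location, or None.

    Hypothetical helper: plain absolute paths are returned untouched,
    file:// URLs are percent-decoded, anything else is not local.
    """
    if location.startswith("/"):
        # A plain path is not URL-encoded, so a literal %20 must survive.
        return location
    parts = urlsplit(location)
    if parts.scheme == "file":
        # Only real URLs carry percent-escapes that need decoding.
        return unquote(parts.path)
    return None


print(local_path_from_location("/tmp/BGG%20for%20Lie(1)/B2chambers.png"))
print(local_path_from_location("file:///tmp/BGG%2520for%2520Lie(1)/B2chambers.png"))
print(local_path_from_location("http://example.org/B2chambers.png"))
```

With this split, both the plain path and the properly encoded `file://` URL resolve to the same directory name containing a literal `%20`.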

The lines quoted from bibdocfile.py are part of check_valid_url(url), which is called here:

https://github.com/inspirehep/invenio/blob/prod/modules/bibupload/lib/bibupload.py#L1838-L1843

            if url:
                url = url[0]
                try:
                    check_valid_url(url)
                except StandardError, e: