petermr / openDiagram

Extraction of semantic data from diagrams in scientific and other technical/business documents
Apache License 2.0

search_lib unable to read file #11

Open ayush4921 opened 3 years ago

ayush4921 commented 3 years ago

When reading: file

The following error is observed:

File "C:\Users\Mark\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 340: character maps to <undefined>

petermr commented 3 years ago

Please give the actual file name (I assume it wasn't being read from a URL).

Please give the actual command used.

petermr commented 3 years ago

On my machine (MacOSX) I have a copy of the file:

 /Users/pm286/projects/openVirus/miniproject/disease/1-part/PMC6223252/sections/1_body/0_introduction/4_data_sources/1_p.xml

od -a 1_p.xml gives

0000000    <   ?   x   m   l  sp   v   e   r   s   i   o   n   =   "   1
0000020    .   0   "  sp   e   n   c   o   d   i   n   g   =   "   U   T
0000040    F   -   8   "   ?   >  nl   <   p   >   A   l   l  sp   m   a
0000060    n   u   s   c   r   i   p   t   s  sp   u   s   e   d  sp   i
0000100    n  sp   t   h   i   s  sp   r   e   v   i   e   w  sp   w   e
0000120    r   e  sp   p   u   b   l   i   s   h   e   d  sp   b   e   t
0000140    w   e   e   n  sp   J   a   n   u   a   r   y  sp   1   9   6
0000160    5  sp   a   n   d  sp   D   e   c   e   m   b   e   r  sp   2
0000200    0   1   7   ;  sp   t   h   e   s   e  sp   r   e   p   o   r
0000220    t   s  sp   r   e   l   a   t   e   d  sp   t   o  sp   E   V
0000240    -   A   7   1  sp   i   n   f   e   c   t   i   o   n   s  sp
0000260    w   e   r   e  sp   e   x   t   r   a   c   t   e   d  sp   b
0000300    y  sp   s   e   a   r   c   h   i   n   g  sp   M   e   d   l
0000320    i   n   e  sp   (   N   a   t   i   o   n   a   l  sp   L   i
0000340    b   r   a   r   y  sp   o   f  sp   M   e   d   i   c   i   n
0000360    e   ,  sp   B   e   t   h   e   s   d   a   ,  sp   M   a   r
0000400    y   l   a   n   d   ,  sp   U   S   A   )  sp   a   n   d  sp
0000420    P   u   b   M   e   d  sp   u   s   i   n   g  sp   t   h   e
0000440   sp   p   h   r   a   s   e   s  sp   ?  80  9c  sp  nl  sp   <
0000460    i   t   a   l   i   c   >   e   n   t   e   r   o   v   i   r
0000500    u   s   -   A   7   1   <   /   i   t   a   l   i   c   >  sp
0000520    ?  80  9d  sp   a   n   d  sp   ?  80  9c   m   o   l   e   c
0000540    u   l   a   r  sp   e   p   i   d   e   m   i   o   l   o   g
0000560    y   ?  80  9d  sp   o   r  sp   t   h   e  sp   k   e   y  sp
0000600    w   o   r   d   s  sp   ?  80  9c   p   a   t   h   o   g   e
0000620    n   e   s   i   s   ?  80  9d  sp   o   r  sp   ?  80  9c   v
0000640    a   c   c   i   n   e   .   ?  80  9d  sp   T   h   e  sp   r
0000660    e   s   u   l   t   s  sp   w   e   r   e  sp   l   i   m   i
0000700    t   e   d  sp   t   o  sp   m   a   n   u   s   c   r   i   p
0000720    t   s  sp   a   v   a   i   l   a   b   l   e  sp   i   n  sp
0000740    E   n   g   l   i   s   h   .  nl   <   /   p   >  nl        
0000756
petermr commented 3 years ago

My suspicion is that this file is corrupt. I don't like the ? before 80 9d. Those two bytes are the tail of U+201D (right double quotation mark), which is e2 80 9d in UTF-8.

In which case we should see if others get the same problem with it.
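
One way to check the byte sequences mentioned above, as a minimal sketch: encode the two curly quotes as UTF-8 and then try decoding the same bytes as CP1252. The helper name try_decode is hypothetical and only used for illustration.

```python
# UTF-8 byte sequences for the left and right double quotation marks.
left_utf8 = "\u201c".encode("utf-8")   # U+201C
right_utf8 = "\u201d".encode("utf-8")  # U+201D

def try_decode(data, encoding):
    """Return the decoded text, or a short failure note (hypothetical helper)."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: " + exc.reason

print(left_utf8.hex(), right_utf8.hex())
print(repr(try_decode(left_utf8, "cp1252")))   # decodes, but to mojibake
print(repr(try_decode(right_utf8, "cp1252")))  # 0x9d has no CP1252 mapping
```

If this behaves as expected, the bytes in the dump are ordinary UTF-8 and only the 0x9d byte of the closing quote is rejected by CP1252, which would match the byte named in the reported error.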

petermr commented 3 years ago

This is od -c

pm286macbook:4_data_sources pm286$ od -c 1_p.xml 
0000000    <   ?   x   m   l       v   e   r   s   i   o   n   =   "   1
0000020    .   0   "       e   n   c   o   d   i   n   g   =   "   U   T
0000040    F   -   8   "   ?   >  \n   <   p   >   A   l   l       m   a
0000060    n   u   s   c   r   i   p   t   s       u   s   e   d       i
0000100    n       t   h   i   s       r   e   v   i   e   w       w   e
0000120    r   e       p   u   b   l   i   s   h   e   d       b   e   t
0000140    w   e   e   n       J   a   n   u   a   r   y       1   9   6
0000160    5       a   n   d       D   e   c   e   m   b   e   r       2
0000200    0   1   7   ;       t   h   e   s   e       r   e   p   o   r
0000220    t   s       r   e   l   a   t   e   d       t   o       E   V
0000240    -   A   7   1       i   n   f   e   c   t   i   o   n   s    
0000260    w   e   r   e       e   x   t   r   a   c   t   e   d       b
0000300    y       s   e   a   r   c   h   i   n   g       M   e   d   l
0000320    i   n   e       (   N   a   t   i   o   n   a   l       L   i
0000340    b   r   a   r   y       o   f       M   e   d   i   c   i   n
0000360    e   ,       B   e   t   h   e   s   d   a   ,       M   a   r
0000400    y   l   a   n   d   ,       U   S   A   )       a   n   d    
0000420    P   u   b   M   e   d       u   s   i   n   g       t   h   e
0000440        p   h   r   a   s   e   s       “  **  **      \n       <
0000460    i   t   a   l   i   c   >   e   n   t   e   r   o   v   i   r
0000500    u   s   -   A   7   1   <   /   i   t   a   l   i   c   >    
0000520    ”  **  **       a   n   d       “  **  **   m   o   l   e   c
0000540    u   l   a   r       e   p   i   d   e   m   i   o   l   o   g
0000560    y   ”  **  **       o   r       t   h   e       k   e   y    
0000600    w   o   r   d   s       “  **  **   p   a   t   h   o   g   e
0000620    n   e   s   i   s   ”  **  **       o   r       “  **  **   v
0000640    a   c   c   i   n   e   .   ”  **  **       T   h   e       r
0000660    e   s   u   l   t   s       w   e   r   e       l   i   m   i
0000700    t   e   d       t   o       m   a   n   u   s   c   r   i   p
0000720    t   s       a   v   a   i   l   a   b   l   e       i   n    
0000740    E   n   g   l   i   s   h   .  \n   <   /   p   >  \n        
0000756

It appears to have picked up the double quotes correctly.

petermr commented 3 years ago

It may be that different machines have a different preferred encoding. Here is a helpful article http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html although it suggests that some systems may try to take intelligent action (which may make things worse).
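
A quick way to see which encoding a given machine prefers, as a sketch: when open() is called in text mode without an encoding argument, Python falls back to the locale's preferred encoding (unless UTF-8 mode is enabled), so printing it on each machine should show whether they differ.

```python
import locale
import sys

# The encoding open() uses for text files when none is passed explicitly
# (outside of Python's UTF-8 mode).
preferred = locale.getpreferredencoding(False)

print("locale preferred encoding:", preferred)
# This one is the str<->bytes default and is utf-8 on Python 3;
# it is NOT what open() uses for files.
print("sys.getdefaultencoding():", sys.getdefaultencoding())
```

Running this on both the Windows and the macOS machine would confirm or rule out the preferred-encoding explanation.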

ayush4921 commented 3 years ago

To recreate this error I pasted the contents of the file into a file named demo.xml on my system (Windows 10, Python 3.7.1).

The code:

import os
with open('demo.xml', "r",encoding='utf-8') as f:
    print("read", f.read())

Gave the following output:

read <?xml version="1.0" encoding="UTF-8"?>
<p>All manuscripts used in this review were published between January 1965 and December 2017; these reports related to EV-A71 infections were extracted by searching Medline (National Library of Medicine, Bethesda, Maryland, USA) and PubMed using the phrases " 
 <italic>enterovirus-A71</italic> " and "molecular epidemiology" or the key words "pathogenesis" or "vaccine." The results were limited to manuscripts available in English.
</p>

Whereas the code:

import os
with open('demo.xml', "r") as f:
    print("read", f.read())

gave the following error:

  UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-3-c1e73f28f3e6> in <module>
      1 import os
      2 with open('demo.xml', "r") as f:
----> 3     print("read", f.read())

~\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24 
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 340: character maps to <undefined>
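
The two runs above can be reproduced on any platform with a sketch like the following. It assumes a shortened stand-in for the file contents (SAMPLE) and passes encoding="cp1252" explicitly to imitate the Windows default; read_with is a hypothetical helper.

```python
import os
import tempfile

# Shortened stand-in for demo.xml (an assumption, not the real file).
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<p>the key words \u201cpathogenesis\u201d or \u201cvaccine.\u201d</p>\n"
)

def read_with(path, encoding):
    """Read a text file with an explicit encoding (hypothetical helper)."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.xml")
    # Write the raw UTF-8 bytes so no platform default is involved.
    with open(path, "wb") as f:
        f.write(SAMPLE.encode("utf-8"))

    utf8_text = read_with(path, "utf-8")
    try:
        read_with(path, "cp1252")  # imitates open(path, "r") on this Windows setup
        cp1252_error = None
    except UnicodeDecodeError as exc:
        cp1252_error = exc

print("utf-8 read ok:", utf8_text == SAMPLE)
print("cp1252 read error:", cp1252_error)
```
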
petermr commented 3 years ago

Thanks, that's much clearer now that it's in one place.

Now can you read it with ET.parse, which is where the error is taking place? That doesn't have an encoding argument. It seems clear that your system is defaulting to CP1252 whereas mine uses UTF-8.

ayush4921 commented 3 years ago

I recommend explicitly specifying the encoding in every read call.

petermr commented 3 years ago

Yes, but ET.parse doesn't have one IIRC.

ayush4921 commented 3 years ago

import xml.etree.ElementTree as ET
xmlp = ET.XMLParser(encoding="utf-8")
f = ET.parse(file, parser=xmlp)

ayush4921 commented 3 years ago

You can also call ET.fromstring() on the text read via with open(file, encoding='utf-8').
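
A minimal sketch of the ET.fromstring() route, assuming a shortened stand-in for the file (XML_BYTES) written to a temporary path. It also shows a bytes variant: opening the file in binary mode hands the raw bytes to the parser, which then reads the encoding from the XML declaration, so no platform default is involved.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Shortened stand-in for the file contents (an assumption).
XML_BYTES = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<p>key words \u201cvaccine.\u201d</p>\n"
).encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.xml")
    with open(path, "wb") as f:
        f.write(XML_BYTES)

    # Text route: decode explicitly as UTF-8, then parse the string.
    with open(path, encoding="utf-8") as f:
        root_from_text = ET.fromstring(f.read())

    # Bytes route: let the XML parser honour the declared encoding.
    with open(path, "rb") as f:
        root_from_bytes = ET.fromstring(f.read())

print(root_from_text.text)
print(root_from_bytes.text)
```
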

petermr commented 3 years ago

Thanks, I think the XMLParser will do it. I don't like converting to strings if it can be avoided as it's easy to get corruption. Can you test the XMLParser? I will also put it in the code...

ayush4921 commented 3 years ago
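import xml.etree.ElementTree as ET

# Force the parser to treat the input as UTF-8
xmlp = ET.XMLParser(encoding="utf-8")
tree = ET.parse('demo.xml', parser=xmlp)
root = tree.getroot()
# root.text is the text before the first child element
text = root.text
print(text)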

This gives the output:

All manuscripts used in this review were published between January 1965 and December 2017; these reports related to EV-A71 infections were extracted by searching Medline (National Library of Medicine, Bethesda, Maryland, USA) and PubMed using the phrases “

which is the required output.
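
The printed text stops at the opening quote because an element's text attribute only holds the text before its first child element (here the italic element). If the whole paragraph is wanted, one option is itertext(), sketched below on a shortened stand-in for the paragraph (SAMPLE is an assumption, not the real file).

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the <p> element in demo.xml (an assumption).
SAMPLE = (
    "<p>using the phrases \u201c <italic>enterovirus-A71</italic> \u201d"
    " and \u201cmolecular epidemiology\u201d.</p>"
)

root = ET.fromstring(SAMPLE)

# Text before the first child element only.
leading = root.text
# All text in document order, including child text and tails.
full = "".join(root.itertext())

print(repr(leading))
print(repr(full))
```
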