titipata / pubmed_parser

A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License

Error: parse_medline_xml() is unable to parse the file even though the provided path is correct #113

Closed srishti-git1110 closed 2 years ago

srishti-git1110 commented 2 years ago

Hi, I'm using parse_medline_xml() to parse an XML file, and I'm not sure where the error comes from. I read a discussion of a similar issue raised in the past and cross-checked that the file path I'm providing is right, and it seems to be. Is there any other reason the following error could occur?

Edit: I tried parsing a 2017 file using the exact same code and it worked fine. A similar discussion suggests installing the latest version of pubmed_parser, so I did that, but it's still not working. I'm trying to do this for a 2022 file.

Error: it was not able to read a path, a file-like object, or a string as an XML
Traceback (most recent call last):
  File "C:\Users\hp\anaconda3\envs\test\lib\site-packages\pubmed_parser-0.3.1-py3.9.egg\pubmed_parser\utils.py", line 31, in read_xml
    tree = etree.parse(path)
  File "src\lxml\etree.pyx", line 3536, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1876, in lxml.etree._parseDocument
  File "src\lxml\parser.pxi", line 1902, in lxml.etree._parseDocumentFromURL
  File "src\lxml\parser.pxi", line 1805, in lxml.etree._parseDocFromFile
  File "src\lxml\parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile
  File "src\lxml\parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 654, in lxml.etree._raiseParseError
  File "file:/C:/Users/hp/Desktop/scratch/Rudraksh/food-disease-relx/data/baseline_test_sg/pubmed22n0002.xml.gz", line 577522
XMLSyntaxError: Specification mandates value for attribute CitedMed, line 577522, column 33

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\hp\anaconda3\envs\test\lib\site-packages\IPython\core\interactiveshell.py", line 3457, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "C:\Users\hp\AppData\Local\Temp/ipykernel_3596/2675774551.py", line 1, in <module>
    parsed_file = pp.parse_medline_xml(r'C:\Users\hp\Desktop\scratch\Rudraksh\food-disease-relx\data\baseline_test_sg\pubmed22n0002.xml.gz')
  File "C:\Users\hp\anaconda3\envs\test\lib\site-packages\pubmed_parser-0.3.1-py3.9.egg\pubmed_parser\medline_parser.py", line 672, in parse_medline_xml
    tree = read_xml(path)
  File "C:\Users\hp\anaconda3\envs\test\lib\site-packages\pubmed_parser-0.3.1-py3.9.egg\pubmed_parser\utils.py", line 36, in read_xml
    tree = etree.fromstring(path)
  File "src\lxml\etree.pyx", line 3252, in lxml.etree.fromstring
  File "src\lxml\parser.pxi", line 1913, in lxml.etree._parseMemoryDocument
  File "src\lxml\parser.pxi", line 1793, in lxml.etree._parseDoc
  File "src\lxml\parser.pxi", line 1082, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src\lxml\parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 654, in lxml.etree._raiseParseError
  File "<string>", line 1
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
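
For readers following the traceback: the first exception (raised from etree.parse) is the one that describes the file itself, and it points at line 577522. The second exception comes from the fallback call etree.fromstring(path) in read_xml, which is handed the path string rather than XML content. A minimal sketch, using only the standard library, for looking at the reported line without unzipping the file by hand. The helper name show_line is hypothetical and not part of pubmed_parser.

```python
import gzip
from itertools import islice


def show_line(gz_path, line_number, context=1):
    """Return (number, text) pairs around `line_number` (1-based) of a gzip text file."""
    first = max(line_number - context, 1)
    last = line_number + context
    # Text mode; errors="replace" keeps reading past undecodable bytes.
    with gzip.open(gz_path, "rt", encoding="utf-8", errors="replace") as handle:
        # islice is 0-based and end-exclusive, so lines first..last map to first-1..last.
        selected = islice(handle, first - 1, last)
        return [(number, text.rstrip("\n"))
                for number, text in enumerate(selected, start=first)]


# Usage (file name and line number taken from the traceback above):
# for number, text in show_line("pubmed22n0002.xml.gz", 577522):
#     print(number, text)
```

If the line shown there looks cut off or contains stray bytes, that would point to a damaged download rather than to the parser.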

titipata commented 2 years ago

Yes, it seems like the file that you're putting in is not parsable by lxml.

srishti-git1110 commented 2 years ago

Thanks for taking the time to answer. So, are you saying the parser won't work for files from 2022? Or is there some other issue apart from the date of the file? It works just fine for a 2017 file (downloaded from the exact same source) with the same .xml.gz extension.

If the year is the only issue, do you have any idea up to which year/date the parser works?

titipata commented 2 years ago

Oh, if it works for files up to 2017, it might be a problem with the file format. I don't have much time to check the format, but there might be an issue there!

raypereda-gr commented 2 years ago

In the last year, I have used parse_medline_xml() on all of the PubMed XML files without error. In general, I use the .xml.gz file format, but I have tested the .xml file too. I recommend stepping through the code in a debugger while parsing that file to isolate the error.

srishti-git1110 commented 2 years ago

Thanks @raypereda-gr. Yes, I'm also using it with a .xml.gz file. It's a 2022 file.

I tried debugging, but I'm unable to figure out the error. Could you please help?

> c:\users\hp\anaconda3\envs\test\lib\tokenize.py(335)find_cookie()
    333 if filename is not None:
    334 msg = '{} for {!r}'.format(msg, filename)
--> 335 raise SyntaxError(msg)
    336
    337 match = cookie_re.match(line_string)

ERROR! Session/line number was not unique in database. History logging moved to new session 157
ipdb> w

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\async_helpers.py(78)_pseudo_sync_runner()
     76 """
     77 try:
---> 78 coro.send(None)
     79 except StopIteration as exc:
     80 return exc.value

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\interactiveshell.py(3185)run_cell_async()
   3183 interactivity = 'async'
   3184
-> 3185 has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
   3186 interactivity=interactivity, compiler=compiler, result=result)
   3187

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\interactiveshell.py(3396)run_ast_nodes()
   3394 if result:
   3395 result.error_before_exec = sys.exc_info()[1]
-> 3396 self.showtraceback()
   3397 return True
   3398

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\interactiveshell.py(2063)showtraceback()
   2061 # Though this won't be called by syntax errors in the input
   2062 # line, there may be SyntaxError cases with imported code.
-> 2063 self.showsyntaxerror(filename, running_compiled_code)
   2064 elif etype is UsageError:
   2065 self.show_usage_error(value)

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\interactiveshell.py(2129)showsyntaxerror()
   2127 # If the error occurred when executing compiled code, we should provide full stacktrace.
   2128 elist = traceback.extract_tb(last_traceback) if running_compiled_code else []
-> 2129 stb = self.SyntaxTB.structured_traceback(etype, value, elist)
   2130 self._showtraceback(etype, value, stb)
   2131

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(1407)structured_traceback()
   1405 value.text = newtext
   1406 self.last_syntax_error = value
-> 1407 return super(SyntaxTB, self).structured_traceback(etype, value, elist,
   1408 tb_offset=tb_offset, context=context)
   1409

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(629)structured_traceback()
    627 chained_exceptions_tb_offset = 0
    628 out_list = (
--> 629 self.structured_traceback(
    630 etype, evalue, (etb, chained_exc_ids),
    631 chained_exceptions_tb_offset, context)

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(1407)structured_traceback()
   1405 value.text = newtext
   1406 self.last_syntax_error = value
-> 1407 return super(SyntaxTB, self).structured_traceback(etype, value, elist,
   1408 tb_offset=tb_offset, context=context)
   1409

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(629)structured_traceback()
    627 chained_exceptions_tb_offset = 0
    628 out_list = (
--> 629 self.structured_traceback(
    630 etype, evalue, (etb, chained_exc_ids),
    631 chained_exceptions_tb_offset, context)

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(1407)structured_traceback()
   1405 value.text = newtext
   1406 self.last_syntax_error = value
-> 1407 return super(SyntaxTB, self).structured_traceback(etype, value, elist,
   1408 tb_offset=tb_offset, context=context)
   1409

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(629)structured_traceback()
    627 chained_exceptions_tb_offset = 0
    628 out_list = (
--> 629 self.structured_traceback(
    630 etype, evalue, (etb, chained_exc_ids),
    631 chained_exceptions_tb_offset, context)

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(1403)structured_traceback()
   1401 and isinstance(value.lineno, int):
   1402 linecache.checkcache(value.filename)
-> 1403 newtext = linecache.getline(value.filename, value.lineno)
   1404 if newtext:
   1405 value.text = newtext

c:\users\hp\anaconda3\envs\test\lib\linecache.py(30)getline()
     28 Update the cache if it doesn't contain an entry for this file already."""
     29
---> 30 lines = getlines(filename, module_globals)
     31 if 1 <= lineno <= len(lines):
     32 return lines[lineno - 1]

c:\users\hp\anaconda3\envs\test\lib\linecache.py(46)getlines()
     44
     45 try:
---> 46 return updatecache(filename, module_globals)
     47 except MemoryError:
     48 clearcache()

c:\users\hp\anaconda3\envs\test\lib\linecache.py(136)updatecache()
    134 return []
    135 try:
--> 136 with tokenize.open(fullname) as fp:
    137 lines = fp.readlines()
    138 except OSError:

c:\users\hp\anaconda3\envs\test\lib\tokenize.py(394)open()
    392 buffer = _builtin_open(filename, 'rb')
    393 try:
--> 394 encoding, lines = detect_encoding(buffer.readline)
    395 buffer.seek(0)
    396 text = TextIOWrapper(buffer, encoding, line_buffering=True)

c:\users\hp\anaconda3\envs\test\lib\tokenize.py(371)detect_encoding()
    369 return default, []
    370
--> 371 encoding = find_cookie(first)
    372 if encoding:
    373 return encoding, [first]

c:\users\hp\anaconda3\envs\test\lib\tokenize.py(335)find_cookie()
    333 if filename is not None:
    334 msg = '{} for {!r}'.format(msg, filename)
--> 335 raise SyntaxError(msg)
    336
    337 match = cookie_re.match(line_string)

srishti-git1110 commented 2 years ago

@raypereda-gr could you also please let me know the source you download the files from and the code you use?

Here is my code. I'm worried that incorrect files are being downloaded on my end, which would cause the errors.

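import sys
from subprocess import PIPE, STDOUT, Popen

save_loc = 'Desktop/scratch/'

def download_ftp_files(link, save_loc, verbose=True):
    """Download all FTP files from the supplied link into save_loc."""
    # wget expands the trailing "*" itself (FTP globbing), so no shell is needed.
    # wget logs to stderr, so merge it into stdout; text=True yields str lines.
    process = Popen(['wget', link + '*'],
                    stdout=PIPE, stderr=STDOUT, text=True, cwd=save_loc)

    # Always drain the pipe so wget cannot block on a full buffer.
    for line in process.stdout:
        if verbose:
            sys.stdout.write(line)

    # Wait for wget to finish and return its exit status (0 means success).
    return process.wait()

download_ftp_files('ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/', save_loc=save_loc + 'baseline/')

raypereda-gr commented 2 years ago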

I will look into it further in a couple of days. In the meantime, you can help me with two things. First, trim the file down to the smallest file that still gives the same error. You will need to work with the unzipped XML for that, not the zipped file. Second, try downloading the file in different ways, including a manual download, and see whether the file changes depending on how it was downloaded.
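
One way to compare downloads, sketched with the standard library only: hash each copy and check that the gzip stream decompresses all the way to the end. Two copies of the same file fetched in different ways should give the same digest, and a truncated or corrupted copy should fail the integrity check. The function names md5_of_file and gzip_is_intact are hypothetical. The baseline directory may also publish an .md5 file next to each archive that could serve as a reference value, but treat that as an assumption to verify.

```python
import gzip
import hashlib
import zlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # EOFError: the stream ends early (typical for a truncated download).
        # OSError: not a gzip file or a failed checksum (gzip.BadGzipFile).
        # zlib.error: the compressed data itself is corrupt.
        return False
    return True


# Usage (file names are placeholders):
# print(md5_of_file("copy_from_wget.xml.gz") == md5_of_file("copy_from_browser.xml.gz"))
# print(gzip_is_intact("copy_from_wget.xml.gz"))
```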

titipata commented 2 years ago

Thank you so much @raypereda-gr for helping out!

srishti-git1110 commented 2 years ago

@raypereda-gr Thank you very much for offering to help.

As you asked to work with the .xml and not .xml.gz (zipped) file, is it required for trimming the file down or to parse it? Asking because I was able to parse a .xml.gz file using parse_medline_xml().

To download manually as you suggested, I tried to navigate to the exact same directory (webpage) online where the files were getting downloaded from. These are the same as the ones that were getting downloaded using the code I attached in a previous comment.

Further, as I wrote in the original issue, this is the file parse_medline_xml() works perfectly with. I strongly feel the one I'm trying to parse now isn't in the format the parser is for. However, I might be wrong here. Sorry for bothering you too much. I'm very new to this hence the naivety.

raypereda-gr commented 2 years ago

> As you asked to work with the .xml and not .xml.gz (zipped) file, is it required for trimming the file down or to parse it? Asking because I was able to parse a .xml.gz file using parse_medline_xml().

That is the same function that I use:

list_of_dictionary = pp.parse_medline_xml(pubmed_xml_filename, year_info_only=False)

That function will accept a .xml or .xml.gz file. You don't need to worry about unzipping explicitly; the function will handle that if needed.

Since you have been able to parse the .xml.gz file, we can be confident that the problem is with the .xml file. How exactly did you unzip it? Here's the ls output for the unzipped file that I created on a Mac using the pre-installed unzip tool. I also counted the number of lines.
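
A minimal sketch of how a reader can accept either kind of file, assuming nothing about how pubmed_parser does it internally: look at the first two bytes instead of the file extension, because every gzip stream starts with 0x1f 0x8b. The helper name open_maybe_gzip is hypothetical.

```python
import gzip


def open_maybe_gzip(path):
    """Open `path` for binary reading, decompressing on the fly if it holds gzip data."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    # Every gzip stream starts with the two magic bytes 0x1f 0x8b.
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rb")
    return open(path, "rb")


# Usage (file name is a placeholder):
# with open_maybe_gzip("pubmed22n0002.xml.gz") as handle:
#     print(handle.read(200))
```

Checking the content rather than the name also catches a file that carries a .gz extension but is not actually compressed, or the reverse.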

$ ls -l medline17n0116.xml
-rw-r--r--@ 1 raypereda  staff  188634668 Mar 19 16:41 medline17n0116.xml

$ wc *.xml
 4572705 10113718 188634668 medline17n0116.xml
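
The line and byte counts that wc reports above can also be computed straight from the compressed file, which takes the unzip tool out of the comparison. A minimal standard-library sketch; the function name count_lines_and_bytes is hypothetical.

```python
import gzip


def count_lines_and_bytes(gz_path, chunk_size=1 << 20):
    """Return (newline_count, byte_count) of the decompressed content, like `wc -l -c`."""
    lines = 0
    size = 0
    with gzip.open(gz_path, "rb") as handle:
        # Read fixed-size chunks so a large file never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            lines += chunk.count(b"\n")
            size += len(chunk)
    return lines, size


# Usage (file name is a placeholder):
# print(count_lines_and_bytes("medline17n0116.xml.gz"))
```

If these numbers differ from what wc reports for a manually unzipped copy, the unzipping step changed the file.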

> To download manually as you suggested, I tried to navigate to the exact same directory (webpage) online where the files were getting downloaded from. These are the same as the ones that were getting downloaded using the code I attached in a previous comment.

Good. That means we can be confident that the problem is not with the download. I suspect something is off with the unzipping.

> Further, as I wrote in the original issue, this is the file parse_medline_xml() works perfectly with. I strongly feel the one I'm trying to parse now isn't in the format the parser is for. However, I might be wrong here.

OK, why can't you just parse the .xml.gz file? I would suggest not worrying about unzipping the files.

srishti-git1110 commented 2 years ago

Thanks @raypereda-gr !

Yes, I was working only with the zipped file (.xml.gz); it still wasn't working.

I made a small change: I passed the path as a keyword argument when calling the function, like so: pp.parse_medline_xml(path=pubmed_xml_filepath)

instead of passing it positionally, like so: pp.parse_medline_xml(pubmed_xml_filepath)

and with that it worked. Anyway, thanks a lot for helping so patiently, @titipata @raypereda-gr. Best, Srishti