titu1994 / PyCTakesParser

Utilities to parse the output of cTAKES
MIT License

Error in parsing xmi file obtained from CTakes #1

Open rishabhjoshi opened 4 years ago

rishabhjoshi commented 4 years ago

Hi, I got an XMI file from cTAKES 4.0.0. When I parse it using ctakes_parser, I get the following error:

KeyError                                  Traceback (most recent call last)
~/envs/myenv/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'true_text'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
~/envs/myenv/lib/python3.7/site-packages/pandas/core/internals/managers.py in set(self, item, value)
   1068         try:
-> 1069             loc = self.items.get_loc(item)
   1070         except KeyError:

~/envs/myenv/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2898             except KeyError:
-> 2899                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2900         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'true_text'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-3-12c5201064fe> in <module>
----> 1 df = parser.parse_file('./../med_linkers/ctakes/apache-ctakes-4.0.0/temp_1_xmi_folder/282251.txt.xmi')

~/envs/myenv/lib/python3.7/site-packages/ctakes_parser/ctakes_parser.py in parse_file(file_path)
    125         return pos_rows
    126
--> 127     results['true_text'] = results.apply(_positional_search, axis=1)
    128
    129     pos_df_subset = positions[['part_of_speech', 'pos_start']]

~/envs/myenv/lib/python3.7/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
   3470         else:
   3471             # set column
-> 3472             self._set_item(key, value)
   3473
   3474     def _setitem_slice(self, key, value):

~/envs/myenv/lib/python3.7/site-packages/pandas/core/frame.py in _set_item(self, key, value)
   3548         self._ensure_valid_index(value)
   3549         value = self._sanitize_column(key, value)
-> 3550         NDFrame._set_item(self, key, value)
   3551
   3552         # check if we are modifying a copy

~/envs/myenv/lib/python3.7/site-packages/pandas/core/generic.py in _set_item(self, key, value)
   3379
   3380     def _set_item(self, key, value):
-> 3381         self._data.set(key, value)
   3382         self._clear_item_cache()
   3383

~/envs/myenv/lib/python3.7/site-packages/pandas/core/internals/managers.py in set(self, item, value)
   1070         except KeyError:
   1071             # This item wasn't present, just insert at end
-> 1072             self.insert(len(self.items), item, value)
   1073             return
   1074

~/envs/myenv/lib/python3.7/site-packages/pandas/core/internals/managers.py in insert(self, loc, item, value, allow_duplicates)
   1179         new_axis = self.items.insert(loc, item)
   1180
-> 1181         block = make_block(values=value, ndim=self.ndim, placement=slice(loc, loc + 1))
   1182
   1183         for blkno, count in _fast_count_smallints(self._blknos[loc:]):

~/envs/myenv/lib/python3.7/site-packages/pandas/core/internals/blocks.py in make_block(values, placement, klass, ndim, dtype, fastpath)
   3265         values = DatetimeArray._simple_new(values, dtype=dtype)
   3266
-> 3267     return klass(values, ndim=ndim, placement=placement)
   3268
   3269

~/envs/myenv/lib/python3.7/site-packages/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
    126             raise ValueError(
    127                 "Wrong number of items passed {val}, placement implies "
--> 128                 "{mgr}".format(val=len(self.values), mgr=len(self.mgr_locs))
    129             )
    130

ValueError: Wrong number of items passed 0, placement implies 1

Any ideas?

rishabhjoshi commented 4 years ago

It seems that this error occurs for files which don't have a "cui" output from cTAKES, i.e. files that contain no element like this one:

<refsem:UmlsConcept xmi:id="xxx" codingScheme="SNOMEDCT_US" code="yyyyyyyy" score="0.0" disambiguated="false" cui="Czzzzzzz" tui="Twww" preferredText="TEXT"/>

It would be great if the library could check for this case and generate relevant output instead of failing.

titu1994 commented 4 years ago

I'm not particularly familiar with cTAKES outputs. Could you submit a PR that performs this check and shows a warning instead?
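
For reference, a check like the one discussed above might look roughly like the sketch below. It is not code from ctakes_parser: the helper names `has_umls_concepts` and `should_parse` are hypothetical, and it assumes (based only on the observation in this thread) that the failure happens when the XMI file contains no UmlsConcept element with a "cui" attribute. It uses only the standard library and matches on the element's local name so it does not depend on the exact namespace URI.

```python
import warnings
import xml.etree.ElementTree as ET


def has_umls_concepts(xmi_source):
    """Return True if the XMI has at least one UmlsConcept element with a non-empty 'cui'.

    `xmi_source` is a file path or a file-like object, as accepted by ET.parse.
    Hypothetical helper, not part of ctakes_parser.
    """
    tree = ET.parse(xmi_source)
    for elem in tree.iter():
        # ElementTree tags look like '{namespace-uri}UmlsConcept';
        # keep only the part after the closing brace.
        local_name = elem.tag.rsplit('}', 1)[-1]
        if local_name == 'UmlsConcept' and elem.get('cui'):
            return True
    return False


def should_parse(xmi_source):
    """Warn and return False when the file has no CUI output, else return True."""
    if not has_umls_concepts(xmi_source):
        warnings.warn(
            "No UmlsConcept element with a 'cui' attribute found; skipping file.",
            UserWarning,
        )
        return False
    return True
```

A caller could run `should_parse(path)` before `parser.parse_file(path)` and skip the file when it returns False. Whether the library should skip such files or return an empty result is a design decision for the PR.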