petermr / openDiagram

Extraction of semantic data from diagrams in scientific and other technical/business documents
Apache License 2.0

search_lib for the dictionary Activity with Minicorpora Activity #13

Open Radhu903 opened 3 years ago

Radhu903 commented 3 years ago
C:\Users\DELL\openDiagram\physchem\python>python search_lib.py --dict activity --sect introduction method --proj activity
running search main
project files not available for  C:\Users\DELL\openDiagram\python\diagrams\satish\cct
project files not available for  C:\Users\DELL\openDiagram\python\diagrams\rahul\diffprotexp
project files not available for  C:\Users\DELL\openVirus\miniproject\disease\1-part
project files not available for  C:\Users\DELL\worcester\synthesis
project files not available for  C:\Users\DELL\worcester\explosion
Failed to read dictionary C:\Users\DELL\CEVOpen\dictionary\eoCompound\plant_compound.xml Start tag expected, '<' not found, line 1, column 1 (file:/C:/Users/DELL/CEVOpen/dictionary/eoCompound/plant_compound.xml, line 1)
Failed to read dictionary C:\Users\DELL\CEVOpen\dictionary\eoPlant\Plant.xml Opening and ending tag mismatch: entry line 91 and dictionary, line 2399, column 14 (file:/C:/Users/DELL/CEVOpen/dictionary/eoPlant/Plant.xml, line 2399)
Failed to read dictionary C:\Users\DELL\CEVOpen\dictionary\eoCompound\plant_compound.xml Start tag expected, '<' not found, line 1, column 1 (file:/C:/Users/DELL/CEVOpen/dictionary/eoCompound/plant_compound.xml, line 1)
core dicts dict_keys(['activity', 'country', 'disease', 'plant_genus', 'organization', 'plant_part', 'invasive_plant'])
commandline args
dicts ['activity'] <class 'list'>
sects ['introduction', 'method'] <class 'list'>
projs ['activity'] <class 'list'>
patterns None <class 'NoneType'>
args> Namespace(dict=['activity'], sect=['introduction', 'method'], proj=['activity'], patt=None, demo=None, loglevel='foo', plot=True, nosearch=False, maxbars=25, languages=['en'])
name activity
***** project C:\Users\DELL\CEVOpen\minicorpora\activity
_DESC <class 'str'> introduction or background; looks for these words anywhere in file titles
PROJ <class 'str'> C:\Users\DELL\CEVOpen\minicorpora\activity
TREE <class 'str'> *
SECTS <class 'str'> **
SUBSECT <class 'str'> *introduction*
SUBSUB <class 'str'> **
FILE <class 'str'> *
SUFFIX <class 'str'> xml
glob C:\Users\DELL\CEVOpen\minicorpora\activity/*/**/*introduction*/**/*.xml
_DESC <class 'str'> introduction or background; looks for these words anywhere in file titles
PROJ <class 'str'> C:\Users\DELL\CEVOpen\minicorpora\activity
TREE <class 'str'> *
SECTS <class 'str'> **
SUBSECT <class 'str'> *background*
SUBSUB <class 'str'> **
FILE <class 'str'> *
SUFFIX <class 'str'> xml
glob C:\Users\DELL\CEVOpen\minicorpora\activity/*/**/*background*/**/*.xml
files 1203
***** section_files introduction 1203
file C:\Users\DELL\CEVOpen\minicorpora\activity\PMC7210559\sections\1_body\0_1__introduction\0_title.xml
findfont: Font family ['Helvetica'] not found. Falling back to DejaVu Sans.
findfont: Font family ['Helvetica'] not found. Falling back to DejaVu Sans.
lang: en
 [('antioxidant', 181), ('antifungal', 111), ('cosmetics', 15), ('analgesic', 15), ('antiseptic', 12), ('derivative', 10), ('antiprotozoal', 8), ('cytotoxicity', 7), ('antispasmodic', 5), ('perfume', 5), ('antiparasitic', 5), ('antiemetic', 4), ('anxiolytic', 4), ('fungicide', 3), ('antitussive', 3), ('diuretic', 3), ('insecticide', 2), ('phytotoxicity', 2), ('aphrodisiac', 2), ('larvicide', 2), ('irritant', 1), ('antipyretic', 1), ('immunomodulator', 1), ('anticoagulant', 1), ('bronchodilator', 1), ('anthelmintic', 1), ('sedative', 1), ('carcinogen', 1), ('choleretic', 1), ('stomachic', 1), ('astringent', 1), ('hypolipidemic', 1), ('pesticide', 1), ('anaphylactic', 1), ('adenosine', 1), ('antimalarial', 1), ('photosensitizer', 1)]
_DESC <class 'str'> methods and/or materials; looks for these words anywhere in file titles
PROJ <class 'str'> C:\Users\DELL\CEVOpen\minicorpora\activity
TREE <class 'str'> *
SECTS <class 'str'> **
SUBSECT <class 'str'> *method*
SUBSUB <class 'str'> **
FILE <class 'str'> *p
SUFFIX <class 'str'> xml
glob C:\Users\DELL\CEVOpen\minicorpora\activity/*/**/*method*/**/*p.xml
_DESC <class 'str'> methods and/or materials; looks for these words anywhere in file titles
PROJ <class 'str'> C:\Users\DELL\CEVOpen\minicorpora\activity
TREE <class 'str'> *
SECTS <class 'str'> **
SUBSECT <class 'str'> *material*
SUBSUB <class 'str'> **
FILE <class 'str'> *p
SUFFIX <class 'str'> xml
glob C:\Users\DELL\CEVOpen\minicorpora\activity/*/**/*material*/**/*p.xml
files 3148
***** section_files method 3148
file C:\Users\DELL\CEVOpen\minicorpora\activity\PMC7210559\sections\1_body\1_2__materials_and_methods\1_2_1__materials\1_p.xml
Traceback (most recent call last):
  File "C:\Users\DELL\openDiagram\physchem\python\search_lib.py", line 955, in <module>
    main()
  File "C:\Users\DELL\openDiagram\physchem\python\search_lib.py", line 918, in main
    ami_search.run_search()
  File "C:\Users\DELL\openDiagram\physchem\python\search_lib.py", line 340, in run_search
    self.find_files_search_plot(proj, section_type)
  File "C:\Users\DELL\openDiagram\physchem\python\search_lib.py", line 347, in find_files_search_plot
    counter_dict, pattern_dict = self.search_and_count(section_files)
  File "C:\Users\DELL\openDiagram\physchem\python\search_lib.py", line 294, in search_and_count
    matches_by_amidict, matches_by_pattern = self.search(target_file)
  File "C:\Users\DELL\openDiagram\physchem\python\search_lib.py", line 210, in search
    words = TextUtil.get_words_in_section(file)
  File "C:\Users\DELL\openDiagram\physchem\python\text_lib.py", line 447, in get_words_in_section
    section.read_file(file)
  File "C:\Users\DELL\openDiagram\physchem\python\text_lib.py", line 251, in read_file
    self.sentences = [Sentence(s) for s in (nltk.sent_tokenize(self.txt))]
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python39\lib\site-packages\nltk\tokenize\__init__.py", line 106, in sent_tokenize
    tokenizer = load("tokenizers/punkt/{0}.pickle".format(language))
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python39\lib\site-packages\nltk\data.py", line 752, in load
    opened_resource = _open(resource_url)
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python39\lib\site-packages\nltk\data.py", line 877, in _open
    return find(path_, path + [""]).open()
  File "C:\Users\DELL\AppData\Local\Programs\Python\Python39\lib\site-packages\nltk\data.py", line 585, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/english.pickle

  Searched in:
    - 'C:\\Users\\DELL/nltk_data'
    - 'C:\\Users\\DELL\\AppData\\Local\\Programs\\Python\\Python39\\nltk_data'
    - 'C:\\Users\\DELL\\AppData\\Local\\Programs\\Python\\Python39\\share\\nltk_data'
    - 'C:\\Users\\DELL\\AppData\\Local\\Programs\\Python\\Python39\\lib\\nltk_data'
    - 'C:\\Users\\DELL\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************
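
The traceback ends in a `LookupError` because the `punkt` tokenizer data used by `nltk.sent_tokenize` is not in any of the searched `nltk_data` directories. The error message itself suggests `nltk.download('punkt')`. A minimal sketch of a guard that downloads the resource only when it is missing is shown below. The helper name `ensure_nltk_resource` and its `finder`/`downloader` parameters are hypothetical and not part of `search_lib.py` or `text_lib.py`; `nltk.data.find` and `nltk.download` are the real NLTK calls it falls back to. If a different NLTK version reports a different resource name in its error message, that name should be used instead.

```python
def ensure_nltk_resource(resource_path="tokenizers/punkt", package="punkt",
                         finder=None, downloader=None):
    """Download an NLTK data package only if it is not already installed.

    Returns True if a download was triggered, False if the resource was found.
    `finder` and `downloader` are hypothetical injection points (assumptions
    for this sketch); by default they are nltk.data.find and nltk.download.
    """
    if finder is None or downloader is None:
        # Imported lazily so the helper can be exercised without NLTK installed.
        import nltk
        finder = finder or nltk.data.find
        downloader = downloader or nltk.download
    try:
        # nltk.data.find raises LookupError when the resource is missing,
        # which is the same error seen at the end of the traceback.
        finder(resource_path)
    except LookupError:
        downloader(package)
        return True
    return False
```

Calling `ensure_nltk_resource()` once in a Python session (or at the start of the script) before re-running `python search_lib.py --dict activity --sect introduction method --proj activity` should let the `method` section get past `sent_tokenize`, assuming the machine has network access for the download. The separate `Failed to read dictionary` messages for `plant_compound.xml` and `Plant.xml` are XML well-formedness errors and would not be affected by this.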