petermr / docanalysis

Semantic analysis of text documents including sentence and paragraph splitting
Apache License 2.0

ERROR: section papers using --run_sectioning before search #13

Open Kaartik7 opened 2 years ago

Kaartik7 commented 2 years ago

When I run the command docanalysis --run_pygetpapers -q "terpene" -k 10 --project_name terpene_10 in the terminal on my Mac, I get the error mentioned in the title. Kindly help me with it.

Kaartik7 commented 2 years ago

It does download the papers and create the CProject, but I get this error message after the command finishes executing.

Kaartik7 commented 2 years ago

Additional info that might help in understanding the issue: I get the error message "docanalysis: error: unrecognized arguments: --run_sectioning" when I try to section the papers.
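The wording "error: unrecognized arguments" matches what Python's argparse prints when a flag is not defined in the parser, so one plausible reading is that the installed docanalysis predates --run_sectioning. The following is a minimal sketch of that behaviour. The build_parser function is a hypothetical stand-in that mimics an older and a newer release; it is not the actual docanalysis source.

```python
import argparse
from typing import List


def build_parser(with_sectioning: bool) -> argparse.ArgumentParser:
    """Hypothetical stand-in for the docanalysis CLI parser (not the real code)."""
    parser = argparse.ArgumentParser(prog="docanalysis")
    parser.add_argument("--run_pygetpapers", action="store_true")
    parser.add_argument("-q", "--query")
    parser.add_argument("-k", "--hits", type=int)
    parser.add_argument("--project_name")
    if with_sectioning:
        # Assumption: only newer releases define this flag.
        parser.add_argument("--run_sectioning", action="store_true")
    return parser


def unknown_flags(parser: argparse.ArgumentParser, argv: List[str]) -> List[str]:
    """Return the arguments that the given parser does not recognise."""
    _, unknown = parser.parse_known_args(argv)
    return unknown


argv = ["--run_pygetpapers", "-q", "terpene", "-k", "10", "--run_sectioning"]
old_unknown = unknown_flags(build_parser(with_sectioning=False), argv)
new_unknown = unknown_flags(build_parser(with_sectioning=True), argv)
print("older parser rejects:", old_unknown)
print("newer parser rejects:", new_unknown)
```

With parse_args instead of parse_known_args, the older parser would exit with the same "unrecognized arguments" message reported here.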

petermr commented 2 years ago

I have just run this:

pm286macbook:awena-wikidata-crawler pm286$ docanalysis --help

/opt/anaconda3/lib/python3.8/site-packages/_distutils_hack/__init__.py:36: UserWarning: Setuptools is replacing distutils.

warnings.warn("Setuptools is replacing distutils.")

usage: docanalysis [-h] [--run_pygetpapers] [--run_sectioning] [-q QUERY] [-k HITS]
                   [--project_name PROJECT_NAME] [-d DICTIONARY] [-o OUTPUT]
                   [--make_ami_dict MAKE_AMI_DICT] [-l LOGLEVEL] [-f LOGFILE]
                   [--section [SECTION [SECTION ...]]] [--entities [ENTITIES [ENTITIES ...]]]
                   [--spacy_model SPACY_MODEL] [--html HTML]

Welcome to Docanalysis version 0.0.7. -h or --help for help

optional arguments:
  -h, --help            show this help message and exit
  --run_pygetpapers     queries EuropePMC via pygetpapers
  --run_sectioning      make sections
  -q QUERY, --query QUERY
                        query to pygetpapers
  -k HITS, --hits HITS  numbers of papers to download from pygetpapers
  --project_name PROJECT_NAME
                        name of CProject folder
  -d DICTIONARY, --dictionary DICTIONARY
                        Ami Dictionary to tag sentences and support supervised entity
                        extraction
  -o OUTPUT, --output OUTPUT
                        Output CSV file [default=entities.csv]
  --make_ami_dict MAKE_AMI_DICT
                        if provided will make ami dict with given title
  -l LOGLEVEL, --loglevel LOGLEVEL
                        [All] Provide logging level. Example --log warning
                        <<info,warning,debug,error,critical>>, default='info'
  -f LOGFILE, --logfile LOGFILE
                        [All] save log to specified file in output directory as well as
                        printing to terminal
  --section [SECTION [SECTION ...]]
                        Which section to get
  --entities [ENTITIES [ENTITIES ...]]
                        Which entities to get. Default(ALL)
  --spacy_model SPACY_MODEL
                        Optional. (spacy, scispacy). Default(spacy)
  --html HTML           Saves output in html format to given path

[...]

(base) pm286macbook:projects pm286$ docanalysis -q "lantana" -k 5 --run_pygetpapers --run_sectioning

/opt/anaconda3/lib/python3.8/site-packages/_distutils_hack/__init__.py:36: UserWarning: Setuptools is replacing distutils.

warnings.warn("Setuptools is replacing distutils.")

INFO: making project/searching lantana for 5 hits into /Users/pm286/projects/2022_05_29_09_04_26

INFO: Total Hits are 2174

1it [00:00, 323.31it/s]

INFO: Saving XML files to /Users/pm286/projects/2022_05_29_09_04_26/*/fulltext.xml

100%|█████████████████████████████████████████████████████████████| 5/5 [00:01<00:00, 3.79it/s]

WARNING: Making sections in /Users/pm286/projects/2022_05_29_09_04_26/PMC9095257/fulltext.xml

INFO: dict_keys: dict_keys(['abstract', 'acknowledge', 'affiliation', 'author', 'conclusion', 'discussion', 'ethics', 'fig_caption', 'front', 'introduction', 'jrnl_title', 'keyword', 'method', 'octree', 'pdfimage', 'pub_date', 'publisher', 'reference', 'results_discuss', 'search_results', 'sections', 'svg', 'table', 'title'])

WARNING: loading templates.json

INFO: wrote XML sections for /Users/pm286/projects/2022_05_29_09_04_26/PMC9095257/fulltext.xml /Users/pm286/projects/2022_05_29_09_04_26/PMC9095257/sections

WARNING: Making sections in /Users/pm286/projects/2022_05_29_09_04_26/PMC8933013/fulltext.xml

INFO: wrote XML sections for /Users/pm286/projects/2022_05_29_09_04_26/PMC8933013/fulltext.xml /Users/pm286/projects/2022_05_29_09_04_26/PMC8933013/sections

WARNING: Making sections in /Users/pm286/projects/2022_05_29_09_04_26/PMC8879267/fulltext.xml

INFO: wrote XML sections for /Users/pm286/projects/2022_05_29_09_04_26/PMC8879267/fulltext.xml /Users/pm286/projects/2022_05_29_09_04_26/PMC8879267/sections

WARNING: Making sections in /Users/pm286/projects/2022_05_29_09_04_26/PMC8593682/fulltext.xml

INFO: wrote XML sections for /Users/pm286/projects/2022_05_29_09_04_26/PMC8593682/fulltext.xml /Users/pm286/projects/2022_05_29_09_04_26/PMC8593682/sections

WARNING: Making sections in /Users/pm286/projects/2022_05_29_09_04_26/PMC8896935/fulltext.xml

INFO: wrote XML sections for /Users/pm286/projects/2022_05_29_09_04_26/PMC8896935/fulltext.xml /Users/pm286/projects/2022_05_29_09_04_26/PMC8896935/sections

INFO: starting tokenization on 1 paragraphs

100%|████████████████████████████████████████████████████████| 847/847 [00:01<00:00, 716.46it/s]

INFO: Found 2610 sentences

INFO: getting terms from/to False

INFO: Loading spacy

100%|██████████████████████████████████████████████████████| 2610/2610 [00:14<00:00, 175.90it/s]

/opt/anaconda3/lib/python3.8/site-packages/docanalysis/entity_extraction.py:257: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will not be treated as literal strings when regex=True.

df[col] = df[col].astype(str).str.replace(

INFO: wrote output to /Users/pm286/projects/2022_05_29_09_04_26/entities.csv

(base) pm286macbook:projects pm286$ ls -lt | more

total 88200

drwxr-xr-x 9 pm286 staff 288 29 May 10:04 2022_05_29_09_04_26

drwxr-xr-x 21 pm286 staff 672 22 May 18:29 presentations

[...]

(base) pm286macbook:projects pm286$ tree 2022_05_29_09_04_26/ | more

2022_05_29_09_04_26/
├── PMC8593682
│   ├── eupmc_result.json
│   ├── fulltext.xml
│   └── sections
│       ├── 0_processing-meta
│       │   └── 0_restricted-by.xml
│       ├── 1_front
│       │   ├── 0_journal-meta
│       │   │   ├── 0_journal-id.xml
│       │   │   ├── 1_journal-id.xml
│       │   │   ├── 2_journal-id.xml
│       │   │   ├── 3_journal-title-group.xml
│       │   │   ├── 4_issn.xml
│       │   │   └── 5_publisher.xml
│       │   └── 1_article-meta
│       │       ├── 0_article-id.xml
│       │       ├── 10_pub-date.xml
│       │       ├── 11_pub-date.xml
│       │       ├── 12_pub-date.xml
│       │       ├── 13_volume.xml
│       │       ├── 14_issue.xml
│       │       ├── 15_elocation-id.xml
│       │       ├── 16_history.xml
│       │       ├── 17_permissions.xml
│       │       ├── 18_self-uri.xml
│       │       ├── 19_abstract.xml
│       │       ├── 1_article-id.xml
│       │       ├── 20_kwd-group.xml
│       │       ├── 21_funding-group
│       │       │   ├── 0_award-group
│       │       │   │   ├── 0_funding-source
│       │       │   │   │   └── 0_institution-wrap
│       │       │   │   │       ├── 0_institution.xml
│       │       │   │   │       └── 1_institution-id.xml
│       │       │   │   ├── 1_award-id.xml
│       │       │   │   ├── 2_principal-award-recipient
│       │       │   │   │   └── 0_name.xml
│       │       │   │   ├── 3_principal-award-recipient
│       │       │   │   │   └── 0_name.xml
│       │       │   │   ├── 4_principal-award-recipient
│       │       │   │   │   └── 0_name.xml
│       │       │   │   ├── 5_principal-award-recipient
│       │       │   │   │   └── 0_name.xml
│       │       │   │   ├── 6_principal-award-recipient
│       │       │   │   │   └── 0_name.xml
│       │       │   │   ├── 7_principal-award-recipient
│       │       │   │   │   └── 0_name.xml
│       │       │   │   └── 8_principal-award-recipient
│       │       │   │       └── 0_name.xml
│       │       │   ├── 1_award-group

[...]
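The listing above suggests that a successful run leaves a sections directory of XML files next to each paper's fulltext.xml. A rough way to check whether sectioning actually ran on a CProject is sketched below. It assumes only the layout visible in this listing; the function names section_counts and unsectioned are made up for illustration and are not part of docanalysis.

```python
from pathlib import Path
from typing import Dict, List


def section_counts(project_dir: str) -> Dict[str, int]:
    """Map each paper folder (one holding fulltext.xml) to its number of section XML files."""
    counts: Dict[str, int] = {}
    for fulltext in sorted(Path(project_dir).glob("*/fulltext.xml")):
        paper_dir = fulltext.parent
        sections_dir = paper_dir / "sections"
        if sections_dir.is_dir():
            # Count XML files at any depth below sections/.
            counts[paper_dir.name] = sum(1 for _ in sections_dir.rglob("*.xml"))
        else:
            counts[paper_dir.name] = 0
    return counts


def unsectioned(project_dir: str) -> List[str]:
    """List the paper folders that have no section XML files."""
    return [paper for paper, n in section_counts(project_dir).items() if n == 0]
```

Calling unsectioned("terpene_10") on the project from the original report would then list the papers that were downloaded but never sectioned, assuming the project was created in the current directory.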

So it works for me, although pip install seems to give version 0.0.7.

Shweata, any thoughts?

P.


-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
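Since --run_sectioning is present in the 0.0.7 help output shown in this thread, comparing installed versions is a reasonable first step. A minimal sketch using the standard library's importlib.metadata (Python 3.8+) follows. It assumes the distribution is registered under the name "docanalysis", and the helper installed_version is made up for illustration.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Assumption: the package is installed under the distribution name "docanalysis".
version = installed_version("docanalysis")
if version is None:
    print("docanalysis is not installed in this environment")
else:
    print("docanalysis version:", version)
```

If the reported version is older than 0.0.7, upgrading with pip install --upgrade docanalysis and re-running docanalysis --help would show whether the flag is now available.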