Open Kaartik7 opened 2 years ago
Although it does download the papers and create the CProject, I get this error message after the command finishes executing.
Additional info that might help in understanding the issue: I get the error message "docanalysis: error: unrecognized arguments: --run_sectioning" when I try to section the papers.
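For context, "unrecognized arguments" is the message argparse prints when the parser being run has no definition for a flag. The sketch below is a minimal illustration only: `build_parser` and `try_parse` are hypothetical helpers, not docanalysis's actual code, and the assumption is that the failing install has a command-line parser that does not define `--run_sectioning`.

```python
import argparse


def build_parser(with_sectioning: bool) -> argparse.ArgumentParser:
    """Hypothetical stand-in for a CLI parser; NOT the real docanalysis parser."""
    parser = argparse.ArgumentParser(prog="docanalysis")
    parser.add_argument("--run_pygetpapers", action="store_true")
    parser.add_argument("-q", "--query")
    parser.add_argument("-k", "--hits", type=int)
    if with_sectioning:
        # Only a parser that defines the flag will accept it.
        parser.add_argument("--run_sectioning", action="store_true")
    return parser


ARGS = ["-q", "lantana", "-k", "5", "--run_pygetpapers", "--run_sectioning"]


def try_parse(parser: argparse.ArgumentParser, args):
    """Return the parsed namespace, or the exit code if argparse rejects the args."""
    try:
        return parser.parse_args(args)
    except SystemExit as exc:
        # argparse prints usage plus "error: unrecognized arguments: ..." to stderr
        # and then exits; the exit code is captured here instead.
        return exc.code
```

If the parser lacks the flag, `try_parse` returns argparse's error exit code; with the flag defined, the same arguments parse normally.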
I have just run this:
pm286macbook:awena-wikidata-crawler pm286$ docanalysis --help
/opt/anaconda3/lib/python3.8/site-packages/_distutils_hack/__init__.py:36: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
usage: docanalysis [-h] [--run_pygetpapers] [--run_sectioning] [-q QUERY] [-k HITS]
[--project_name PROJECT_NAME] [-d DICTIONARY] [-o OUTPUT]
[--make_ami_dict MAKE_AMI_DICT] [-l LOGLEVEL] [-f LOGFILE]
[--section [SECTION [SECTION ...]]] [--entities [ENTITIES [ENTITIES ...]]]
[--spacy_model SPACY_MODEL] [--html HTML]
Welcome to Docanalysis version 0.0.7. -h or --help for help
optional arguments:
-h, --help show this help message and exit
--run_pygetpapers queries EuropePMC via pygetpapers
--run_sectioning make sections
-q QUERY, --query QUERY
query to pygetpapers
-k HITS, --hits HITS numbers of papers to download from pygetpapers
--project_name PROJECT_NAME
name of CProject folder
-d DICTIONARY, --dictionary DICTIONARY
Ami Dictionary to tag sentences and support supervised entity extraction
-o OUTPUT, --output OUTPUT
Output CSV file [default=entities.csv]
--make_ami_dict MAKE_AMI_DICT
if provided will make ami dict with given title
-l LOGLEVEL, --loglevel LOGLEVEL
[All] Provide logging level. Example --log warning <<info,warning,debug,error,critical>>, default='info'
-f LOGFILE, --logfile LOGFILE
[All] save log to specified file in output directory as well as printing to terminal
--section [SECTION [SECTION ...]]
Which section to get
--entities [ENTITIES [ENTITIES ...]]
Which entities to get. Default(ALL)
--spacy_model SPACY_MODEL
Optional. (spacy, scispacy). Default(spacy)
--html HTML Saves output in html format to given path
[...]
(base) pm286macbook:projects pm286$ docanalysis -q "lantana" -k 5 --run_pygetpapers --run_sectioning
/opt/anaconda3/lib/python3.8/site-packages/_distutils_hack/__init__.py:36: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
INFO: making project/searching lantana for 5 hits into /Users/pm286/projects/2022_05_29_09_04_26
INFO: Total Hits are 2174
1it [00:00, 323.31it/s]
INFO: Saving XML files to /Users/pm286/projects/2022_05_29_09_04_26/*/fulltext.xml
100%|█████████████████████████████████████████████████████████████| 5/5 [00:01<00:00, 3.79it/s]
WARNING: Making sections in /Users/pm286/projects/2022_05_29_09_04_26/PMC9095257/fulltext.xml
INFO: dict_keys: dict_keys(['abstract', 'acknowledge', 'affiliation', 'author', 'conclusion', 'discussion', 'ethics', 'fig_caption', 'front', 'introduction', 'jrnl_title', 'keyword', 'method', 'octree', 'pdfimage', 'pub_date', 'publisher', 'reference', 'results_discuss', 'search_results', 'sections', 'svg', 'table', 'title'])
WARNING: loading templates.json
INFO: wrote XML sections for /Users/pm286/projects/2022_05_29_09_04_26/PMC9095257/fulltext.xml /Users/pm286/projects/2022_05_29_09_04_26/PMC9095257/sections
WARNING: Making sections in /Users/pm286/projects/2022_05_29_09_04_26/PMC8933013/fulltext.xml
INFO: wrote XML sections for /Users/pm286/projects/2022_05_29_09_04_26/PMC8933013/fulltext.xml /Users/pm286/projects/2022_05_29_09_04_26/PMC8933013/sections
WARNING: Making sections in /Users/pm286/projects/2022_05_29_09_04_26/PMC8879267/fulltext.xml
INFO: wrote XML sections for /Users/pm286/projects/2022_05_29_09_04_26/PMC8879267/fulltext.xml /Users/pm286/projects/2022_05_29_09_04_26/PMC8879267/sections
WARNING: Making sections in /Users/pm286/projects/2022_05_29_09_04_26/PMC8593682/fulltext.xml
INFO: wrote XML sections for /Users/pm286/projects/2022_05_29_09_04_26/PMC8593682/fulltext.xml /Users/pm286/projects/2022_05_29_09_04_26/PMC8593682/sections
WARNING: Making sections in /Users/pm286/projects/2022_05_29_09_04_26/PMC8896935/fulltext.xml
INFO: wrote XML sections for /Users/pm286/projects/2022_05_29_09_04_26/PMC8896935/fulltext.xml /Users/pm286/projects/2022_05_29_09_04_26/PMC8896935/sections
INFO: starting tokenization on 1 paragraphs
100%|████████████████████████████████████████████████████████| 847/847 [00:01<00:00, 716.46it/s]
INFO: Found 2610 sentences
INFO: getting terms from/to False
INFO: Loading spacy
100%|██████████████████████████████████████████████████████| 2610/2610 [00:14<00:00, 175.90it/s]
/opt/anaconda3/lib/python3.8/site-packages/docanalysis/entity_extraction.py:257: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will not be treated as literal strings when regex=True.
df[col] = df[col].astype(str).str.replace(
INFO: wrote output to /Users/pm286/projects/2022_05_29_09_04_26/entities.csv
(base) pm286macbook:projects pm286$ ls -lt | more
total 88200
drwxr-xr-x 9 pm286 staff 288 29 May 10:04 2022_05_29_09_04_26
drwxr-xr-x 21 pm286 staff 672 22 May 18:29 presentations
[...]
(base) pm286macbook:projects pm286$ tree 2022_05_29_09_04_26/ | more
2022_05_29_09_04_26/
├── PMC8593682
│ ├── eupmc_result.json
│ ├── fulltext.xml
│ └── sections
│ ├── 0_processing-meta
│ │ └── 0_restricted-by.xml
│ ├── 1_front
│ │ ├── 0_journal-meta
│ │ │ ├── 0_journal-id.xml
│ │ │ ├── 1_journal-id.xml
│ │ │ ├── 2_journal-id.xml
│ │ │ ├── 3_journal-title-group.xml
│ │ │ ├── 4_issn.xml
│ │ │ └── 5_publisher.xml
│ │ └── 1_article-meta
│ │ ├── 0_article-id.xml
│ │ ├── 10_pub-date.xml
│ │ ├── 11_pub-date.xml
│ │ ├── 12_pub-date.xml
│ │ ├── 13_volume.xml
│ │ ├── 14_issue.xml
│ │ ├── 15_elocation-id.xml
│ │ ├── 16_history.xml
│ │ ├── 17_permissions.xml
│ │ ├── 18_self-uri.xml
│ │ ├── 19_abstract.xml
│ │ ├── 1_article-id.xml
│ │ ├── 20_kwd-group.xml
│ │ ├── 21_funding-group
│ │ │ ├── 0_award-group
│ │ │ │ ├── 0_funding-source
│ │ │ │ │ └── 0_institution-wrap
│ │ │ │ │ ├── 0_institution.xml
│ │ │ │ │ └── 1_institution-id.xml
│ │ │ │ ├── 1_award-id.xml
│ │ │ │ ├── 2_principal-award-recipient
│ │ │ │ │ └── 0_name.xml
│ │ │ │ ├── 3_principal-award-recipient
│ │ │ │ │ └── 0_name.xml
│ │ │ │ ├── 4_principal-award-recipient
│ │ │ │ │ └── 0_name.xml
│ │ │ │ ├── 5_principal-award-recipient
│ │ │ │ │ └── 0_name.xml
│ │ │ │ ├── 6_principal-award-recipient
│ │ │ │ │ └── 0_name.xml
│ │ │ │ ├── 7_principal-award-recipient
│ │ │ │ │ └── 0_name.xml
│ │ │ │ └── 8_principal-award-recipient
│ │ │ │ └── 0_name.xml
│ │ │ ├── 1_award-group
[...]
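The same check can be done without `tree`. The following is a rough sketch, assuming the layout shown above (one folder per paper, each with a `sections` directory of XML files); `count_section_files` is a hypothetical helper and not part of docanalysis.

```python
from pathlib import Path
from typing import Dict


def count_section_files(cproject: Path) -> Dict[str, int]:
    """Map each paper folder name to the number of XML files under its sections/ dir."""
    counts: Dict[str, int] = {}
    for paper in sorted(p for p in cproject.iterdir() if p.is_dir()):
        sections = paper / "sections"
        # A paper with no sections/ directory was not sectioned; report 0 for it.
        if sections.is_dir():
            counts[paper.name] = sum(1 for _ in sections.rglob("*.xml"))
        else:
            counts[paper.name] = 0
    return counts
```

For example, `count_section_files(Path("2022_05_29_09_04_26"))` would list the five PMC folders with their section-file counts; a count of 0 would suggest sectioning did not run for that paper.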
So it works for me, although pip install seems to give version 0.0.7.
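To compare installs, it may help to print the version that Python actually sees in the environment where the error occurs. This is a small sketch using the standard-library `importlib.metadata` (Python 3.8+); `installed_version` is a hypothetical helper name, and the idea that the failing environment has a different version installed is only an assumption.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Run this with the same interpreter/environment that runs the docanalysis command.
    print("docanalysis:", installed_version("docanalysis"))
```

If the version printed differs between the two machines, that would be a reasonable first thing to rule out.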
Shweata, any thoughts?
P.
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
When I run the following command in the terminal on my Mac, I run into the above-mentioned error:
docanalysis --run_pygetpapers -q "terpene" -k 10 --project_name terpene_10
Kindly help me with it.