# Auto-Research

Generate a custom, detailed survey paper with topic-clustered sections and proper citations, from just a single query, in under 30 minutes.
License: GNU General Public License v3.0

Hugging Face Spaces configuration (README front matter):

```yaml
title: Researcher
emoji: 🤓
colorFrom: gray
colorTo: pink
sdk: streamlit
sdk_version: 1.2.0
app_file: app.py
pinned: false
```


A no-code utility to generate a detailed, well-cited survey with topic-clustered sections (in draft paper format) and other useful artifacts from a single research query.

Data provider: arXiv, via the Open Archives Initiative (OAI).

#### Requirements:

See the Installation section below for the required system packages and the pip install command.

#### Demo:

- Video demo: https://drive.google.com/file/d/1-77J2L10lsW-bFDOGdTaPzSr_utY743g/view?usp=sharing

- Re-usable Kaggle demo: https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query

  (Tip: click 'Edit and Run' to run the demo on your own queries, on a free GPU.)

#### Installation:

```shell
sudo apt-get install build-essential poppler-utils libpoppler-cpp-dev pkg-config python-dev
pip install git+https://github.com/sidphbot/Auto-Research.git
```
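After installing, a quick import check can verify the setup (a minimal sketch, assuming the package exposes `Surveyor` from the `survey` module as used in the Python API section below):

```python
# Minimal post-install sanity check.
# Assumes `survey.Surveyor` is importable, as shown in the Python API section below.
from survey import Surveyor

print("Auto-Research is ready:", Surveyor.__name__)
```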

#### Run Survey (CLI):

```shell
python survey.py [options] <your_research_query>
```

#### Run Survey (Streamlit web interface - new):

```shell
streamlit run app.py
```

#### Run Survey (Python API):

```python
from survey import Surveyor

mysurveyor = Surveyor()
mysurveyor.survey('quantum entanglement')
```

#### Research tools:

These are independent tools for your research or document text-handling needs.

*Tip:* models can be changed in the defaults or passed in at init time along with `refresh_models=True`.
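For example, a minimal sketch of overriding one model at init time (the parameter names and the model tag are taken from the runtime options documented below):

```python
from survey import Surveyor

# Pass a summarization model explicitly (shown here with its documented
# default tag) and force a fresh download of the named model.
surveyor = Surveyor(
    ledmodel_name='allenai/led-large-16384-arxiv',
    refresh_models=True,
)
```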

#### Access/Modify defaults:

```python
from pprint import pprint

# DEFAULTS lives in the static config file `defaults.py` referenced below;
# the import path here assumes it is importable as a top-level module.
from defaults import DEFAULTS

pprint(DEFAULTS)
```

or,

- Modify the static config file, `defaults.py`

or,

- At runtime (utility)

```shell
python survey.py --help
```

```text
usage: survey.py [-h] [--max_search max_metadata_papers]
                 [--num_papers max_num_papers] [--pdf_dir pdf_dir]
                 [--txt_dir txt_dir] [--img_dir img_dir] [--tab_dir tab_dir]
                 [--dump_dir dump_dir] [--models_dir save_models_dir]
                 [--title_model_name title_model_name]
                 [--ex_summ_model_name extractive_summ_model_name]
                 [--ledmodel_name ledmodel_name]
                 [--embedder_name sentence_embedder_name]
                 [--nlp_name spacy_model_name]
                 [--similarity_nlp_name similarity_nlp_name]
                 [--kw_model_name kw_model_name]
                 [--refresh_models refresh_models] [--high_gpu high_gpu]
                 query_string

Generate a survey just from a query!

positional arguments:
  query_string          your research query/keywords

optional arguments:
  -h, --help            show this help message and exit
  --max_search max_metadata_papers
                        maximum number of papers to search over - defaults to 100
  --num_papers max_num_papers
                        maximum number of papers to download and analyse - defaults to 25
  --pdf_dir pdf_dir     pdf paper storage directory - defaults to arxiv_data/tarpdfs/
  --txt_dir txt_dir     text-converted paper storage directory - defaults to arxiv_data/fulltext/
  --img_dir img_dir     image storage directory - defaults to arxiv_data/images/
  --tab_dir tab_dir     tables storage directory - defaults to arxiv_data/tables/
  --dump_dir dump_dir   output directory for all artifacts - defaults to arxiv_dumps/
  --models_dir save_models_dir
                        directory to save models (> 5 GB) - defaults to saved_models/
  --title_model_name title_model_name
                        title model name/tag on Hugging Face - defaults to
                        'Callidior/bert2bert-base-arxiv-titlegen'
  --ex_summ_model_name extractive_summ_model_name
                        extractive summary model name/tag on Hugging Face -
                        defaults to 'allenai/scibert_scivocab_uncased'
  --ledmodel_name ledmodel_name
                        LED model (for abstractive summary) name/tag on Hugging
                        Face - defaults to 'allenai/led-large-16384-arxiv'
  --embedder_name sentence_embedder_name
                        sentence embedder name/tag on Hugging Face - defaults
                        to 'paraphrase-MiniLM-L6-v2'
  --nlp_name spacy_model_name
                        spaCy model name (if changed, needs to be
                        spacy-installed first) - defaults to 'en_core_sci_scibert'
  --similarity_nlp_name similarity_nlp_name
                        spaCy downstream model (for similarity) name (if
                        changed, needs to be spacy-installed first) - defaults
                        to 'en_core_sci_lg'
  --kw_model_name kw_model_name
                        keyword extraction model name/tag on Hugging Face -
                        defaults to 'distilbert-base-nli-mean-tokens'
  --refresh_models refresh_models
                        refresh model downloads with the given names (needs at
                        least one model name param above) - defaults to False
  --high_gpu high_gpu   high GPU usage permitted - defaults to False
```



- At runtime (code)

    > During surveyor object initialization with `surveyor_obj = Surveyor()`:
    - `pdf_dir`: String, pdf paper storage directory - defaults to `arxiv_data/tarpdfs/`
    - `txt_dir`: String, text-converted paper storage directory - defaults to `arxiv_data/fulltext/`
    - `img_dir`: String, image storage directory - defaults to `arxiv_data/images/`
    - `tab_dir`: String, tables storage directory - defaults to `arxiv_data/tables/`
    - `dump_dir`: String, output directory for all artifacts - defaults to `arxiv_dumps/`
    - `models_dir`: String, directory to save the (> 5 GB) models - defaults to `saved_models/`
    - `title_model_name`: String, title model name/tag on Hugging Face - defaults to `Callidior/bert2bert-base-arxiv-titlegen`
    - `ex_summ_model_name`: String, extractive summary model name/tag on Hugging Face - defaults to `allenai/scibert_scivocab_uncased`
    - `ledmodel_name`: String, LED model (for abstractive summary) name/tag on Hugging Face - defaults to `allenai/led-large-16384-arxiv`
    - `embedder_name`: String, sentence embedder name/tag on Hugging Face - defaults to `paraphrase-MiniLM-L6-v2`
    - `nlp_name`: String, spaCy model name (if changed, needs to be spacy-installed first) - defaults to `en_core_sci_scibert`
    - `similarity_nlp_name`: String, spaCy downstream model (for similarity) name (if changed, needs to be spacy-installed first) - defaults to `en_core_sci_lg`
    - `kw_model_name`: String, keyword extraction model name/tag on Hugging Face - defaults to `distilbert-base-nli-mean-tokens`
    - `high_gpu`: Bool, high GPU usage permitted - defaults to `False`
    - `refresh_models`: Bool, refresh model downloads with the given names (needs at least one model name param above) - defaults to `False`

    > During survey generation with `surveyor_obj.survey(query="my_research_query")` (see the sketch after this list):
    - `max_search`: int, maximum number of papers to search over - defaults to `100`
    - `num_papers`: int, maximum number of papers to download and analyse - defaults to `25`
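A minimal end-to-end sketch of both steps (all keyword arguments are the documented options above; the values are illustrative):

```python
from survey import Surveyor

# Initialize with a custom output directory and conservative GPU usage.
surveyor_obj = Surveyor(
    dump_dir='arxiv_dumps/',
    high_gpu=False,
)

# Search over up to 50 candidate papers, then download and analyse the top 10.
surveyor_obj.survey(query='quantum entanglement', max_search=50, num_papers=10)
```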

#### Artifacts generated (zipped):
- Detailed survey draft paper as a txt file
- A curated list of the top 25+ papers, as pdfs and txts
- Images extracted from the above papers, as jpegs, bmps, etc.
- Heading/section-wise highlights extracted from the above papers, as a re-usable pure-python joblib dump (see the loading sketch after this list)
- Tables extracted from the papers (optional)
- Corpus of metadata highlights/text of the top 100 papers, as a re-usable pure-python joblib dump
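The joblib dumps can be reloaded in plain Python for downstream use (a minimal sketch; `highlights.dump` is a hypothetical filename, so check your `dump_dir` for the actual artifact names):

```python
import joblib

# Reload a highlights dump produced in dump_dir (defaults to arxiv_dumps/).
# NOTE: 'highlights.dump' is a hypothetical filename used for illustration only.
highlights = joblib.load('arxiv_dumps/highlights.dump')
print(type(highlights))
```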

This work builds upon these fantastic models and tools (for various NLP sub-tasks) made available to researchers and developers like us:

- https://huggingface.co/Callidior/bert2bert-base-arxiv-titlegen
- https://huggingface.co/allenai/scibert_scivocab_uncased
- https://huggingface.co/allenai/led-large-16384-arxiv
- https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2
- https://huggingface.co/sentence-transformers/distilbert-base-nli-mean-tokens
- https://tabula.technology/
- https://spacy.io/ and https://allenai.github.io/scispacy/

Please cite this repo if it helped you :)